2023-06-17 16:35:02,072 INFO [train.py:1064] (0/4) Training started
2023-06-17 16:35:02,078 INFO [train.py:1074] (0/4) Device: cuda:0
2023-06-17 16:35:04,224 INFO [lexicon.py:168] (0/4) Loading pre-compiled data/lang_char/Linv.pt
2023-06-17 16:35:04,471 INFO [train.py:1085] (0/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'c51a0b9684442a88ee37f3ce0af686a04b66855b', 'k2-git-date': 'Mon May 1 21:38:03 2023', 'lhotse-version': '1.14.0.dev+git.0f812851.dirty', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'zipformer_wenetspeech', 'icefall-git-sha1': '802bf98-dirty', 'icefall-git-date': 'Fri Jun 16 18:26:55 2023', 'icefall-path': '/star-kw/kangwei/code/icefall_wenetspeech', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/dev_tools/anaconda3/envs/rnnt2/lib/python3.8/site-packages/lhotse-1.14.0.dev0+git.0f812851.dirty-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-3-0423201227-84b4557756-8lx4n', 'IP address': '10.177.6.147'}, 'world_size': 4, 'master_port': 12537, 'tensorboard': True, 'num_epochs': 12, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_L_small_causal'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 900, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2023-06-17 16:35:04,471 INFO [train.py:1087] (0/4) About to create model
2023-06-17 16:35:05,080 INFO [train.py:1091] (0/4) Number of model parameters: 32669302
2023-06-17 16:35:10,311 INFO [train.py:1106] (0/4) Using DDP
2023-06-17 16:35:10,590 INFO [asr_datamodule.py:390] (0/4) About to get train cuts
2023-06-17 16:35:10,592 INFO [asr_datamodule.py:398] (0/4) About to get dev cuts
2023-06-17 16:35:10,594 INFO [asr_datamodule.py:211] (0/4) About to get Musan cuts
2023-06-17 16:35:13,515 INFO [asr_datamodule.py:216] (0/4) Enable MUSAN
2023-06-17 16:35:13,515 INFO [asr_datamodule.py:239] (0/4) Enable SpecAugment
2023-06-17 16:35:13,515 INFO [asr_datamodule.py:240] (0/4) Time warp factor: 80
2023-06-17 16:35:13,516 INFO [asr_datamodule.py:250] (0/4) Num frame mask: 10
2023-06-17 16:35:13,516 INFO [asr_datamodule.py:263] (0/4) About to create train dataset
2023-06-17 16:35:13,516 INFO [asr_datamodule.py:289] (0/4) Using DynamicBucketingSampler.
2023-06-17 16:35:17,422 INFO [asr_datamodule.py:305] (0/4) About to create train dataloader
2023-06-17 16:35:17,423 INFO [asr_datamodule.py:336] (0/4) About to create dev dataset
2023-06-17 16:35:18,164 INFO [asr_datamodule.py:354] (0/4) About to create dev dataloader
2023-06-17 16:37:08,030 INFO [train.py:996] (0/4) Epoch 1, batch 0, loss[loss=10.62, simple_loss=9.644, pruned_loss=9.761, over 21767.00 frames. ], tot_loss[loss=10.62, simple_loss=9.644, pruned_loss=9.761, over 21767.00 frames. ], batch size: 102, lr: 2.25e-02, grad_scale: 1.0
2023-06-17 16:37:08,034 INFO [train.py:1019] (0/4) Computing validation loss
2023-06-17 16:37:25,626 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=10.9, simple_loss=9.897, pruned_loss=10.04, over 1796401.00 frames.
2023-06-17 16:37:25,627 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 22900MB
2023-06-17 16:37:36,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=0.0, ans=0.3
2023-06-17 16:37:45,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60.0, ans=0.2994
2023-06-17 16:37:47,297 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.18 vs. limit=7.545
2023-06-17 16:37:56,672 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=112.07 vs. limit=7.5225
2023-06-17 16:38:13,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=83.24 vs. limit=5.03
2023-06-17 16:38:38,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=180.0, ans=0.098875
2023-06-17 16:38:39,243 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=120.37 vs. limit=5.045
2023-06-17 16:39:11,378 INFO [train.py:996] (0/4) Epoch 1, batch 50, loss[loss=0.9143, simple_loss=0.8097, pruned_loss=0.935, over 16973.00 frames. ], tot_loss[loss=4.166, simple_loss=3.847, pruned_loss=3.147, over 959557.71 frames. ], batch size: 62, lr: 2.48e-02, grad_scale: 0.5
2023-06-17 16:39:22,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=300.0, ans=0.4859375
2023-06-17 16:39:36,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=360.0, ans=0.8874000000000001
2023-06-17 16:39:39,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=35.34 vs. limit=4.144
2023-06-17 16:39:46,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=246.91 vs. limit=7.6575
2023-06-17 16:39:51,593 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=173.69 vs.
limit=7.815 2023-06-17 16:39:58,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=31.41 vs. limit=5.105 2023-06-17 16:39:58,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=61.23 vs. limit=5.105 2023-06-17 16:40:04,477 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=135.71 vs. limit=7.86 2023-06-17 16:40:06,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=73.54 vs. limit=7.68 2023-06-17 16:40:07,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=480.0, ans=0.097 2023-06-17 16:40:10,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=247.59 vs. limit=7.68 2023-06-17 16:40:10,099 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=197.92 vs. limit=7.68 2023-06-17 16:40:10,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=480.0, ans=0.29519999999999996 2023-06-17 16:40:50,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=97.22 vs. limit=7.7025 2023-06-17 16:40:52,300 INFO [train.py:996] (0/4) Epoch 1, batch 100, loss[loss=1.198, simple_loss=1.036, pruned_loss=1.297, over 21290.00 frames. ], tot_loss[loss=2.613, simple_loss=2.38, pruned_loss=2.166, over 1694636.16 frames. ], batch size: 176, lr: 2.70e-02, grad_scale: 1.0 2023-06-17 16:40:53,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.01 vs. limit=7.95 2023-06-17 16:40:53,593 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=42.52 vs. limit=7.725 2023-06-17 16:40:56,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.080e+02 2.341e+02 3.851e+02 6.975e+03 2.847e+04, threshold=7.702e+02, percent-clipped=0.0 2023-06-17 16:41:04,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=600.0, ans=0.294 2023-06-17 16:41:35,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=720.0, ans=0.46625 2023-06-17 16:41:52,254 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=230.68 vs. limit=7.7925 2023-06-17 16:42:01,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.03 vs. limit=8.085 2023-06-17 16:42:14,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.57 vs. 
limit=8.085 2023-06-17 16:42:30,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=840.0, ans=0.460625 2023-06-17 16:42:36,698 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=7.8375 2023-06-17 16:42:37,169 INFO [train.py:996] (0/4) Epoch 1, batch 150, loss[loss=0.959, simple_loss=0.8204, pruned_loss=1.01, over 21146.00 frames. ], tot_loss[loss=2.012, simple_loss=1.809, pruned_loss=1.779, over 2272036.62 frames. ], batch size: 143, lr: 2.93e-02, grad_scale: 1.0 2023-06-17 16:42:50,428 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=254.05 vs. limit=7.8375 2023-06-17 16:42:54,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=105.68 vs. limit=7.86 2023-06-17 16:43:02,803 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=61.06 vs. limit=7.86 2023-06-17 16:43:15,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=114.71 vs. limit=7.8825 2023-06-17 16:43:24,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1020.0, ans=0.4521875 2023-06-17 16:43:27,034 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=64.07 vs. limit=5.51 2023-06-17 16:43:49,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1080.0, ans=0.365 2023-06-17 16:43:50,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=35.76 vs. limit=7.905 2023-06-17 16:43:53,349 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=54.39 vs. limit=7.905 2023-06-17 16:44:00,656 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.65 vs. limit=8.31 2023-06-17 16:44:01,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1080.0, ans=0.046625 2023-06-17 16:44:13,985 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 16:44:26,093 INFO [train.py:996] (0/4) Epoch 1, batch 200, loss[loss=1.094, simple_loss=0.9434, pruned_loss=1.041, over 21537.00 frames. ], tot_loss[loss=1.681, simple_loss=1.496, pruned_loss=1.529, over 2716027.32 frames. ], batch size: 414, lr: 3.15e-02, grad_scale: 2.0 2023-06-17 16:44:29,945 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 7.013e+01 1.220e+02 1.520e+02 2.087e+02 3.052e+02, threshold=3.040e+02, percent-clipped=0.0 2023-06-17 16:44:31,132 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.52 vs. 
limit=8.4 2023-06-17 16:44:42,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.79 vs. limit=5.6 2023-06-17 16:44:47,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=46.38 vs. limit=7.9725 2023-06-17 16:44:53,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=57.18 vs. limit=7.9725 2023-06-17 16:44:58,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=34.85 vs. limit=7.9725 2023-06-17 16:45:10,985 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.41 vs. limit=8.49 2023-06-17 16:45:41,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=180.97 vs. limit=8.0175 2023-06-17 16:46:08,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=21.07 vs. limit=8.04 2023-06-17 16:46:12,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=84.95 vs. limit=8.04 2023-06-17 16:46:16,419 INFO [train.py:996] (0/4) Epoch 1, batch 250, loss[loss=0.9764, simple_loss=0.8338, pruned_loss=0.9156, over 21611.00 frames. ], tot_loss[loss=1.476, simple_loss=1.303, pruned_loss=1.353, over 3061799.94 frames. ], batch size: 230, lr: 3.38e-02, grad_scale: 2.0 2023-06-17 16:46:21,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=20.61 vs. limit=8.625 2023-06-17 16:46:33,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1560.0, ans=0.2844 2023-06-17 16:46:41,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=22.40 vs. limit=8.085 2023-06-17 16:46:48,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=8.67 2023-06-17 16:46:54,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1620.0, ans=0.4240625 2023-06-17 16:47:24,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1680.0, ans=0.42125 2023-06-17 16:47:37,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.43 vs. limit=8.76 2023-06-17 16:47:46,489 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=21.31 vs. 
limit=8.1525 2023-06-17 16:47:49,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1740.0, ans=0.28259999999999996 2023-06-17 16:48:03,336 INFO [train.py:996] (0/4) Epoch 1, batch 300, loss[loss=0.8536, simple_loss=0.7225, pruned_loss=0.7879, over 21833.00 frames. ], tot_loss[loss=1.325, simple_loss=1.162, pruned_loss=1.215, over 3333105.90 frames. ], batch size: 107, lr: 3.60e-02, grad_scale: 4.0 2023-06-17 16:48:07,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 9.171e+01 1.173e+02 1.354e+02 1.820e+02 4.361e+02, threshold=2.708e+02, percent-clipped=2.0 2023-06-17 16:48:09,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1800.0, ans=0.035 2023-06-17 16:48:17,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=8.175 2023-06-17 16:48:35,172 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.09 vs. limit=8.1975 2023-06-17 16:48:36,968 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=8.895 2023-06-17 16:49:07,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=4.792 2023-06-17 16:49:10,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1980.0, ans=0.4071875 2023-06-17 16:49:35,614 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.24 vs. limit=8.265 2023-06-17 16:49:41,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.57 vs. limit=8.265 2023-06-17 16:49:45,969 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=9.03 2023-06-17 16:49:48,616 INFO [train.py:996] (0/4) Epoch 1, batch 350, loss[loss=0.8175, simple_loss=0.6884, pruned_loss=0.7348, over 21347.00 frames. ], tot_loss[loss=1.213, simple_loss=1.056, pruned_loss=1.107, over 3549084.39 frames. ], batch size: 131, lr: 3.83e-02, grad_scale: 4.0 2023-06-17 16:49:50,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2100.0, ans=0.2375 2023-06-17 16:49:53,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=9.075 2023-06-17 16:50:03,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=14.55 vs. limit=5.0 2023-06-17 16:50:07,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=19.87 vs. 
limit=8.31 2023-06-17 16:50:20,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2160.0, ans=0.051399999999999994 2023-06-17 16:50:20,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2160.0, ans=0.2784 2023-06-17 16:50:28,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.42 vs. limit=8.3325 2023-06-17 16:50:30,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=175.69 vs. limit=8.3325 2023-06-17 16:50:49,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=171.59 vs. limit=8.3325 2023-06-17 16:51:13,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2280.0, ans=0.393125 2023-06-17 16:51:19,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.28 vs. limit=9.254999999999999 2023-06-17 16:51:19,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2340.0, ans=0.3903125 2023-06-17 16:51:36,406 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=24.90 vs. limit=8.4 2023-06-17 16:51:36,720 INFO [train.py:996] (0/4) Epoch 1, batch 400, loss[loss=0.806, simple_loss=0.6743, pruned_loss=0.7106, over 21183.00 frames. ], tot_loss[loss=1.125, simple_loss=0.9738, pruned_loss=1.02, over 3707120.31 frames. ], batch size: 143, lr: 4.05e-02, grad_scale: 8.0 2023-06-17 16:51:38,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=2400.0, ans=9.3 2023-06-17 16:51:39,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=2400.0, ans=0.085 2023-06-17 16:51:39,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=4.96 2023-06-17 16:51:40,350 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 8.615e+01 1.452e+02 1.814e+02 2.451e+02 4.544e+02, threshold=3.628e+02, percent-clipped=11.0 2023-06-17 16:51:42,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=2400.0, ans=0.23600000000000002 2023-06-17 16:52:00,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=8.4225 2023-06-17 16:52:04,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=144.47 vs. 
limit=8.4225 2023-06-17 16:52:26,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2520.0, ans=8.445 2023-06-17 16:52:59,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2580.0, ans=0.1775 2023-06-17 16:53:26,161 INFO [train.py:996] (0/4) Epoch 1, batch 450, loss[loss=0.8045, simple_loss=0.6722, pruned_loss=0.6868, over 21185.00 frames. ], tot_loss[loss=1.069, simple_loss=0.9185, pruned_loss=0.9589, over 3841884.48 frames. ], batch size: 177, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 16:53:46,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2760.0, ans=0.370625 2023-06-17 16:53:53,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2760.0, ans=0.037899999999999996 2023-06-17 16:53:53,919 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.47 vs. limit=8.535 2023-06-17 16:53:54,031 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=8.535 2023-06-17 16:54:08,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2820.0, ans=0.3678125 2023-06-17 16:54:20,173 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=20.15 vs. limit=8.557500000000001 2023-06-17 16:54:30,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2820.0, ans=0.07 2023-06-17 16:54:31,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2820.0, ans=0.09425 2023-06-17 16:54:41,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=20.49 vs. limit=8.58 2023-06-17 16:55:00,748 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.14 vs. limit=8.6025 2023-06-17 16:55:10,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2940.0, ans=0.1325 2023-06-17 16:55:15,281 INFO [train.py:996] (0/4) Epoch 1, batch 500, loss[loss=0.9233, simple_loss=0.7771, pruned_loss=0.7484, over 19883.00 frames. ], tot_loss[loss=1.05, simple_loss=0.8969, pruned_loss=0.9268, over 3937347.65 frames. ], batch size: 703, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 16:55:19,005 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 9.969e+01 1.768e+02 2.484e+02 3.323e+02 7.392e+02, threshold=4.968e+02, percent-clipped=16.0 2023-06-17 16:56:37,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3180.0, ans=0.2682 2023-06-17 16:56:40,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.18 vs. 
limit=9.885 2023-06-17 16:56:58,941 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.51 vs. limit=8.715 2023-06-17 16:57:02,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=8.7375 2023-06-17 16:57:02,950 INFO [train.py:996] (0/4) Epoch 1, batch 550, loss[loss=1.011, simple_loss=0.8536, pruned_loss=0.79, over 21621.00 frames. ], tot_loss[loss=1.018, simple_loss=0.8667, pruned_loss=0.8806, over 4017338.33 frames. ], batch size: 441, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 16:57:34,963 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.91 vs. limit=10.02 2023-06-17 16:58:08,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.69 vs. limit=8.7825 2023-06-17 16:58:27,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=3480.0, ans=0.33687500000000004 2023-06-17 16:58:46,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=3540.0, ans=0.33406250000000004 2023-06-17 16:58:46,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.96 vs. limit=5.885 2023-06-17 16:58:50,985 INFO [train.py:996] (0/4) Epoch 1, batch 600, loss[loss=0.7167, simple_loss=0.6102, pruned_loss=0.535, over 22005.00 frames. ], tot_loss[loss=0.9863, simple_loss=0.839, pruned_loss=0.8328, over 4080785.27 frames. ], batch size: 103, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 16:58:54,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 2.961e+02 3.893e+02 6.488e+02 1.570e+03, threshold=7.787e+02, percent-clipped=36.0 2023-06-17 16:59:08,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=10.245000000000001 2023-06-17 16:59:11,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=3660.0, ans=0.3284375 2023-06-17 16:59:25,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=3660.0, ans=0.06274999999999997 2023-06-17 16:59:25,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.21 vs. limit=6.83 2023-06-17 16:59:37,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=10.29 2023-06-17 17:00:15,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=3780.0, ans=0.014949999999999991 2023-06-17 17:00:17,947 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.35 vs. limit=8.9175 2023-06-17 17:00:36,771 INFO [train.py:996] (0/4) Epoch 1, batch 650, loss[loss=0.9426, simple_loss=0.8189, pruned_loss=0.6541, over 19746.00 frames. 
], tot_loss[loss=0.955, simple_loss=0.8129, pruned_loss=0.7849, over 4113183.44 frames. ], batch size: 703, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:01:01,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=8.985 2023-06-17 17:01:25,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=9.0075 2023-06-17 17:01:36,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=4020.0, ans=0.31156249999999996 2023-06-17 17:01:50,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=4080.0, ans=0.30874999999999997 2023-06-17 17:02:21,261 INFO [train.py:996] (0/4) Epoch 1, batch 700, loss[loss=0.72, simple_loss=0.6196, pruned_loss=0.503, over 21851.00 frames. ], tot_loss[loss=0.9158, simple_loss=0.7807, pruned_loss=0.7329, over 4149803.13 frames. ], batch size: 118, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:02:24,618 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.523e+02 4.078e+02 5.855e+02 9.456e+02 2.667e+03, threshold=1.171e+03, percent-clipped=39.0 2023-06-17 17:02:31,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=4200.0, ans=0.303125 2023-06-17 17:02:38,733 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=6.738e+01 2023-06-17 17:02:59,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=4260.0, ans=0.7509 2023-06-17 17:03:38,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=9.1425 2023-06-17 17:03:39,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=4380.0, ans=0.2946875 2023-06-17 17:04:06,001 INFO [train.py:996] (0/4) Epoch 1, batch 750, loss[loss=0.6578, simple_loss=0.5579, pruned_loss=0.4671, over 21854.00 frames. ], tot_loss[loss=0.8735, simple_loss=0.7461, pruned_loss=0.6808, over 4185740.40 frames. ], batch size: 98, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:04:44,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=4560.0, ans=0.28625 2023-06-17 17:05:04,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=4620.0, ans=0.0 2023-06-17 17:05:15,567 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=10.965 2023-06-17 17:05:32,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=4680.0, ans=0.280625 2023-06-17 17:05:34,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=5.896 2023-06-17 17:05:37,924 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.21 vs. 
limit=6.1850000000000005 2023-06-17 17:05:40,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=4740.0, ans=0.7341 2023-06-17 17:05:51,444 INFO [train.py:996] (0/4) Epoch 1, batch 800, loss[loss=0.6153, simple_loss=0.5409, pruned_loss=0.3969, over 21703.00 frames. ], tot_loss[loss=0.8354, simple_loss=0.7153, pruned_loss=0.6344, over 4203821.23 frames. ], batch size: 124, lr: 4.49e-02, grad_scale: 16.0 2023-06-17 17:05:52,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=4800.0, ans=0.272 2023-06-17 17:05:53,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=4800.0, ans=0.275 2023-06-17 17:05:54,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 4.402e+02 7.390e+02 1.255e+03 3.583e+03, threshold=1.478e+03, percent-clipped=27.0 2023-06-17 17:06:24,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=4860.0, ans=0.2721875 2023-06-17 17:06:24,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=4860.0, ans=6.215 2023-06-17 17:06:34,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=4920.0, ans=0.7278 2023-06-17 17:06:34,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=4920.0, ans=0.04616666666666667 2023-06-17 17:06:56,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.35 vs. limit=7.46 2023-06-17 17:07:35,946 INFO [train.py:996] (0/4) Epoch 1, batch 850, loss[loss=0.5799, simple_loss=0.5036, pruned_loss=0.3798, over 21111.00 frames. ], tot_loss[loss=0.7976, simple_loss=0.6847, pruned_loss=0.5907, over 4226370.49 frames. ], batch size: 143, lr: 4.49e-02, grad_scale: 16.0 2023-06-17 17:07:45,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=5100.0, ans=0.04541666666666667 2023-06-17 17:08:40,384 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.51 vs. limit=6.305 2023-06-17 17:08:48,713 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=9.48 2023-06-17 17:09:11,464 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.60 vs. limit=7.67 2023-06-17 17:09:19,823 INFO [train.py:996] (0/4) Epoch 1, batch 900, loss[loss=0.7007, simple_loss=0.6021, pruned_loss=0.4633, over 21730.00 frames. ], tot_loss[loss=0.7617, simple_loss=0.6565, pruned_loss=0.5494, over 4242784.35 frames. 
], batch size: 389, lr: 4.48e-02, grad_scale: 16.0 2023-06-17 17:09:23,171 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 4.491e+02 8.246e+02 1.178e+03 2.944e+03, threshold=1.649e+03, percent-clipped=18.0 2023-06-17 17:09:33,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=5400.0, ans=0.246875 2023-06-17 17:09:43,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=5460.0, ans=0.7089000000000001 2023-06-17 17:11:01,074 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.87 vs. limit=3.846 2023-06-17 17:11:05,170 INFO [train.py:996] (0/4) Epoch 1, batch 950, loss[loss=0.6264, simple_loss=0.5428, pruned_loss=0.4029, over 21543.00 frames. ], tot_loss[loss=0.7365, simple_loss=0.6368, pruned_loss=0.5183, over 4257040.97 frames. ], batch size: 548, lr: 4.48e-02, grad_scale: 16.0 2023-06-17 17:11:24,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=5700.0, ans=0.23281249999999998 2023-06-17 17:11:46,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5760.0, ans=0.2424 2023-06-17 17:12:26,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=5940.0, ans=0.2215625 2023-06-17 17:12:36,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=5940.0, ans=0.2215625 2023-06-17 17:12:44,946 INFO [train.py:996] (0/4) Epoch 1, batch 1000, loss[loss=0.6289, simple_loss=0.5543, pruned_loss=0.3874, over 21792.00 frames. ], tot_loss[loss=0.7139, simple_loss=0.6195, pruned_loss=0.4905, over 4260224.52 frames. ], batch size: 124, lr: 4.48e-02, grad_scale: 8.0 2023-06-17 17:12:48,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=6000.0, ans=0.29000000000000004 2023-06-17 17:12:50,048 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 4.573e+02 9.444e+02 1.523e+03 4.461e+03, threshold=1.889e+03, percent-clipped=19.0 2023-06-17 17:13:02,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=6.4 2023-06-17 17:13:27,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=6060.0, ans=0.2394 2023-06-17 17:14:03,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=6180.0, ans=0.2103125 2023-06-17 17:14:24,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=6240.0, ans=0.20750000000000002 2023-06-17 17:14:30,042 INFO [train.py:996] (0/4) Epoch 1, batch 1050, loss[loss=0.5564, simple_loss=0.4943, pruned_loss=0.3351, over 21453.00 frames. ], tot_loss[loss=0.6926, simple_loss=0.6038, pruned_loss=0.4645, over 4266564.07 frames. 
], batch size: 212, lr: 4.48e-02, grad_scale: 8.0 2023-06-17 17:14:41,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=6300.0, ans=0.20468750000000002 2023-06-17 17:15:14,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6360.0, ans=0.2364 2023-06-17 17:15:53,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=6480.0, ans=0.009460869565217392 2023-06-17 17:16:09,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6540.0, ans=0.23459999999999998 2023-06-17 17:16:19,607 INFO [train.py:996] (0/4) Epoch 1, batch 1100, loss[loss=0.7155, simple_loss=0.64, pruned_loss=0.4226, over 21536.00 frames. ], tot_loss[loss=0.6683, simple_loss=0.5859, pruned_loss=0.4378, over 4263862.08 frames. ], batch size: 471, lr: 4.48e-02, grad_scale: 8.0 2023-06-17 17:16:20,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.70 vs. limit=6.65 2023-06-17 17:16:24,916 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.653e+02 4.618e+02 6.760e+02 9.652e+02 3.048e+03, threshold=1.352e+03, percent-clipped=4.0 2023-06-17 17:16:34,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=6600.0, ans=0.058750000000000004 2023-06-17 17:16:44,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6600.0, ans=0.23399999999999999 2023-06-17 17:16:50,328 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.21 vs. limit=3.999 2023-06-17 17:17:15,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=6720.0, ans=0.0 2023-06-17 17:17:21,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.64 vs. limit=6.68 2023-06-17 17:17:37,977 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=10.0425 2023-06-17 17:18:16,868 INFO [train.py:996] (0/4) Epoch 1, batch 1150, loss[loss=0.5624, simple_loss=0.5079, pruned_loss=0.3247, over 21829.00 frames. ], tot_loss[loss=0.6477, simple_loss=0.5705, pruned_loss=0.4153, over 4260262.49 frames. ], batch size: 118, lr: 4.47e-02, grad_scale: 8.0 2023-06-17 17:18:50,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=6960.0, ans=0.17375000000000002 2023-06-17 17:18:53,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=7020.0, ans=0.009343478260869566 2023-06-17 17:19:05,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=7020.0, ans=0.009343478260869566 2023-06-17 17:20:04,110 INFO [train.py:996] (0/4) Epoch 1, batch 1200, loss[loss=0.6072, simple_loss=0.5546, pruned_loss=0.3421, over 21278.00 frames. 
], tot_loss[loss=0.6374, simple_loss=0.5637, pruned_loss=0.4008, over 4269439.73 frames. ], batch size: 548, lr: 4.47e-02, grad_scale: 16.0 2023-06-17 17:20:09,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 4.949e+02 7.827e+02 1.470e+03 3.073e+03, threshold=1.565e+03, percent-clipped=26.0 2023-06-17 17:20:49,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=7320.0, ans=0.156875 2023-06-17 17:21:50,149 INFO [train.py:996] (0/4) Epoch 1, batch 1250, loss[loss=0.6005, simple_loss=0.5375, pruned_loss=0.3485, over 21502.00 frames. ], tot_loss[loss=0.6289, simple_loss=0.5585, pruned_loss=0.3882, over 4276594.19 frames. ], batch size: 548, lr: 4.47e-02, grad_scale: 8.0 2023-06-17 17:23:24,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=10.4025 2023-06-17 17:23:30,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=7740.0, ans=0.6291 2023-06-17 17:23:34,688 INFO [train.py:996] (0/4) Epoch 1, batch 1300, loss[loss=0.5519, simple_loss=0.4734, pruned_loss=0.3392, over 20874.00 frames. ], tot_loss[loss=0.6198, simple_loss=0.5527, pruned_loss=0.3763, over 4283589.74 frames. ], batch size: 608, lr: 4.47e-02, grad_scale: 8.0 2023-06-17 17:23:46,459 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.014e+02 6.355e+02 9.383e+02 1.437e+03 4.251e+03, threshold=1.877e+03, percent-clipped=19.0 2023-06-17 17:23:55,511 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=1.332e+00 2023-06-17 17:24:08,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=7860.0, ans=0.13156250000000003 2023-06-17 17:24:28,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=7980.0, ans=0.12593749999999998 2023-06-17 17:24:30,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=10.4925 2023-06-17 17:24:31,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=7980.0, ans=0.6207 2023-06-17 17:25:18,173 INFO [train.py:996] (0/4) Epoch 1, batch 1350, loss[loss=0.5504, simple_loss=0.5056, pruned_loss=0.3045, over 21837.00 frames. ], tot_loss[loss=0.6091, simple_loss=0.5459, pruned_loss=0.3636, over 4285518.39 frames. ], batch size: 332, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:25:41,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=8160.0, ans=0.07 2023-06-17 17:27:04,267 INFO [train.py:996] (0/4) Epoch 1, batch 1400, loss[loss=0.7614, simple_loss=0.6656, pruned_loss=0.4502, over 21571.00 frames. ], tot_loss[loss=0.5969, simple_loss=0.5369, pruned_loss=0.3516, over 4281607.47 frames. 
], batch size: 507, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:27:16,084 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 4.966e+02 8.167e+02 1.163e+03 2.690e+03, threshold=1.633e+03, percent-clipped=5.0 2023-06-17 17:27:29,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=8460.0, ans=0.125 2023-06-17 17:27:44,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=8520.0, ans=0.6018 2023-06-17 17:28:07,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=8580.0, ans=0.1 2023-06-17 17:28:14,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8580.0, ans=0.2142 2023-06-17 17:28:14,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=8580.0, ans=0.5997 2023-06-17 17:28:47,824 INFO [train.py:996] (0/4) Epoch 1, batch 1450, loss[loss=0.5851, simple_loss=0.5307, pruned_loss=0.328, over 21568.00 frames. ], tot_loss[loss=0.5905, simple_loss=0.5329, pruned_loss=0.3437, over 4281464.15 frames. ], batch size: 230, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:29:00,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8700.0, ans=0.213 2023-06-17 17:29:16,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=8760.0, ans=0.125 2023-06-17 17:30:19,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=8940.0, ans=0.5871 2023-06-17 17:30:31,058 INFO [train.py:996] (0/4) Epoch 1, batch 1500, loss[loss=0.5777, simple_loss=0.527, pruned_loss=0.3203, over 21825.00 frames. ], tot_loss[loss=0.5873, simple_loss=0.5314, pruned_loss=0.3383, over 4284132.77 frames. ], batch size: 371, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:30:42,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.853e+02 4.683e+02 9.412e+02 1.321e+03 2.952e+03, threshold=1.882e+03, percent-clipped=11.0 2023-06-17 17:31:09,486 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=10.92 2023-06-17 17:32:00,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=9240.0, ans=0.125 2023-06-17 17:32:05,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=9240.0, ans=0.02816666666666667 2023-06-17 17:32:16,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=9240.0, ans=0.5766 2023-06-17 17:32:22,534 INFO [train.py:996] (0/4) Epoch 1, batch 1550, loss[loss=0.4586, simple_loss=0.4471, pruned_loss=0.2317, over 21173.00 frames. ], tot_loss[loss=0.5739, simple_loss=0.5223, pruned_loss=0.3263, over 4286213.35 frames. 
], batch size: 143, lr: 4.45e-02, grad_scale: 8.0 2023-06-17 17:32:42,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=9360.0, ans=0.125 2023-06-17 17:32:46,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=9360.0, ans=0.125 2023-06-17 17:32:50,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9360.0, ans=0.2064 2023-06-17 17:33:01,539 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.96 vs. limit=14.565000000000001 2023-06-17 17:33:59,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=9540.0, ans=0.125 2023-06-17 17:34:01,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=9540.0, ans=0.125 2023-06-17 17:34:02,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=9540.0, ans=0.2 2023-06-17 17:34:09,505 INFO [train.py:996] (0/4) Epoch 1, batch 1600, loss[loss=0.5197, simple_loss=0.5044, pruned_loss=0.2647, over 21704.00 frames. ], tot_loss[loss=0.5646, simple_loss=0.516, pruned_loss=0.3177, over 4284368.94 frames. ], batch size: 263, lr: 4.45e-02, grad_scale: 16.0 2023-06-17 17:34:14,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=9600.0, ans=0.125 2023-06-17 17:34:15,874 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.713e+02 5.768e+02 7.778e+02 1.283e+03 4.290e+03, threshold=1.556e+03, percent-clipped=12.0 2023-06-17 17:34:59,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.39 vs. limit=11.145 2023-06-17 17:35:10,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9720.0, ans=0.20279999999999998 2023-06-17 17:35:18,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=9780.0, ans=0.5577000000000001 2023-06-17 17:35:53,566 INFO [train.py:996] (0/4) Epoch 1, batch 1650, loss[loss=0.7286, simple_loss=0.63, pruned_loss=0.4255, over 21449.00 frames. ], tot_loss[loss=0.5553, simple_loss=0.5105, pruned_loss=0.3088, over 4284229.45 frames. ], batch size: 471, lr: 4.45e-02, grad_scale: 8.0 2023-06-17 17:35:55,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=9900.0, ans=0.02541666666666667 2023-06-17 17:36:02,724 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.82 vs. limit=14.925 2023-06-17 17:37:33,267 INFO [train.py:996] (0/4) Epoch 1, batch 1700, loss[loss=0.5697, simple_loss=0.5289, pruned_loss=0.307, over 21604.00 frames. ], tot_loss[loss=0.5551, simple_loss=0.5124, pruned_loss=0.306, over 4284449.12 frames. 
], batch size: 414, lr: 4.44e-02, grad_scale: 8.0 2023-06-17 17:37:41,959 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.683e+02 4.849e+02 8.667e+02 1.230e+03 2.717e+03, threshold=1.733e+03, percent-clipped=16.0 2023-06-17 17:38:25,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=10320.0, ans=10.0 2023-06-17 17:38:58,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=10380.0, ans=0.0 2023-06-17 17:39:19,732 INFO [train.py:996] (0/4) Epoch 1, batch 1750, loss[loss=0.3315, simple_loss=0.3422, pruned_loss=0.1567, over 21328.00 frames. ], tot_loss[loss=0.5449, simple_loss=0.5084, pruned_loss=0.2958, over 4288718.50 frames. ], batch size: 176, lr: 4.44e-02, grad_scale: 8.0 2023-06-17 17:39:22,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=10500.0, ans=0.025 2023-06-17 17:39:39,256 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.58 vs. limit=15.375 2023-06-17 17:40:07,168 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=11.46 2023-06-17 17:40:20,734 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.42 vs. limit=4.593 2023-06-17 17:40:34,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=10680.0, ans=0.125 2023-06-17 17:40:40,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=10680.0, ans=0.5262 2023-06-17 17:41:18,157 INFO [train.py:996] (0/4) Epoch 1, batch 1800, loss[loss=0.681, simple_loss=0.625, pruned_loss=0.3706, over 21417.00 frames. ], tot_loss[loss=0.5313, simple_loss=0.4996, pruned_loss=0.2852, over 4281799.75 frames. ], batch size: 507, lr: 4.44e-02, grad_scale: 8.0 2023-06-17 17:41:19,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.46 vs. limit=15.6 2023-06-17 17:41:26,432 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 4.676e+02 7.112e+02 1.184e+03 2.740e+03, threshold=1.422e+03, percent-clipped=6.0 2023-06-17 17:41:32,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.74 vs. limit=7.7 2023-06-17 17:41:56,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=10920.0, ans=0.008495652173913043 2023-06-17 17:42:00,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10920.0, ans=0.19079999999999997 2023-06-17 17:42:08,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=10920.0, ans=0.09899000000000001 2023-06-17 17:43:02,050 INFO [train.py:996] (0/4) Epoch 1, batch 1850, loss[loss=0.4789, simple_loss=0.4392, pruned_loss=0.2604, over 20330.00 frames. ], tot_loss[loss=0.5268, simple_loss=0.5, pruned_loss=0.2794, over 4279918.15 frames. 
], batch size: 702, lr: 4.43e-02, grad_scale: 8.0 2023-06-17 17:43:12,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=11100.0, ans=0.02041666666666667 2023-06-17 17:44:05,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=11280.0, ans=0.125 2023-06-17 17:44:29,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=11340.0, ans=0.008404347826086957 2023-06-17 17:44:35,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=11340.0, ans=0.125 2023-06-17 17:44:39,750 INFO [train.py:996] (0/4) Epoch 1, batch 1900, loss[loss=0.4302, simple_loss=0.4217, pruned_loss=0.2186, over 21669.00 frames. ], tot_loss[loss=0.5212, simple_loss=0.4956, pruned_loss=0.2754, over 4287399.31 frames. ], batch size: 298, lr: 4.43e-02, grad_scale: 8.0 2023-06-17 17:44:45,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=11400.0, ans=0.01916666666666667 2023-06-17 17:44:47,641 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.112e+02 4.687e+02 6.940e+02 1.118e+03 3.518e+03, threshold=1.388e+03, percent-clipped=15.0 2023-06-17 17:45:12,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=11460.0, ans=0.125 2023-06-17 17:45:15,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=11520.0, ans=0.125 2023-06-17 17:45:26,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=11520.0, ans=0.0 2023-06-17 17:45:39,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=11580.0, ans=0.125 2023-06-17 17:46:15,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.52 vs. limit=10.82 2023-06-17 17:46:22,599 INFO [train.py:996] (0/4) Epoch 1, batch 1950, loss[loss=0.4661, simple_loss=0.4323, pruned_loss=0.2501, over 21793.00 frames. ], tot_loss[loss=0.513, simple_loss=0.4875, pruned_loss=0.2707, over 4277688.08 frames. ], batch size: 372, lr: 4.43e-02, grad_scale: 4.0 2023-06-17 17:46:23,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=11700.0, ans=0.125 2023-06-17 17:46:49,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=11760.0, ans=0.4884 2023-06-17 17:47:56,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=11940.0, ans=0.01691666666666667 2023-06-17 17:48:00,904 INFO [train.py:996] (0/4) Epoch 1, batch 2000, loss[loss=0.4682, simple_loss=0.4789, pruned_loss=0.2287, over 21762.00 frames. ], tot_loss[loss=0.5001, simple_loss=0.479, pruned_loss=0.2617, over 4275190.83 frames. 
], batch size: 332, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 17:48:15,565 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.181e+02 5.337e+02 7.170e+02 1.145e+03 2.393e+03, threshold=1.434e+03, percent-clipped=15.0 2023-06-17 17:48:29,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=12060.0, ans=0.008247826086956522 2023-06-17 17:48:40,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.74 vs. limit=16.59 2023-06-17 17:48:41,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=12120.0, ans=0.125 2023-06-17 17:48:45,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=12.045 2023-06-17 17:49:33,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=12240.0, ans=0.17759999999999998 2023-06-17 17:49:37,877 INFO [train.py:996] (0/4) Epoch 1, batch 2050, loss[loss=0.4583, simple_loss=0.4515, pruned_loss=0.2326, over 21577.00 frames. ], tot_loss[loss=0.5027, simple_loss=0.4822, pruned_loss=0.2625, over 4276651.87 frames. ], batch size: 263, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 17:49:38,884 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.1125 2023-06-17 17:50:19,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12420.0, ans=0.17579999999999998 2023-06-17 17:50:52,733 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=12.18 2023-06-17 17:51:00,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=12540.0, ans=0.125 2023-06-17 17:51:15,794 INFO [train.py:996] (0/4) Epoch 1, batch 2100, loss[loss=0.3817, simple_loss=0.374, pruned_loss=0.1947, over 21414.00 frames. ], tot_loss[loss=0.5076, simple_loss=0.487, pruned_loss=0.2648, over 4282190.95 frames. ], batch size: 212, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 17:51:31,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.912e+02 5.111e+02 7.540e+02 1.226e+03 2.396e+03, threshold=1.508e+03, percent-clipped=15.0 2023-06-17 17:51:36,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=12660.0, ans=0.125 2023-06-17 17:52:00,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=12720.0, ans=0.125 2023-06-17 17:52:01,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=12720.0, ans=0.125 2023-06-17 17:53:00,929 INFO [train.py:996] (0/4) Epoch 1, batch 2150, loss[loss=0.4733, simple_loss=0.4486, pruned_loss=0.249, over 21348.00 frames. ], tot_loss[loss=0.5034, simple_loss=0.4831, pruned_loss=0.2624, over 4278524.12 frames. 
], batch size: 131, lr: 4.41e-02, grad_scale: 8.0 2023-06-17 17:53:19,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=12900.0, ans=0.125 2023-06-17 17:53:31,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=12960.0, ans=0.02 2023-06-17 17:53:34,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=12960.0, ans=0.07 2023-06-17 17:53:39,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13020.0, ans=0.16979999999999998 2023-06-17 17:53:46,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=13020.0, ans=0.125 2023-06-17 17:54:24,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=13080.0, ans=0.012166666666666673 2023-06-17 17:54:41,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.94 vs. limit=8.285 2023-06-17 17:54:44,032 INFO [train.py:996] (0/4) Epoch 1, batch 2200, loss[loss=0.4809, simple_loss=0.4834, pruned_loss=0.2392, over 21762.00 frames. ], tot_loss[loss=0.4993, simple_loss=0.4829, pruned_loss=0.2582, over 4278015.13 frames. ], batch size: 298, lr: 4.41e-02, grad_scale: 8.0 2023-06-17 17:54:51,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=13200.0, ans=0.008 2023-06-17 17:54:59,359 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 5.225e+02 6.882e+02 1.154e+03 2.681e+03, threshold=1.376e+03, percent-clipped=19.0 2023-06-17 17:55:03,277 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 17:56:24,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=13440.0, ans=0.0 2023-06-17 17:56:34,503 INFO [train.py:996] (0/4) Epoch 1, batch 2250, loss[loss=0.5405, simple_loss=0.4875, pruned_loss=0.2968, over 21332.00 frames. ], tot_loss[loss=0.4866, simple_loss=0.4753, pruned_loss=0.2493, over 4275569.38 frames. ], batch size: 471, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 17:56:53,734 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.24 vs. limit=17.67 2023-06-17 17:57:20,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=13620.0, ans=12.6075 2023-06-17 17:57:36,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=13680.0, ans=0.0 2023-06-17 17:57:37,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=13680.0, ans=0.125 2023-06-17 17:58:19,055 INFO [train.py:996] (0/4) Epoch 1, batch 2300, loss[loss=0.4319, simple_loss=0.4178, pruned_loss=0.2231, over 21183.00 frames. ], tot_loss[loss=0.4794, simple_loss=0.4686, pruned_loss=0.2454, over 4267058.97 frames. 
], batch size: 176, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 17:58:29,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.055e+02 5.278e+02 8.077e+02 1.161e+03 3.244e+03, threshold=1.615e+03, percent-clipped=15.0 2023-06-17 17:58:49,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=13860.0, ans=0.125 2023-06-17 17:59:03,162 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 17:59:43,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=14040.0, ans=0.007817391304347826 2023-06-17 17:59:52,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=14040.0, ans=0.125 2023-06-17 18:00:03,819 INFO [train.py:996] (0/4) Epoch 1, batch 2350, loss[loss=0.5538, simple_loss=0.5814, pruned_loss=0.2632, over 20713.00 frames. ], tot_loss[loss=0.4748, simple_loss=0.4633, pruned_loss=0.2433, over 4260087.29 frames. ], batch size: 607, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 18:00:05,146 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.17 vs. limit=12.05 2023-06-17 18:00:36,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=14220.0, ans=0.125 2023-06-17 18:01:15,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=14280.0, ans=10.0 2023-06-17 18:01:15,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=14280.0, ans=0.125 2023-06-17 18:01:27,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=14340.0, ans=0.125 2023-06-17 18:01:47,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=14400.0, ans=0.125 2023-06-17 18:01:49,084 INFO [train.py:996] (0/4) Epoch 1, batch 2400, loss[loss=0.5116, simple_loss=0.4862, pruned_loss=0.2685, over 21556.00 frames. ], tot_loss[loss=0.4833, simple_loss=0.4712, pruned_loss=0.2479, over 4263344.52 frames. ], batch size: 441, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:01:59,365 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 4.626e+02 8.072e+02 1.275e+03 2.674e+03, threshold=1.614e+03, percent-clipped=13.0 2023-06-17 18:02:54,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=14580.0, ans=0.0077 2023-06-17 18:02:56,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=14580.0, ans=0.125 2023-06-17 18:03:02,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=14580.0, ans=0.125 2023-06-17 18:03:33,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=14700.0, ans=0.125 2023-06-17 18:03:34,307 INFO [train.py:996] (0/4) Epoch 1, batch 2450, loss[loss=0.3914, simple_loss=0.3928, pruned_loss=0.1951, over 21398.00 frames. 
], tot_loss[loss=0.4905, simple_loss=0.4769, pruned_loss=0.2522, over 4268230.77 frames. ], batch size: 212, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:03:59,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=14760.0, ans=0.3834000000000001 2023-06-17 18:04:03,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=14760.0, ans=0.125 2023-06-17 18:04:52,612 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=13.08 2023-06-17 18:05:06,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=14940.0, ans=0.1506 2023-06-17 18:05:16,582 INFO [train.py:996] (0/4) Epoch 1, batch 2500, loss[loss=0.4846, simple_loss=0.4954, pruned_loss=0.2369, over 21664.00 frames. ], tot_loss[loss=0.4832, simple_loss=0.4722, pruned_loss=0.2472, over 4267041.04 frames. ], batch size: 332, lr: 4.38e-02, grad_scale: 8.0 2023-06-17 18:05:28,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.577e+02 4.881e+02 6.609e+02 9.679e+02 1.963e+03, threshold=1.322e+03, percent-clipped=4.0 2023-06-17 18:05:29,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.26 vs. limit=5.25 2023-06-17 18:06:08,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=15120.0, ans=0.0 2023-06-17 18:06:08,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=15120.0, ans=0.125 2023-06-17 18:06:41,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.85 vs. limit=5.286 2023-06-17 18:06:59,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.07 vs. limit=18.975 2023-06-17 18:07:00,748 INFO [train.py:996] (0/4) Epoch 1, batch 2550, loss[loss=0.4246, simple_loss=0.4344, pruned_loss=0.2074, over 21843.00 frames. ], tot_loss[loss=0.4797, simple_loss=0.4704, pruned_loss=0.2446, over 4258399.21 frames. ], batch size: 98, lr: 4.38e-02, grad_scale: 8.0 2023-06-17 18:07:26,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=10.144 2023-06-17 18:08:17,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=15540.0, ans=0.125 2023-06-17 18:08:37,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=15600.0, ans=0.007478260869565217 2023-06-17 18:08:38,596 INFO [train.py:996] (0/4) Epoch 1, batch 2600, loss[loss=0.5344, simple_loss=0.5115, pruned_loss=0.2786, over 21598.00 frames. ], tot_loss[loss=0.4816, simple_loss=0.4711, pruned_loss=0.2462, over 4268536.66 frames. ], batch size: 415, lr: 4.37e-02, grad_scale: 8.0 2023-06-17 18:08:47,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.29 vs. 
limit=13.35 2023-06-17 18:08:50,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.934e+02 4.629e+02 7.030e+02 1.078e+03 2.784e+03, threshold=1.406e+03, percent-clipped=16.0 2023-06-17 18:09:41,741 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=2.541e-03 2023-06-17 18:10:17,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=7.168 2023-06-17 18:10:24,606 INFO [train.py:996] (0/4) Epoch 1, batch 2650, loss[loss=0.4091, simple_loss=0.4297, pruned_loss=0.1943, over 17423.00 frames. ], tot_loss[loss=0.4818, simple_loss=0.4716, pruned_loss=0.2461, over 4268944.09 frames. ], batch size: 60, lr: 4.37e-02, grad_scale: 8.0 2023-06-17 18:11:42,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=16080.0, ans=0.125 2023-06-17 18:11:56,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=16140.0, ans=0.0 2023-06-17 18:11:59,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=16140.0, ans=0.007360869565217391 2023-06-17 18:12:10,381 INFO [train.py:996] (0/4) Epoch 1, batch 2700, loss[loss=0.5781, simple_loss=0.5688, pruned_loss=0.2936, over 21355.00 frames. ], tot_loss[loss=0.4715, simple_loss=0.4653, pruned_loss=0.2389, over 4261677.80 frames. ], batch size: 548, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:12:11,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.94 vs. limit=13.575 2023-06-17 18:12:17,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=16200.0, ans=0.125 2023-06-17 18:12:19,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=16200.0, ans=0.125 2023-06-17 18:12:21,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.712e+02 4.286e+02 6.579e+02 1.091e+03 3.152e+03, threshold=1.316e+03, percent-clipped=14.0 2023-06-17 18:12:33,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=16260.0, ans=0.0 2023-06-17 18:13:22,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=16380.0, ans=0.0 2023-06-17 18:13:54,413 INFO [train.py:996] (0/4) Epoch 1, batch 2750, loss[loss=0.4646, simple_loss=0.4934, pruned_loss=0.2179, over 21821.00 frames. ], tot_loss[loss=0.466, simple_loss=0.4619, pruned_loss=0.235, over 4262594.23 frames. ], batch size: 298, lr: 4.36e-02, grad_scale: 4.0 2023-06-17 18:14:12,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=13.71 2023-06-17 18:14:50,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=16620.0, ans=0.125 2023-06-17 18:15:15,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16680.0, ans=0.1332 2023-06-17 18:15:35,675 INFO [train.py:996] (0/4) Epoch 1, batch 2800, loss[loss=0.5277, simple_loss=0.5183, pruned_loss=0.2686, over 21766.00 frames. 
], tot_loss[loss=0.4695, simple_loss=0.466, pruned_loss=0.2365, over 4270293.23 frames. ], batch size: 332, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:15:59,820 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.283e+02 4.894e+02 6.832e+02 1.003e+03 4.773e+03, threshold=1.366e+03, percent-clipped=15.0 2023-06-17 18:16:17,606 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=13.8225 2023-06-17 18:16:30,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=16920.0, ans=0.007191304347826087 2023-06-17 18:16:59,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=16980.0, ans=0.007178260869565217 2023-06-17 18:17:20,327 INFO [train.py:996] (0/4) Epoch 1, batch 2850, loss[loss=0.3073, simple_loss=0.3134, pruned_loss=0.1506, over 21136.00 frames. ], tot_loss[loss=0.4663, simple_loss=0.4638, pruned_loss=0.2344, over 4270717.92 frames. ], batch size: 143, lr: 4.35e-02, grad_scale: 8.0 2023-06-17 18:18:09,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=17220.0, ans=0.125 2023-06-17 18:18:30,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=17280.0, ans=0.125 2023-06-17 18:19:03,969 INFO [train.py:996] (0/4) Epoch 1, batch 2900, loss[loss=0.4937, simple_loss=0.4786, pruned_loss=0.2544, over 21898.00 frames. ], tot_loss[loss=0.4597, simple_loss=0.4584, pruned_loss=0.2305, over 4267611.50 frames. ], batch size: 371, lr: 4.35e-02, grad_scale: 8.0 2023-06-17 18:19:28,666 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 4.517e+02 6.306e+02 8.812e+02 1.788e+03, threshold=1.261e+03, percent-clipped=6.0 2023-06-17 18:19:46,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=17460.0, ans=0.0 2023-06-17 18:20:21,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=17580.0, ans=0.0 2023-06-17 18:20:36,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17640.0, ans=0.12360000000000002 2023-06-17 18:20:54,504 INFO [train.py:996] (0/4) Epoch 1, batch 2950, loss[loss=0.4076, simple_loss=0.4067, pruned_loss=0.2043, over 21630.00 frames. ], tot_loss[loss=0.4618, simple_loss=0.4611, pruned_loss=0.2312, over 4272449.67 frames. ], batch size: 263, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:20:55,607 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=14.1375 2023-06-17 18:21:20,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.75 vs. 
limit=14.16 2023-06-17 18:21:54,607 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:22:17,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=17940.0, ans=0.125 2023-06-17 18:22:44,547 INFO [train.py:996] (0/4) Epoch 1, batch 3000, loss[loss=0.5766, simple_loss=0.5444, pruned_loss=0.3043, over 21376.00 frames. ], tot_loss[loss=0.4626, simple_loss=0.4641, pruned_loss=0.2305, over 4277100.82 frames. ], batch size: 508, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:22:44,547 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-17 18:23:01,483 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3658, simple_loss=0.4363, pruned_loss=0.1476, over 1796401.00 frames. 2023-06-17 18:23:01,484 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-17 18:23:20,434 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.233e+02 5.025e+02 6.573e+02 9.808e+02 2.550e+03, threshold=1.315e+03, percent-clipped=11.0 2023-06-17 18:23:37,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18060.0, ans=0.1194 2023-06-17 18:23:44,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=18120.0, ans=0.0 2023-06-17 18:23:52,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=18120.0, ans=0.125 2023-06-17 18:24:08,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=18180.0, ans=0.26370000000000005 2023-06-17 18:24:14,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.81 vs. limit=21.134999999999998 2023-06-17 18:24:45,857 INFO [train.py:996] (0/4) Epoch 1, batch 3050, loss[loss=0.2512, simple_loss=0.2781, pruned_loss=0.1122, over 16457.00 frames. ], tot_loss[loss=0.46, simple_loss=0.4646, pruned_loss=0.2277, over 4270901.23 frames. ], batch size: 60, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:25:34,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=18420.0, ans=0.125 2023-06-17 18:26:35,516 INFO [train.py:996] (0/4) Epoch 1, batch 3100, loss[loss=0.3823, simple_loss=0.4248, pruned_loss=0.1699, over 21831.00 frames. ], tot_loss[loss=0.4542, simple_loss=0.4612, pruned_loss=0.2236, over 4279274.29 frames. 
], batch size: 282, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:26:53,817 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.845e+02 4.821e+02 6.301e+02 1.043e+03 2.318e+03, threshold=1.260e+03, percent-clipped=14.0 2023-06-17 18:27:17,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=18720.0, ans=0.125 2023-06-17 18:27:20,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18720.0, ans=0.11280000000000001 2023-06-17 18:27:20,769 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:28:20,049 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.87 vs. limit=14.5875 2023-06-17 18:28:20,387 INFO [train.py:996] (0/4) Epoch 1, batch 3150, loss[loss=0.3882, simple_loss=0.3643, pruned_loss=0.206, over 19990.00 frames. ], tot_loss[loss=0.457, simple_loss=0.4632, pruned_loss=0.2254, over 4275603.91 frames. ], batch size: 704, lr: 4.32e-02, grad_scale: 8.0 2023-06-17 18:28:21,514 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.94 vs. limit=21.675 2023-06-17 18:29:19,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19020.0, ans=0.10980000000000001 2023-06-17 18:30:03,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=19140.0, ans=0.0 2023-06-17 18:30:07,514 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.01 vs. limit=14.57 2023-06-17 18:30:11,532 INFO [train.py:996] (0/4) Epoch 1, batch 3200, loss[loss=0.4486, simple_loss=0.4802, pruned_loss=0.2086, over 21621.00 frames. ], tot_loss[loss=0.4557, simple_loss=0.4632, pruned_loss=0.2241, over 4273251.69 frames. ], batch size: 263, lr: 4.32e-02, grad_scale: 16.0 2023-06-17 18:30:13,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=19200.0, ans=0.10800000000000001 2023-06-17 18:30:24,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 4.999e+02 6.065e+02 1.040e+03 2.031e+03, threshold=1.213e+03, percent-clipped=14.0 2023-06-17 18:30:29,081 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=11.704 2023-06-17 18:30:43,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19260.0, ans=0.10740000000000002 2023-06-17 18:31:32,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=19440.0, ans=0.0 2023-06-17 18:31:55,108 INFO [train.py:996] (0/4) Epoch 1, batch 3250, loss[loss=0.4306, simple_loss=0.4281, pruned_loss=0.2166, over 19905.00 frames. ], tot_loss[loss=0.4598, simple_loss=0.4649, pruned_loss=0.2274, over 4271301.82 frames. 
], batch size: 702, lr: 4.31e-02, grad_scale: 8.0 2023-06-17 18:32:15,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=19560.0, ans=0.125 2023-06-17 18:33:19,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=19680.0, ans=0.125 2023-06-17 18:33:24,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=19740.0, ans=0.125 2023-06-17 18:33:24,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=14.9025 2023-06-17 18:33:32,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=19740.0, ans=0.10260000000000002 2023-06-17 18:33:37,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=19740.0, ans=14.87 2023-06-17 18:33:40,028 INFO [train.py:996] (0/4) Epoch 1, batch 3300, loss[loss=0.4582, simple_loss=0.4704, pruned_loss=0.2229, over 21270.00 frames. ], tot_loss[loss=0.453, simple_loss=0.4583, pruned_loss=0.2239, over 4270677.53 frames. ], batch size: 548, lr: 4.31e-02, grad_scale: 8.0 2023-06-17 18:33:45,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=19800.0, ans=0.948 2023-06-17 18:34:06,168 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 4.541e+02 6.764e+02 1.015e+03 2.529e+03, threshold=1.353e+03, percent-clipped=14.0 2023-06-17 18:34:17,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=19860.0, ans=0.0 2023-06-17 18:34:47,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=19980.0, ans=0.0 2023-06-17 18:34:49,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=19980.0, ans=0.125 2023-06-17 18:34:56,628 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=5.148e-03 2023-06-17 18:35:24,219 INFO [train.py:996] (0/4) Epoch 1, batch 3350, loss[loss=0.4546, simple_loss=0.4605, pruned_loss=0.2244, over 21485.00 frames. ], tot_loss[loss=0.4518, simple_loss=0.4583, pruned_loss=0.2226, over 4268182.82 frames. 
], batch size: 194, lr: 4.30e-02, grad_scale: 8.0 2023-06-17 18:35:29,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=20100.0, ans=0.0 2023-06-17 18:36:12,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=20220.0, ans=0.0064739130434782605 2023-06-17 18:36:16,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=20220.0, ans=0.125 2023-06-17 18:36:20,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=20220.0, ans=0.0 2023-06-17 18:37:06,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=20400.0, ans=0.125 2023-06-17 18:37:12,841 INFO [train.py:996] (0/4) Epoch 1, batch 3400, loss[loss=0.4287, simple_loss=0.424, pruned_loss=0.2167, over 20137.00 frames. ], tot_loss[loss=0.449, simple_loss=0.4562, pruned_loss=0.2209, over 4275650.71 frames. ], batch size: 702, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 18:37:27,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20400.0, ans=0.1 2023-06-17 18:37:34,559 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.423e+02 4.338e+02 6.007e+02 8.675e+02 3.027e+03, threshold=1.201e+03, percent-clipped=6.0 2023-06-17 18:37:52,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2023-06-17 18:37:56,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=20520.0, ans=0.125 2023-06-17 18:38:21,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-17 18:39:03,062 INFO [train.py:996] (0/4) Epoch 1, batch 3450, loss[loss=0.4026, simple_loss=0.412, pruned_loss=0.1966, over 21807.00 frames. ], tot_loss[loss=0.4425, simple_loss=0.4489, pruned_loss=0.218, over 4279848.66 frames. ], batch size: 107, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 18:39:03,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=20700.0, ans=0.0 2023-06-17 18:39:27,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=20760.0, ans=0.125 2023-06-17 18:39:30,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=20760.0, ans=0.0 2023-06-17 18:39:34,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=20760.0, ans=0.125 2023-06-17 18:40:43,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=20940.0, ans=0.006317391304347827 2023-06-17 18:40:47,104 INFO [train.py:996] (0/4) Epoch 1, batch 3500, loss[loss=0.5778, simple_loss=0.5473, pruned_loss=0.3041, over 21930.00 frames. ], tot_loss[loss=0.4541, simple_loss=0.4603, pruned_loss=0.224, over 4266957.01 frames. 
], batch size: 372, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 18:40:47,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=21000.0, ans=0.125 2023-06-17 18:41:09,665 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 4.958e+02 6.770e+02 9.160e+02 2.307e+03, threshold=1.354e+03, percent-clipped=16.0 2023-06-17 18:41:13,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-17 18:41:15,894 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.59 vs. limit=22.5 2023-06-17 18:41:28,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=21060.0, ans=0.0 2023-06-17 18:42:11,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=21180.0, ans=0.1 2023-06-17 18:42:12,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=21180.0, ans=0.0 2023-06-17 18:42:32,177 INFO [train.py:996] (0/4) Epoch 1, batch 3550, loss[loss=0.413, simple_loss=0.4008, pruned_loss=0.2126, over 20190.00 frames. ], tot_loss[loss=0.4557, simple_loss=0.4622, pruned_loss=0.2245, over 4258790.37 frames. ], batch size: 703, lr: 4.28e-02, grad_scale: 4.0 2023-06-17 18:42:46,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=21300.0, ans=0.2 2023-06-17 18:43:12,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=21360.0, ans=0.0 2023-06-17 18:43:35,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=21480.0, ans=0.125 2023-06-17 18:43:56,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=21480.0, ans=0.125 2023-06-17 18:44:05,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=21540.0, ans=0.125 2023-06-17 18:44:21,645 INFO [train.py:996] (0/4) Epoch 1, batch 3600, loss[loss=0.3704, simple_loss=0.4107, pruned_loss=0.1651, over 16452.00 frames. ], tot_loss[loss=0.4513, simple_loss=0.4567, pruned_loss=0.223, over 4256710.02 frames. 
], batch size: 63, lr: 4.27e-02, grad_scale: 8.0 2023-06-17 18:44:33,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=21600.0, ans=0.0 2023-06-17 18:44:37,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=21660.0, ans=0.125 2023-06-17 18:44:39,446 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 4.436e+02 5.716e+02 8.040e+02 1.927e+03, threshold=1.143e+03, percent-clipped=4.0 2023-06-17 18:44:50,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=21660.0, ans=0.0 2023-06-17 18:44:53,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=21660.0, ans=0.125 2023-06-17 18:45:01,941 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-17 18:46:05,212 INFO [train.py:996] (0/4) Epoch 1, batch 3650, loss[loss=0.5756, simple_loss=0.5703, pruned_loss=0.2905, over 21559.00 frames. ], tot_loss[loss=0.4534, simple_loss=0.4595, pruned_loss=0.2236, over 4264559.45 frames. ], batch size: 263, lr: 4.27e-02, grad_scale: 8.0 2023-06-17 18:46:08,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=21900.0, ans=0.0 2023-06-17 18:46:18,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=21900.0, ans=0.5 2023-06-17 18:46:50,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-06-17 18:47:48,613 INFO [train.py:996] (0/4) Epoch 1, batch 3700, loss[loss=0.4345, simple_loss=0.4486, pruned_loss=0.2102, over 21902.00 frames. ], tot_loss[loss=0.4504, simple_loss=0.4579, pruned_loss=0.2215, over 4270526.92 frames. ], batch size: 351, lr: 4.26e-02, grad_scale: 8.0 2023-06-17 18:47:52,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-17 18:48:06,735 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.875e+02 4.996e+02 7.328e+02 1.013e+03 2.628e+03, threshold=1.466e+03, percent-clipped=16.0 2023-06-17 18:48:07,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=22260.0, ans=0.125 2023-06-17 18:48:31,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=22320.0, ans=0.0 2023-06-17 18:48:32,566 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.06 vs. limit=22.5 2023-06-17 18:48:33,961 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.09 vs. limit=22.5 2023-06-17 18:48:42,220 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. limit=10.0 2023-06-17 18:49:28,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. 
limit=12.0 2023-06-17 18:49:32,169 INFO [train.py:996] (0/4) Epoch 1, batch 3750, loss[loss=0.4913, simple_loss=0.5317, pruned_loss=0.2254, over 20904.00 frames. ], tot_loss[loss=0.4435, simple_loss=0.452, pruned_loss=0.2175, over 4273662.95 frames. ], batch size: 608, lr: 4.26e-02, grad_scale: 8.0 2023-06-17 18:49:32,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=22500.0, ans=0.125 2023-06-17 18:50:12,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=22620.0, ans=0.125 2023-06-17 18:50:20,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=22620.0, ans=0.025 2023-06-17 18:51:11,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=22740.0, ans=0.07 2023-06-17 18:51:16,285 INFO [train.py:996] (0/4) Epoch 1, batch 3800, loss[loss=0.4547, simple_loss=0.4662, pruned_loss=0.2216, over 21309.00 frames. ], tot_loss[loss=0.441, simple_loss=0.4509, pruned_loss=0.2155, over 4275083.98 frames. ], batch size: 143, lr: 4.25e-02, grad_scale: 8.0 2023-06-17 18:51:18,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=22800.0, ans=0.00591304347826087 2023-06-17 18:51:36,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=22860.0, ans=0.125 2023-06-17 18:51:39,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.895e+02 5.418e+02 7.571e+02 2.562e+03, threshold=1.084e+03, percent-clipped=5.0 2023-06-17 18:51:50,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=22860.0, ans=0.2 2023-06-17 18:52:13,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=22920.0, ans=0.2 2023-06-17 18:52:34,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.71 vs. limit=6.0 2023-06-17 18:52:35,682 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:52:58,991 INFO [train.py:996] (0/4) Epoch 1, batch 3850, loss[loss=0.362, simple_loss=0.3756, pruned_loss=0.1742, over 21584.00 frames. ], tot_loss[loss=0.4376, simple_loss=0.4462, pruned_loss=0.2145, over 4283068.81 frames. ], batch size: 298, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 18:53:29,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=23160.0, ans=0.1 2023-06-17 18:53:29,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.68 vs. 
limit=15.0 2023-06-17 18:53:36,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=23220.0, ans=0.035 2023-06-17 18:54:26,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=23340.0, ans=0.125 2023-06-17 18:54:38,246 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-17 18:54:40,132 INFO [train.py:996] (0/4) Epoch 1, batch 3900, loss[loss=0.446, simple_loss=0.4449, pruned_loss=0.2235, over 21862.00 frames. ], tot_loss[loss=0.431, simple_loss=0.4395, pruned_loss=0.2113, over 4279066.32 frames. ], batch size: 414, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 18:54:46,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=23400.0, ans=0.005782608695652174 2023-06-17 18:54:59,272 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.423e+02 4.722e+02 6.490e+02 9.055e+02 2.329e+03, threshold=1.298e+03, percent-clipped=15.0 2023-06-17 18:55:44,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=23580.0, ans=0.125 2023-06-17 18:56:21,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.04 vs. limit=10.0 2023-06-17 18:56:25,079 INFO [train.py:996] (0/4) Epoch 1, batch 3950, loss[loss=0.3661, simple_loss=0.4091, pruned_loss=0.1615, over 21807.00 frames. ], tot_loss[loss=0.4278, simple_loss=0.4386, pruned_loss=0.2085, over 4274446.44 frames. ], batch size: 371, lr: 4.23e-02, grad_scale: 8.0 2023-06-17 18:56:50,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=23760.0, ans=0.0 2023-06-17 18:57:36,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=23880.0, ans=0.125 2023-06-17 18:58:06,602 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-4000.pt 2023-06-17 18:58:09,770 INFO [train.py:996] (0/4) Epoch 1, batch 4000, loss[loss=0.4129, simple_loss=0.4038, pruned_loss=0.211, over 21452.00 frames. ], tot_loss[loss=0.4199, simple_loss=0.4326, pruned_loss=0.2036, over 4274794.01 frames. ], batch size: 441, lr: 4.23e-02, grad_scale: 16.0 2023-06-17 18:58:33,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.808e+02 4.109e+02 5.052e+02 7.332e+02 1.857e+03, threshold=1.010e+03, percent-clipped=6.0 2023-06-17 18:58:57,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=24120.0, ans=0.125 2023-06-17 18:59:22,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=24180.0, ans=0.125 2023-06-17 18:59:29,744 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. 
limit=12.0 2023-06-17 18:59:51,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=24300.0, ans=0.125 2023-06-17 18:59:52,455 INFO [train.py:996] (0/4) Epoch 1, batch 4050, loss[loss=0.302, simple_loss=0.367, pruned_loss=0.1185, over 21690.00 frames. ], tot_loss[loss=0.4157, simple_loss=0.4318, pruned_loss=0.1998, over 4278919.15 frames. ], batch size: 247, lr: 4.22e-02, grad_scale: 8.0 2023-06-17 18:59:53,600 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. limit=10.0 2023-06-17 19:00:17,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=24360.0, ans=0.125 2023-06-17 19:00:54,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=24420.0, ans=0.125 2023-06-17 19:01:21,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=24540.0, ans=0.005534782608695652 2023-06-17 19:01:35,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.91 vs. limit=15.0 2023-06-17 19:01:35,950 INFO [train.py:996] (0/4) Epoch 1, batch 4100, loss[loss=0.3982, simple_loss=0.411, pruned_loss=0.1927, over 21901.00 frames. ], tot_loss[loss=0.4185, simple_loss=0.4331, pruned_loss=0.202, over 4284162.08 frames. ], batch size: 332, lr: 4.22e-02, grad_scale: 8.0 2023-06-17 19:02:01,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.803e+02 4.142e+02 6.350e+02 1.020e+03 2.376e+03, threshold=1.270e+03, percent-clipped=25.0 2023-06-17 19:02:37,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=24.87 vs. limit=15.0 2023-06-17 19:03:08,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24840.0, ans=0.1 2023-06-17 19:03:19,165 INFO [train.py:996] (0/4) Epoch 1, batch 4150, loss[loss=0.346, simple_loss=0.3838, pruned_loss=0.1541, over 21337.00 frames. ], tot_loss[loss=0.4145, simple_loss=0.434, pruned_loss=0.1975, over 4274323.66 frames. ], batch size: 131, lr: 4.21e-02, grad_scale: 8.0 2023-06-17 19:04:24,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=25020.0, ans=0.005430434782608695 2023-06-17 19:04:27,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.44 vs. limit=22.5 2023-06-17 19:04:34,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=25080.0, ans=0.125 2023-06-17 19:04:39,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=25080.0, ans=0.1 2023-06-17 19:05:10,038 INFO [train.py:996] (0/4) Epoch 1, batch 4200, loss[loss=0.5131, simple_loss=0.4993, pruned_loss=0.2635, over 21385.00 frames. ], tot_loss[loss=0.4141, simple_loss=0.4342, pruned_loss=0.197, over 4271697.67 frames. 
], batch size: 548, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:05:46,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 4.243e+02 5.422e+02 7.726e+02 1.559e+03, threshold=1.084e+03, percent-clipped=3.0 2023-06-17 19:06:19,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-17 19:06:39,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=25440.0, ans=0.2 2023-06-17 19:06:39,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=25440.0, ans=0.125 2023-06-17 19:06:41,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=25440.0, ans=0.025 2023-06-17 19:06:42,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25440.0, ans=0.1 2023-06-17 19:07:07,390 INFO [train.py:996] (0/4) Epoch 1, batch 4250, loss[loss=0.5895, simple_loss=0.5674, pruned_loss=0.3058, over 21342.00 frames. ], tot_loss[loss=0.4232, simple_loss=0.4422, pruned_loss=0.2021, over 4267681.76 frames. ], batch size: 507, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:07:10,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=25500.0, ans=0.125 2023-06-17 19:07:26,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=25500.0, ans=0.125 2023-06-17 19:07:34,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=25560.0, ans=0.125 2023-06-17 19:08:02,903 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.73 vs. limit=10.0 2023-06-17 19:08:03,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=25680.0, ans=0.125 2023-06-17 19:08:58,892 INFO [train.py:996] (0/4) Epoch 1, batch 4300, loss[loss=0.4652, simple_loss=0.5043, pruned_loss=0.213, over 21527.00 frames. ], tot_loss[loss=0.4302, simple_loss=0.4502, pruned_loss=0.2051, over 4261736.47 frames. ], batch size: 471, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:09:10,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=25800.0, ans=22.5 2023-06-17 19:09:18,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.691e+02 4.313e+02 6.396e+02 8.892e+02 2.391e+03, threshold=1.279e+03, percent-clipped=16.0 2023-06-17 19:09:30,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=25920.0, ans=0.2 2023-06-17 19:10:35,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=26040.0, ans=0.0 2023-06-17 19:10:42,370 INFO [train.py:996] (0/4) Epoch 1, batch 4350, loss[loss=0.409, simple_loss=0.415, pruned_loss=0.2015, over 21458.00 frames. ], tot_loss[loss=0.4266, simple_loss=0.4464, pruned_loss=0.2035, over 4254615.20 frames. 
], batch size: 389, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:11:07,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=26160.0, ans=0.125 2023-06-17 19:11:40,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.28 vs. limit=22.5 2023-06-17 19:12:17,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=26340.0, ans=0.0 2023-06-17 19:12:27,488 INFO [train.py:996] (0/4) Epoch 1, batch 4400, loss[loss=0.4699, simple_loss=0.4711, pruned_loss=0.2343, over 19959.00 frames. ], tot_loss[loss=0.4208, simple_loss=0.4399, pruned_loss=0.2009, over 4251241.09 frames. ], batch size: 702, lr: 4.18e-02, grad_scale: 16.0 2023-06-17 19:12:29,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=26400.0, ans=0.125 2023-06-17 19:12:40,969 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5 2023-06-17 19:12:48,108 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.456e+02 3.803e+02 5.319e+02 7.173e+02 2.856e+03, threshold=1.064e+03, percent-clipped=8.0 2023-06-17 19:13:10,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=26520.0, ans=0.1 2023-06-17 19:14:01,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=26640.0, ans=0.125 2023-06-17 19:14:12,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=26700.0, ans=0.125 2023-06-17 19:14:13,435 INFO [train.py:996] (0/4) Epoch 1, batch 4450, loss[loss=0.4909, simple_loss=0.5094, pruned_loss=0.2362, over 21775.00 frames. ], tot_loss[loss=0.4239, simple_loss=0.4462, pruned_loss=0.2008, over 4255154.49 frames. ], batch size: 414, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 19:14:48,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=26760.0, ans=0.125 2023-06-17 19:15:07,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=26820.0, ans=0.0 2023-06-17 19:15:51,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=27000.0, ans=0.2 2023-06-17 19:15:52,007 INFO [train.py:996] (0/4) Epoch 1, batch 4500, loss[loss=0.4768, simple_loss=0.4977, pruned_loss=0.228, over 21630.00 frames. ], tot_loss[loss=0.4291, simple_loss=0.4486, pruned_loss=0.2048, over 4261652.54 frames. 
], batch size: 471, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 19:16:13,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=27060.0, ans=0.004986956521739131 2023-06-17 19:16:19,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.691e+02 6.117e+02 8.779e+02 1.856e+03, threshold=1.223e+03, percent-clipped=14.0 2023-06-17 19:16:21,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=27060.0, ans=10.0 2023-06-17 19:16:59,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=27120.0, ans=0.125 2023-06-17 19:17:20,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=27240.0, ans=0.035 2023-06-17 19:17:28,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=27240.0, ans=0.0 2023-06-17 19:17:35,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.56 vs. limit=22.5 2023-06-17 19:17:36,030 INFO [train.py:996] (0/4) Epoch 1, batch 4550, loss[loss=0.5059, simple_loss=0.5074, pruned_loss=0.2523, over 21603.00 frames. ], tot_loss[loss=0.4315, simple_loss=0.4523, pruned_loss=0.2054, over 4264610.22 frames. ], batch size: 389, lr: 4.16e-02, grad_scale: 8.0 2023-06-17 19:18:21,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=27420.0, ans=0.1 2023-06-17 19:18:22,459 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-17 19:18:51,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=27480.0, ans=0.2 2023-06-17 19:18:59,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=27540.0, ans=0.2 2023-06-17 19:19:03,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=27540.0, ans=0.0 2023-06-17 19:19:19,254 INFO [train.py:996] (0/4) Epoch 1, batch 4600, loss[loss=0.3989, simple_loss=0.4172, pruned_loss=0.1903, over 21394.00 frames. ], tot_loss[loss=0.4321, simple_loss=0.4531, pruned_loss=0.2056, over 4267804.05 frames. ], batch size: 211, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 19:19:46,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.894e+02 4.493e+02 6.587e+02 9.549e+02 1.987e+03, threshold=1.317e+03, percent-clipped=15.0 2023-06-17 19:19:53,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=27660.0, ans=0.125 2023-06-17 19:19:55,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=27660.0, ans=0.0 2023-06-17 19:20:12,844 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. 
limit=10.0 2023-06-17 19:20:17,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=27720.0, ans=0.125 2023-06-17 19:20:29,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-17 19:20:56,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=27840.0, ans=0.125 2023-06-17 19:21:01,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27900.0, ans=0.1 2023-06-17 19:21:02,849 INFO [train.py:996] (0/4) Epoch 1, batch 4650, loss[loss=0.3244, simple_loss=0.3557, pruned_loss=0.1466, over 21766.00 frames. ], tot_loss[loss=0.42, simple_loss=0.4413, pruned_loss=0.1993, over 4276019.98 frames. ], batch size: 247, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 19:21:13,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=27900.0, ans=0.004804347826086957 2023-06-17 19:21:55,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=28020.0, ans=0.2 2023-06-17 19:22:06,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=28080.0, ans=0.125 2023-06-17 19:22:17,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=28140.0, ans=0.2 2023-06-17 19:22:31,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=28140.0, ans=0.125 2023-06-17 19:22:40,799 INFO [train.py:996] (0/4) Epoch 1, batch 4700, loss[loss=0.3652, simple_loss=0.3841, pruned_loss=0.1731, over 21741.00 frames. ], tot_loss[loss=0.4106, simple_loss=0.4304, pruned_loss=0.1954, over 4274033.40 frames. ], batch size: 112, lr: 4.14e-02, grad_scale: 8.0 2023-06-17 19:22:58,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=28200.0, ans=0.2 2023-06-17 19:23:08,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=28260.0, ans=0.125 2023-06-17 19:23:12,936 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.969e+02 4.560e+02 5.738e+02 6.731e+02 1.328e+03, threshold=1.148e+03, percent-clipped=1.0 2023-06-17 19:23:25,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=28260.0, ans=0.0 2023-06-17 19:23:30,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=12.0 2023-06-17 19:23:46,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=28380.0, ans=0.125 2023-06-17 19:23:58,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=28380.0, ans=0.125 2023-06-17 19:24:17,551 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.96 vs. 
limit=22.5 2023-06-17 19:24:22,967 INFO [train.py:996] (0/4) Epoch 1, batch 4750, loss[loss=0.4451, simple_loss=0.4694, pruned_loss=0.2103, over 20767.00 frames. ], tot_loss[loss=0.4078, simple_loss=0.4245, pruned_loss=0.1955, over 4277213.28 frames. ], batch size: 608, lr: 4.14e-02, grad_scale: 8.0 2023-06-17 19:25:07,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-17 19:25:13,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=28620.0, ans=0.125 2023-06-17 19:25:23,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0 2023-06-17 19:25:51,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=28740.0, ans=0.125 2023-06-17 19:26:00,031 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-06-17 19:26:01,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=28800.0, ans=0.05 2023-06-17 19:26:08,440 INFO [train.py:996] (0/4) Epoch 1, batch 4800, loss[loss=0.3775, simple_loss=0.4061, pruned_loss=0.1745, over 21374.00 frames. ], tot_loss[loss=0.4094, simple_loss=0.4258, pruned_loss=0.1965, over 4280047.95 frames. ], batch size: 131, lr: 4.13e-02, grad_scale: 16.0 2023-06-17 19:26:21,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=28800.0, ans=0.004608695652173913 2023-06-17 19:26:40,344 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.777e+02 4.396e+02 5.630e+02 9.544e+02 1.768e+03, threshold=1.126e+03, percent-clipped=14.0 2023-06-17 19:27:01,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-17 19:27:19,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=28980.0, ans=0.125 2023-06-17 19:27:31,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=29040.0, ans=0.125 2023-06-17 19:27:44,668 INFO [train.py:996] (0/4) Epoch 1, batch 4850, loss[loss=0.4612, simple_loss=0.5054, pruned_loss=0.2084, over 19962.00 frames. ], tot_loss[loss=0.4101, simple_loss=0.4268, pruned_loss=0.1967, over 4279125.38 frames. 
], batch size: 703, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 19:28:16,576 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:28:17,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=29160.0, ans=0.125 2023-06-17 19:28:19,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=29160.0, ans=0.1 2023-06-17 19:28:38,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=29220.0, ans=0.125 2023-06-17 19:29:29,295 INFO [train.py:996] (0/4) Epoch 1, batch 4900, loss[loss=0.4331, simple_loss=0.4376, pruned_loss=0.2143, over 21602.00 frames. ], tot_loss[loss=0.4126, simple_loss=0.4291, pruned_loss=0.1981, over 4282637.68 frames. ], batch size: 548, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 19:29:56,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=29460.0, ans=0.07 2023-06-17 19:29:58,877 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-17 19:30:02,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-17 19:30:02,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.963e+02 4.351e+02 5.424e+02 7.801e+02 1.566e+03, threshold=1.085e+03, percent-clipped=9.0 2023-06-17 19:30:52,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=29580.0, ans=10.0 2023-06-17 19:31:13,103 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-17 19:31:24,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29700.0, ans=0.1 2023-06-17 19:31:25,681 INFO [train.py:996] (0/4) Epoch 1, batch 4950, loss[loss=0.3583, simple_loss=0.422, pruned_loss=0.1473, over 21833.00 frames. ], tot_loss[loss=0.4106, simple_loss=0.4321, pruned_loss=0.1946, over 4284337.39 frames. ], batch size: 317, lr: 4.11e-02, grad_scale: 16.0 2023-06-17 19:31:51,493 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:31:56,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=29760.0, ans=0.2 2023-06-17 19:32:08,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=29820.0, ans=0.125 2023-06-17 19:32:14,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=29820.0, ans=0.0 2023-06-17 19:32:17,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=29880.0, ans=0.125 2023-06-17 19:33:07,743 INFO [train.py:996] (0/4) Epoch 1, batch 5000, loss[loss=0.4537, simple_loss=0.4476, pruned_loss=0.2299, over 21492.00 frames. ], tot_loss[loss=0.4045, simple_loss=0.431, pruned_loss=0.189, over 4279795.53 frames. 
], batch size: 548, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 19:33:34,077 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 4.453e+02 5.189e+02 7.873e+02 1.529e+03, threshold=1.038e+03, percent-clipped=6.0 2023-06-17 19:33:34,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=30060.0, ans=0.125 2023-06-17 19:33:41,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30060.0, ans=0.1 2023-06-17 19:33:45,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=30120.0, ans=0.0 2023-06-17 19:33:58,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=30120.0, ans=0.05 2023-06-17 19:34:05,699 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.79 vs. limit=22.5 2023-06-17 19:34:07,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=30180.0, ans=0.125 2023-06-17 19:34:13,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=30180.0, ans=0.004308695652173913 2023-06-17 19:34:29,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=30240.0, ans=0.004295652173913044 2023-06-17 19:34:45,183 INFO [train.py:996] (0/4) Epoch 1, batch 5050, loss[loss=0.3496, simple_loss=0.3996, pruned_loss=0.1498, over 21638.00 frames. ], tot_loss[loss=0.4086, simple_loss=0.4329, pruned_loss=0.1922, over 4286570.36 frames. ], batch size: 263, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 19:34:46,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-17 19:34:59,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=30300.0, ans=0.125 2023-06-17 19:35:18,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=30360.0, ans=0.0 2023-06-17 19:35:58,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=30480.0, ans=0.125 2023-06-17 19:36:01,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=30480.0, ans=0.125 2023-06-17 19:36:27,880 INFO [train.py:996] (0/4) Epoch 1, batch 5100, loss[loss=0.3812, simple_loss=0.3992, pruned_loss=0.1816, over 21648.00 frames. ], tot_loss[loss=0.4088, simple_loss=0.4322, pruned_loss=0.1927, over 4289249.46 frames. 
], batch size: 263, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 19:36:43,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=30600.0, ans=0.125 2023-06-17 19:36:59,590 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.656e+02 4.521e+02 5.607e+02 7.657e+02 1.284e+03, threshold=1.121e+03, percent-clipped=8.0 2023-06-17 19:38:06,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=30840.0, ans=0.05 2023-06-17 19:38:10,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=30900.0, ans=0.0 2023-06-17 19:38:11,341 INFO [train.py:996] (0/4) Epoch 1, batch 5150, loss[loss=0.4188, simple_loss=0.4413, pruned_loss=0.1982, over 21864.00 frames. ], tot_loss[loss=0.4066, simple_loss=0.4301, pruned_loss=0.1916, over 4285988.29 frames. ], batch size: 371, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 19:38:20,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=30900.0, ans=0.2 2023-06-17 19:38:45,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=30960.0, ans=0.125 2023-06-17 19:38:55,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=31020.0, ans=0.00412608695652174 2023-06-17 19:39:00,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=31020.0, ans=0.125 2023-06-17 19:40:01,190 INFO [train.py:996] (0/4) Epoch 1, batch 5200, loss[loss=0.4299, simple_loss=0.4791, pruned_loss=0.1903, over 21862.00 frames. ], tot_loss[loss=0.4117, simple_loss=0.4342, pruned_loss=0.1946, over 4290748.74 frames. ], batch size: 316, lr: 4.08e-02, grad_scale: 32.0 2023-06-17 19:40:17,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=31200.0, ans=0.125 2023-06-17 19:40:27,306 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.519e+02 4.450e+02 5.949e+02 9.427e+02 1.654e+03, threshold=1.190e+03, percent-clipped=14.0 2023-06-17 19:40:50,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=31320.0, ans=0.0 2023-06-17 19:41:05,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=31380.0, ans=0.0 2023-06-17 19:41:29,649 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.00 vs. limit=6.0 2023-06-17 19:41:44,219 INFO [train.py:996] (0/4) Epoch 1, batch 5250, loss[loss=0.3875, simple_loss=0.4114, pruned_loss=0.1818, over 21831.00 frames. ], tot_loss[loss=0.4066, simple_loss=0.4347, pruned_loss=0.1893, over 4288023.14 frames. 
], batch size: 107, lr: 4.07e-02, grad_scale: 16.0 2023-06-17 19:42:15,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=31560.0, ans=0.125 2023-06-17 19:42:36,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=31620.0, ans=0.003995652173913044 2023-06-17 19:42:38,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=31620.0, ans=0.125 2023-06-17 19:42:46,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=31680.0, ans=0.125 2023-06-17 19:42:51,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=31680.0, ans=0.0 2023-06-17 19:42:53,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=31680.0, ans=0.125 2023-06-17 19:42:55,270 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.85 vs. limit=22.5 2023-06-17 19:43:21,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=31740.0, ans=0.07 2023-06-17 19:43:25,595 INFO [train.py:996] (0/4) Epoch 1, batch 5300, loss[loss=0.3961, simple_loss=0.4177, pruned_loss=0.1872, over 21857.00 frames. ], tot_loss[loss=0.4089, simple_loss=0.4355, pruned_loss=0.1911, over 4295057.41 frames. ], batch size: 351, lr: 4.07e-02, grad_scale: 16.0 2023-06-17 19:43:51,604 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-17 19:43:53,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 4.200e+02 5.076e+02 7.002e+02 1.420e+03, threshold=1.015e+03, percent-clipped=3.0 2023-06-17 19:44:20,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=31920.0, ans=0.5 2023-06-17 19:44:39,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=32040.0, ans=0.025 2023-06-17 19:45:07,650 INFO [train.py:996] (0/4) Epoch 1, batch 5350, loss[loss=0.4638, simple_loss=0.4488, pruned_loss=0.2395, over 21847.00 frames. ], tot_loss[loss=0.4115, simple_loss=0.4353, pruned_loss=0.1939, over 4294915.01 frames. ], batch size: 298, lr: 4.06e-02, grad_scale: 16.0 2023-06-17 19:45:21,799 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.26 vs. 
limit=22.5 2023-06-17 19:45:29,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=32160.0, ans=0.0 2023-06-17 19:45:30,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=32160.0, ans=0.015 2023-06-17 19:45:32,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=32160.0, ans=0.0 2023-06-17 19:46:03,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=32220.0, ans=0.125 2023-06-17 19:46:54,988 INFO [train.py:996] (0/4) Epoch 1, batch 5400, loss[loss=0.5008, simple_loss=0.5486, pruned_loss=0.2265, over 21235.00 frames. ], tot_loss[loss=0.413, simple_loss=0.4349, pruned_loss=0.1956, over 4300780.66 frames. ], batch size: 548, lr: 4.05e-02, grad_scale: 16.0 2023-06-17 19:47:02,186 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:47:23,224 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 4.680e+02 5.760e+02 7.952e+02 1.690e+03, threshold=1.152e+03, percent-clipped=11.0 2023-06-17 19:47:26,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32460.0, ans=0.1 2023-06-17 19:48:19,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=32640.0, ans=0.0 2023-06-17 19:48:22,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=32640.0, ans=0.05 2023-06-17 19:48:22,640 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.81 vs. limit=6.0 2023-06-17 19:48:38,287 INFO [train.py:996] (0/4) Epoch 1, batch 5450, loss[loss=0.3941, simple_loss=0.4432, pruned_loss=0.1725, over 21374.00 frames. ], tot_loss[loss=0.4103, simple_loss=0.434, pruned_loss=0.1932, over 4303618.71 frames. ], batch size: 194, lr: 4.05e-02, grad_scale: 16.0 2023-06-17 19:48:44,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32700.0, ans=0.1 2023-06-17 19:48:44,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=32700.0, ans=0.125 2023-06-17 19:48:56,068 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.68 vs. limit=6.0 2023-06-17 19:49:02,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=32760.0, ans=0.2 2023-06-17 19:49:17,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.98 vs. 
limit=6.0 2023-06-17 19:49:50,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=32880.0, ans=0.125 2023-06-17 19:50:24,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=32940.0, ans=0.125 2023-06-17 19:50:26,880 INFO [train.py:996] (0/4) Epoch 1, batch 5500, loss[loss=0.4005, simple_loss=0.4737, pruned_loss=0.1637, over 21776.00 frames. ], tot_loss[loss=0.404, simple_loss=0.4332, pruned_loss=0.1874, over 4292745.78 frames. ], batch size: 351, lr: 4.04e-02, grad_scale: 16.0 2023-06-17 19:50:42,521 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:50:49,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.509e+02 4.085e+02 5.638e+02 7.299e+02 1.416e+03, threshold=1.128e+03, percent-clipped=6.0 2023-06-17 19:51:02,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=33060.0, ans=0.0 2023-06-17 19:51:21,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=33120.0, ans=0.125 2023-06-17 19:51:40,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33180.0, ans=0.1 2023-06-17 19:52:13,376 INFO [train.py:996] (0/4) Epoch 1, batch 5550, loss[loss=0.3162, simple_loss=0.386, pruned_loss=0.1232, over 21795.00 frames. ], tot_loss[loss=0.4017, simple_loss=0.4344, pruned_loss=0.1845, over 4287660.98 frames. ], batch size: 371, lr: 4.03e-02, grad_scale: 16.0 2023-06-17 19:52:13,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=33300.0, ans=0.0 2023-06-17 19:52:17,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=33300.0, ans=0.003630434782608696 2023-06-17 19:52:51,892 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:53:19,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=33480.0, ans=0.0035913043478260865 2023-06-17 19:53:21,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=33480.0, ans=0.0 2023-06-17 19:53:26,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=33480.0, ans=0.125 2023-06-17 19:53:56,959 INFO [train.py:996] (0/4) Epoch 1, batch 5600, loss[loss=0.4269, simple_loss=0.4789, pruned_loss=0.1874, over 21836.00 frames. ], tot_loss[loss=0.3957, simple_loss=0.4322, pruned_loss=0.1796, over 4285925.73 frames. 
], batch size: 371, lr: 4.03e-02, grad_scale: 32.0 2023-06-17 19:54:29,863 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 4.088e+02 5.346e+02 7.510e+02 1.919e+03, threshold=1.069e+03, percent-clipped=8.0 2023-06-17 19:54:35,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=33660.0, ans=0.0 2023-06-17 19:54:36,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=33660.0, ans=0.2 2023-06-17 19:55:32,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33840.0, ans=0.1 2023-06-17 19:55:38,033 INFO [train.py:996] (0/4) Epoch 1, batch 5650, loss[loss=0.4036, simple_loss=0.4251, pruned_loss=0.1911, over 20956.00 frames. ], tot_loss[loss=0.4003, simple_loss=0.4347, pruned_loss=0.1829, over 4292128.50 frames. ], batch size: 607, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 19:55:43,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=33900.0, ans=0.125 2023-06-17 19:55:48,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=33900.0, ans=0.125 2023-06-17 19:57:11,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=34140.0, ans=0.125 2023-06-17 19:57:27,472 INFO [train.py:996] (0/4) Epoch 1, batch 5700, loss[loss=0.3593, simple_loss=0.3984, pruned_loss=0.1601, over 21711.00 frames. ], tot_loss[loss=0.4017, simple_loss=0.4344, pruned_loss=0.1845, over 4291508.81 frames. ], batch size: 247, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 19:57:58,609 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0 2023-06-17 19:58:00,896 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 4.144e+02 5.223e+02 7.602e+02 1.708e+03, threshold=1.045e+03, percent-clipped=9.0 2023-06-17 19:58:01,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=34260.0, ans=0.125 2023-06-17 19:58:32,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=34380.0, ans=0.125 2023-06-17 19:58:52,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=34440.0, ans=0.125 2023-06-17 19:59:11,785 INFO [train.py:996] (0/4) Epoch 1, batch 5750, loss[loss=0.3421, simple_loss=0.388, pruned_loss=0.1481, over 21173.00 frames. ], tot_loss[loss=0.3933, simple_loss=0.4282, pruned_loss=0.1792, over 4286850.81 frames. 
], batch size: 159, lr: 4.01e-02, grad_scale: 32.0 2023-06-17 19:59:29,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=34500.0, ans=0.0 2023-06-17 20:00:28,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=34740.0, ans=0.0 2023-06-17 20:00:37,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=34740.0, ans=15.0 2023-06-17 20:00:44,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=34740.0, ans=0.0 2023-06-17 20:00:44,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=34740.0, ans=0.2 2023-06-17 20:00:59,967 INFO [train.py:996] (0/4) Epoch 1, batch 5800, loss[loss=0.3495, simple_loss=0.4166, pruned_loss=0.1413, over 21792.00 frames. ], tot_loss[loss=0.384, simple_loss=0.422, pruned_loss=0.173, over 4289478.37 frames. ], batch size: 282, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 20:01:28,188 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 3.873e+02 4.586e+02 6.036e+02 1.114e+03, threshold=9.172e+02, percent-clipped=1.0 2023-06-17 20:01:31,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=15.0 2023-06-17 20:01:40,642 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=12.0 2023-06-17 20:02:00,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=34980.0, ans=0.125 2023-06-17 20:02:05,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=34980.0, ans=0.2 2023-06-17 20:02:39,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.84 vs. limit=22.5 2023-06-17 20:02:43,519 INFO [train.py:996] (0/4) Epoch 1, batch 5850, loss[loss=0.4401, simple_loss=0.4941, pruned_loss=0.193, over 21188.00 frames. ], tot_loss[loss=0.3713, simple_loss=0.4151, pruned_loss=0.1638, over 4286464.92 frames. ], batch size: 548, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 20:03:02,990 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.44 vs. limit=15.0 2023-06-17 20:03:11,038 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:04:02,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=35280.0, ans=0.0 2023-06-17 20:04:20,757 INFO [train.py:996] (0/4) Epoch 1, batch 5900, loss[loss=0.2434, simple_loss=0.3181, pruned_loss=0.08438, over 21284.00 frames. ], tot_loss[loss=0.352, simple_loss=0.4012, pruned_loss=0.1514, over 4279360.81 frames. 
], batch size: 176, lr: 3.99e-02, grad_scale: 32.0 2023-06-17 20:04:21,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=35400.0, ans=0.025 2023-06-17 20:04:37,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=35400.0, ans=0.125 2023-06-17 20:04:45,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=35460.0, ans=0.125 2023-06-17 20:04:46,512 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-17 20:04:47,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=35460.0, ans=0.2 2023-06-17 20:04:48,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 3.363e+02 4.037e+02 5.226e+02 1.298e+03, threshold=8.074e+02, percent-clipped=7.0 2023-06-17 20:04:57,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=35520.0, ans=0.125 2023-06-17 20:05:18,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=35580.0, ans=0.0 2023-06-17 20:05:49,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=35640.0, ans=0.0 2023-06-17 20:05:51,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=35640.0, ans=0.125 2023-06-17 20:05:51,659 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-17 20:06:08,251 INFO [train.py:996] (0/4) Epoch 1, batch 5950, loss[loss=0.4601, simple_loss=0.4608, pruned_loss=0.2297, over 15571.00 frames. ], tot_loss[loss=0.3658, simple_loss=0.4065, pruned_loss=0.1625, over 4283662.99 frames. ], batch size: 61, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 20:06:10,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35700.0, ans=0.1 2023-06-17 20:06:14,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=35700.0, ans=0.2 2023-06-17 20:06:39,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=35820.0, ans=0.1 2023-06-17 20:07:37,794 INFO [train.py:996] (0/4) Epoch 1, batch 6000, loss[loss=0.3495, simple_loss=0.364, pruned_loss=0.1675, over 21411.00 frames. ], tot_loss[loss=0.3732, simple_loss=0.4066, pruned_loss=0.1699, over 4287140.45 frames. ], batch size: 195, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 20:07:37,796 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-17 20:07:46,506 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.3331, 1.7269, 1.9103, 2.0549, 1.5139, 1.7480, 1.9895, 1.9698], device='cuda:0') 2023-06-17 20:07:56,504 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3636, simple_loss=0.4388, pruned_loss=0.1442, over 1796401.00 frames. 
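Note on the validation entry just above: it reports loss, simple_loss and pruned_loss separately, and for this entry the total is consistent with a weighted sum of the two parts (0.5 * 0.4388 + 0.1442 = 0.3636). Below is a minimal sketch of such a combination; the function name combine_transducer_losses and the fixed weights are illustrative assumptions for reading the log fields, not code taken from train.py.

import torch

def combine_transducer_losses(simple_loss: torch.Tensor,
                              pruned_loss: torch.Tensor,
                              simple_scale: float = 0.5,
                              pruned_scale: float = 1.0) -> torch.Tensor:
    # Weighted sum of the "simple" (linear) and "pruned" (full) transducer terms,
    # mirroring how the logged loss relates to simple_loss and pruned_loss here.
    return simple_scale * simple_loss + pruned_scale * pruned_loss

if __name__ == "__main__":
    # Values taken from the validation entry above.
    simple = torch.tensor(0.4388)
    pruned = torch.tensor(0.1442)
    print(combine_transducer_losses(simple, pruned))  # tensor(0.3636)

The per-batch training entries (loss / simple_loss / pruned_loss over N frames) can be read the same way, though the exact scales used during training may differ from the constants assumed in this sketch.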
2023-06-17 20:07:56,505 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-17 20:08:11,189 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=15.0 2023-06-17 20:08:19,456 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.037e+02 4.782e+02 6.358e+02 7.928e+02 1.970e+03, threshold=1.272e+03, percent-clipped=23.0 2023-06-17 20:08:46,042 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.10 vs. limit=15.0 2023-06-17 20:08:51,570 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:08:53,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=36120.0, ans=0.125 2023-06-17 20:09:06,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36180.0, ans=0.1 2023-06-17 20:09:34,154 INFO [train.py:996] (0/4) Epoch 1, batch 6050, loss[loss=0.3202, simple_loss=0.3511, pruned_loss=0.1447, over 21597.00 frames. ], tot_loss[loss=0.3745, simple_loss=0.4049, pruned_loss=0.172, over 4279450.26 frames. ], batch size: 263, lr: 3.97e-02, grad_scale: 32.0 2023-06-17 20:09:41,870 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=22.5 2023-06-17 20:09:49,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-17 20:11:00,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=36540.0, ans=0.125 2023-06-17 20:11:15,940 INFO [train.py:996] (0/4) Epoch 1, batch 6100, loss[loss=0.3734, simple_loss=0.4054, pruned_loss=0.1707, over 21818.00 frames. ], tot_loss[loss=0.3754, simple_loss=0.4058, pruned_loss=0.1725, over 4272695.88 frames. ], batch size: 282, lr: 3.96e-02, grad_scale: 32.0 2023-06-17 20:11:24,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=36600.0, ans=0.0 2023-06-17 20:11:32,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=36660.0, ans=0.0029 2023-06-17 20:11:38,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 4.105e+02 5.881e+02 8.261e+02 1.678e+03, threshold=1.176e+03, percent-clipped=6.0 2023-06-17 20:11:54,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=36660.0, ans=0.5 2023-06-17 20:12:09,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=12.0 2023-06-17 20:12:39,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=36840.0, ans=0.125 2023-06-17 20:12:57,175 INFO [train.py:996] (0/4) Epoch 1, batch 6150, loss[loss=0.3835, simple_loss=0.4108, pruned_loss=0.1781, over 21713.00 frames. ], tot_loss[loss=0.3799, simple_loss=0.4073, pruned_loss=0.1762, over 4272655.44 frames. 
], batch size: 298, lr: 3.96e-02, grad_scale: 32.0 2023-06-17 20:13:03,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=36900.0, ans=0.125 2023-06-17 20:13:21,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=36960.0, ans=0.2 2023-06-17 20:13:39,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=20.09 vs. limit=22.5 2023-06-17 20:14:07,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=15.0 2023-06-17 20:14:20,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37080.0, ans=0.1 2023-06-17 20:14:29,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=37140.0, ans=0.2 2023-06-17 20:14:39,010 INFO [train.py:996] (0/4) Epoch 1, batch 6200, loss[loss=0.3677, simple_loss=0.4084, pruned_loss=0.1635, over 21392.00 frames. ], tot_loss[loss=0.3811, simple_loss=0.4097, pruned_loss=0.1762, over 4273782.57 frames. ], batch size: 211, lr: 3.95e-02, grad_scale: 32.0 2023-06-17 20:14:40,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-06-17 20:15:07,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.629e+02 4.899e+02 6.626e+02 1.862e+03, threshold=9.798e+02, percent-clipped=4.0 2023-06-17 20:15:26,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=37320.0, ans=0.025 2023-06-17 20:16:02,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=37380.0, ans=0.2 2023-06-17 20:16:06,680 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.86 vs. limit=10.0 2023-06-17 20:16:22,550 INFO [train.py:996] (0/4) Epoch 1, batch 6250, loss[loss=0.4869, simple_loss=0.5266, pruned_loss=0.2236, over 19782.00 frames. ], tot_loss[loss=0.3834, simple_loss=0.4146, pruned_loss=0.1761, over 4269154.85 frames. ], batch size: 703, lr: 3.94e-02, grad_scale: 32.0 2023-06-17 20:17:11,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=37620.0, ans=0.125 2023-06-17 20:17:44,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=37680.0, ans=0.125 2023-06-17 20:17:56,782 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-06-17 20:18:02,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=21.09 vs. limit=15.0 2023-06-17 20:18:03,417 INFO [train.py:996] (0/4) Epoch 1, batch 6300, loss[loss=0.3651, simple_loss=0.4109, pruned_loss=0.1597, over 17000.00 frames. ], tot_loss[loss=0.3848, simple_loss=0.4193, pruned_loss=0.1752, over 4270239.29 frames. 
], batch size: 60, lr: 3.94e-02, grad_scale: 32.0 2023-06-17 20:18:30,920 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=15.0 2023-06-17 20:18:41,524 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 4.261e+02 6.027e+02 8.452e+02 1.541e+03, threshold=1.205e+03, percent-clipped=13.0 2023-06-17 20:18:47,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=37860.0, ans=0.0026391304347826083 2023-06-17 20:18:47,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.76 vs. limit=10.0 2023-06-17 20:19:46,084 INFO [train.py:996] (0/4) Epoch 1, batch 6350, loss[loss=0.4246, simple_loss=0.444, pruned_loss=0.2025, over 21756.00 frames. ], tot_loss[loss=0.3974, simple_loss=0.4275, pruned_loss=0.1837, over 4278556.79 frames. ], batch size: 298, lr: 3.93e-02, grad_scale: 32.0 2023-06-17 20:19:49,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-17 20:21:22,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=38340.0, ans=0.025 2023-06-17 20:21:46,049 INFO [train.py:996] (0/4) Epoch 1, batch 6400, loss[loss=0.4337, simple_loss=0.4532, pruned_loss=0.2071, over 21690.00 frames. ], tot_loss[loss=0.4092, simple_loss=0.4357, pruned_loss=0.1914, over 4285604.78 frames. ], batch size: 351, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 20:22:15,057 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.706e+02 4.358e+02 5.224e+02 7.258e+02 1.926e+03, threshold=1.045e+03, percent-clipped=7.0 2023-06-17 20:22:32,602 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=15.0 2023-06-17 20:22:42,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=38580.0, ans=0.125 2023-06-17 20:22:56,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=38640.0, ans=0.125 2023-06-17 20:22:57,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=15.0 2023-06-17 20:23:27,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=38640.0, ans=0.0 2023-06-17 20:23:29,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=38700.0, ans=0.0024565217391304354 2023-06-17 20:23:30,214 INFO [train.py:996] (0/4) Epoch 1, batch 6450, loss[loss=0.3407, simple_loss=0.3733, pruned_loss=0.1541, over 21378.00 frames. ], tot_loss[loss=0.4077, simple_loss=0.4365, pruned_loss=0.1895, over 4284327.87 frames. 
], batch size: 131, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 20:24:25,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=38880.0, ans=0.125 2023-06-17 20:24:29,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=38880.0, ans=15.0 2023-06-17 20:25:09,308 INFO [train.py:996] (0/4) Epoch 1, batch 6500, loss[loss=0.3587, simple_loss=0.3824, pruned_loss=0.1675, over 21509.00 frames. ], tot_loss[loss=0.4002, simple_loss=0.4265, pruned_loss=0.1869, over 4279750.76 frames. ], batch size: 230, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 20:25:30,885 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-17 20:25:37,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.797e+02 4.922e+02 6.987e+02 1.536e+03, threshold=9.843e+02, percent-clipped=9.0 2023-06-17 20:25:45,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=39060.0, ans=0.1 2023-06-17 20:26:09,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=39180.0, ans=0.125 2023-06-17 20:26:40,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39240.0, ans=0.1 2023-06-17 20:26:49,861 INFO [train.py:996] (0/4) Epoch 1, batch 6550, loss[loss=0.4111, simple_loss=0.4209, pruned_loss=0.2006, over 21902.00 frames. ], tot_loss[loss=0.3968, simple_loss=0.4246, pruned_loss=0.1845, over 4279441.07 frames. ], batch size: 316, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 20:26:50,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=39300.0, ans=0.0 2023-06-17 20:27:02,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=39300.0, ans=0.0 2023-06-17 20:27:11,269 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-17 20:27:17,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=39360.0, ans=0.125 2023-06-17 20:27:37,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=39420.0, ans=0.125 2023-06-17 20:28:10,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=39540.0, ans=0.09899494936611666 2023-06-17 20:28:28,295 INFO [train.py:996] (0/4) Epoch 1, batch 6600, loss[loss=0.3604, simple_loss=0.3749, pruned_loss=0.1729, over 21444.00 frames. ], tot_loss[loss=0.3929, simple_loss=0.4195, pruned_loss=0.1831, over 4271846.07 frames. ], batch size: 212, lr: 3.90e-02, grad_scale: 16.0 2023-06-17 20:28:36,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.02 vs. 
limit=15.0 2023-06-17 20:28:57,773 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 4.216e+02 5.009e+02 6.284e+02 1.954e+03, threshold=1.002e+03, percent-clipped=7.0 2023-06-17 20:29:09,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=39720.0, ans=0.1 2023-06-17 20:29:48,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=39780.0, ans=0.125 2023-06-17 20:30:07,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=39840.0, ans=0.125 2023-06-17 20:30:17,386 INFO [train.py:996] (0/4) Epoch 1, batch 6650, loss[loss=0.4316, simple_loss=0.4196, pruned_loss=0.2219, over 21396.00 frames. ], tot_loss[loss=0.3839, simple_loss=0.4111, pruned_loss=0.1784, over 4275522.93 frames. ], batch size: 509, lr: 3.89e-02, grad_scale: 16.0 2023-06-17 20:30:46,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=39960.0, ans=0.002182608695652174 2023-06-17 20:31:10,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.57 vs. limit=6.0 2023-06-17 20:31:26,238 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.15 vs. limit=22.5 2023-06-17 20:31:26,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=40080.0, ans=0.015 2023-06-17 20:31:35,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-17 20:31:38,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=40140.0, ans=0.125 2023-06-17 20:31:54,709 INFO [train.py:996] (0/4) Epoch 1, batch 6700, loss[loss=0.362, simple_loss=0.3787, pruned_loss=0.1727, over 21876.00 frames. ], tot_loss[loss=0.3787, simple_loss=0.4041, pruned_loss=0.1767, over 4277167.99 frames. ], batch size: 107, lr: 3.89e-02, grad_scale: 16.0 2023-06-17 20:32:24,893 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.374e+02 3.778e+02 4.910e+02 6.670e+02 1.888e+03, threshold=9.820e+02, percent-clipped=8.0 2023-06-17 20:32:57,884 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:33:38,603 INFO [train.py:996] (0/4) Epoch 1, batch 6750, loss[loss=0.4116, simple_loss=0.4187, pruned_loss=0.2022, over 21814.00 frames. ], tot_loss[loss=0.3785, simple_loss=0.4024, pruned_loss=0.1773, over 4273898.99 frames. ], batch size: 282, lr: 3.88e-02, grad_scale: 16.0 2023-06-17 20:34:07,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=40560.0, ans=0.0 2023-06-17 20:34:11,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=40560.0, ans=0.04949747468305833 2023-06-17 20:34:54,256 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=12.0 2023-06-17 20:35:21,771 INFO [train.py:996] (0/4) Epoch 1, batch 6800, loss[loss=0.3687, simple_loss=0.3991, pruned_loss=0.1692, over 21876.00 frames. 
], tot_loss[loss=0.3823, simple_loss=0.4051, pruned_loss=0.1798, over 4267941.33 frames. ], batch size: 124, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 20:35:34,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.84 vs. limit=15.0 2023-06-17 20:35:50,488 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.963e+02 4.135e+02 5.210e+02 7.018e+02 1.112e+03, threshold=1.042e+03, percent-clipped=5.0 2023-06-17 20:36:42,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=40980.0, ans=10.0 2023-06-17 20:36:44,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=40980.0, ans=0.95 2023-06-17 20:36:58,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=41040.0, ans=0.0 2023-06-17 20:37:02,301 INFO [train.py:996] (0/4) Epoch 1, batch 6850, loss[loss=0.3973, simple_loss=0.3979, pruned_loss=0.1983, over 21454.00 frames. ], tot_loss[loss=0.3822, simple_loss=0.4025, pruned_loss=0.1809, over 4276146.63 frames. ], batch size: 212, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 20:37:31,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=41160.0, ans=0.0019217391304347815 2023-06-17 20:37:40,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.61 vs. limit=15.0 2023-06-17 20:38:04,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=41280.0, ans=0.125 2023-06-17 20:38:29,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=41340.0, ans=0.1 2023-06-17 20:38:34,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=41340.0, ans=0.125 2023-06-17 20:38:36,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=41340.0, ans=0.125 2023-06-17 20:38:44,865 INFO [train.py:996] (0/4) Epoch 1, batch 6900, loss[loss=0.4124, simple_loss=0.4305, pruned_loss=0.1972, over 21578.00 frames. ], tot_loss[loss=0.3822, simple_loss=0.4029, pruned_loss=0.1808, over 4281110.90 frames. ], batch size: 548, lr: 3.86e-02, grad_scale: 32.0 2023-06-17 20:39:15,045 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 4.043e+02 5.136e+02 6.723e+02 1.147e+03, threshold=1.027e+03, percent-clipped=4.0 2023-06-17 20:39:27,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-17 20:39:37,245 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. 
limit=15.0 2023-06-17 20:40:01,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=41580.0, ans=0.0018304347826086954 2023-06-17 20:40:16,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=41640.0, ans=0.125 2023-06-17 20:40:33,710 INFO [train.py:996] (0/4) Epoch 1, batch 6950, loss[loss=0.381, simple_loss=0.4118, pruned_loss=0.1752, over 21634.00 frames. ], tot_loss[loss=0.3797, simple_loss=0.4062, pruned_loss=0.1766, over 4287615.93 frames. ], batch size: 263, lr: 3.85e-02, grad_scale: 32.0 2023-06-17 20:40:35,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=41700.0, ans=0.125 2023-06-17 20:40:37,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-17 20:41:38,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=41880.0, ans=0.125 2023-06-17 20:41:38,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=41880.0, ans=0.0 2023-06-17 20:41:58,138 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.46 vs. limit=22.5 2023-06-17 20:42:09,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=41940.0, ans=0.125 2023-06-17 20:42:15,404 INFO [train.py:996] (0/4) Epoch 1, batch 7000, loss[loss=0.4037, simple_loss=0.4113, pruned_loss=0.198, over 21866.00 frames. ], tot_loss[loss=0.3877, simple_loss=0.4111, pruned_loss=0.1822, over 4288065.42 frames. ], batch size: 98, lr: 3.85e-02, grad_scale: 32.0 2023-06-17 20:42:40,008 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.856e+02 5.678e+02 7.793e+02 1.284e+03, threshold=1.136e+03, percent-clipped=9.0 2023-06-17 20:43:19,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=42180.0, ans=0.2 2023-06-17 20:43:35,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=42240.0, ans=0.09899494936611666 2023-06-17 20:43:58,217 INFO [train.py:996] (0/4) Epoch 1, batch 7050, loss[loss=0.3098, simple_loss=0.3563, pruned_loss=0.1316, over 21367.00 frames. ], tot_loss[loss=0.3821, simple_loss=0.4068, pruned_loss=0.1787, over 4282680.31 frames. ], batch size: 176, lr: 3.84e-02, grad_scale: 32.0 2023-06-17 20:44:18,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=42360.0, ans=0.125 2023-06-17 20:44:31,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=42360.0, ans=0.125 2023-06-17 20:44:49,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=42420.0, ans=0.125 2023-06-17 20:45:02,895 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.86 vs. 
limit=15.0 2023-06-17 20:45:04,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.11 vs. limit=15.0 2023-06-17 20:45:08,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=42480.0, ans=0.125 2023-06-17 20:45:20,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=42480.0, ans=0.125 2023-06-17 20:45:41,238 INFO [train.py:996] (0/4) Epoch 1, batch 7100, loss[loss=0.3936, simple_loss=0.4325, pruned_loss=0.1774, over 21815.00 frames. ], tot_loss[loss=0.3898, simple_loss=0.4147, pruned_loss=0.1824, over 4283429.48 frames. ], batch size: 371, lr: 3.83e-02, grad_scale: 16.0 2023-06-17 20:45:43,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42600.0, ans=0.1 2023-06-17 20:46:23,394 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.465e+02 3.499e+02 4.765e+02 6.343e+02 1.936e+03, threshold=9.530e+02, percent-clipped=5.0 2023-06-17 20:46:38,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=42720.0, ans=0.0 2023-06-17 20:46:50,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=42780.0, ans=0.125 2023-06-17 20:46:52,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.52 vs. limit=22.5 2023-06-17 20:46:55,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=42780.0, ans=0.2 2023-06-17 20:47:23,762 INFO [train.py:996] (0/4) Epoch 1, batch 7150, loss[loss=0.3951, simple_loss=0.4239, pruned_loss=0.1832, over 21636.00 frames. ], tot_loss[loss=0.3781, simple_loss=0.4074, pruned_loss=0.1744, over 4276460.17 frames. ], batch size: 263, lr: 3.83e-02, grad_scale: 16.0 2023-06-17 20:48:39,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=43080.0, ans=0.0 2023-06-17 20:49:07,065 INFO [train.py:996] (0/4) Epoch 1, batch 7200, loss[loss=0.3685, simple_loss=0.3847, pruned_loss=0.1761, over 21624.00 frames. ], tot_loss[loss=0.3822, simple_loss=0.41, pruned_loss=0.1772, over 4268072.41 frames. ], batch size: 298, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 20:49:47,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.06 vs. limit=15.0 2023-06-17 20:49:48,763 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.531e+02 4.300e+02 5.251e+02 6.410e+02 9.416e+02, threshold=1.050e+03, percent-clipped=0.0 2023-06-17 20:50:10,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-17 20:50:33,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=43440.0, ans=0.125 2023-06-17 20:50:34,543 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.82 vs. 
limit=15.0 2023-06-17 20:50:49,422 INFO [train.py:996] (0/4) Epoch 1, batch 7250, loss[loss=0.4286, simple_loss=0.4141, pruned_loss=0.2215, over 21418.00 frames. ], tot_loss[loss=0.3808, simple_loss=0.4054, pruned_loss=0.1781, over 4266515.60 frames. ], batch size: 475, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 20:50:59,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=43500.0, ans=0.125 2023-06-17 20:51:59,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.05 vs. limit=10.0 2023-06-17 20:52:06,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43680.0, ans=0.1 2023-06-17 20:52:09,304 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=15.0 2023-06-17 20:52:31,527 INFO [train.py:996] (0/4) Epoch 1, batch 7300, loss[loss=0.3044, simple_loss=0.3327, pruned_loss=0.138, over 21355.00 frames. ], tot_loss[loss=0.3746, simple_loss=0.3974, pruned_loss=0.1759, over 4266061.15 frames. ], batch size: 144, lr: 3.81e-02, grad_scale: 32.0 2023-06-17 20:52:50,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=43800.0, ans=0.125 2023-06-17 20:53:07,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.808e+02 5.144e+02 6.713e+02 1.157e+03, threshold=1.029e+03, percent-clipped=4.0 2023-06-17 20:53:08,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=43860.0, ans=0.2 2023-06-17 20:53:08,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=43860.0, ans=10.0 2023-06-17 20:54:25,394 INFO [train.py:996] (0/4) Epoch 1, batch 7350, loss[loss=0.4511, simple_loss=0.4579, pruned_loss=0.2221, over 21299.00 frames. ], tot_loss[loss=0.3744, simple_loss=0.395, pruned_loss=0.1769, over 4266231.26 frames. ], batch size: 143, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 20:54:28,091 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-17 20:55:52,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=44340.0, ans=0.1 2023-06-17 20:56:02,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=44340.0, ans=0.0012304347826086956 2023-06-17 20:56:05,310 INFO [train.py:996] (0/4) Epoch 1, batch 7400, loss[loss=0.3437, simple_loss=0.3895, pruned_loss=0.1489, over 21615.00 frames. ], tot_loss[loss=0.3843, simple_loss=0.4063, pruned_loss=0.1812, over 4263036.32 frames. ], batch size: 247, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 20:56:36,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.553e+02 4.273e+02 5.813e+02 7.639e+02 1.411e+03, threshold=1.163e+03, percent-clipped=7.0 2023-06-17 20:56:55,881 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.84 vs. 
limit=15.0 2023-06-17 20:57:29,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=12.0 2023-06-17 20:57:42,572 INFO [train.py:996] (0/4) Epoch 1, batch 7450, loss[loss=0.4306, simple_loss=0.4298, pruned_loss=0.2157, over 15810.00 frames. ], tot_loss[loss=0.3806, simple_loss=0.4027, pruned_loss=0.1793, over 4262036.59 frames. ], batch size: 64, lr: 3.79e-02, grad_scale: 32.0 2023-06-17 20:57:58,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44700.0, ans=0.1 2023-06-17 20:58:08,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=44760.0, ans=0.1 2023-06-17 20:58:11,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=44760.0, ans=0.125 2023-06-17 20:58:29,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44820.0, ans=0.1 2023-06-17 20:58:54,116 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.74 vs. limit=6.0 2023-06-17 20:59:34,491 INFO [train.py:996] (0/4) Epoch 1, batch 7500, loss[loss=0.3847, simple_loss=0.4389, pruned_loss=0.1652, over 21504.00 frames. ], tot_loss[loss=0.3855, simple_loss=0.408, pruned_loss=0.1815, over 4266739.91 frames. ], batch size: 194, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 20:59:51,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=45060.0, ans=0.125 2023-06-17 21:00:01,018 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.011e+02 4.419e+02 5.234e+02 7.057e+02 1.215e+03, threshold=1.047e+03, percent-clipped=2.0 2023-06-17 21:00:08,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=45120.0, ans=0.2 2023-06-17 21:01:13,070 INFO [train.py:996] (0/4) Epoch 1, batch 7550, loss[loss=0.3331, simple_loss=0.3772, pruned_loss=0.1444, over 21902.00 frames. ], tot_loss[loss=0.3843, simple_loss=0.4137, pruned_loss=0.1774, over 4261275.18 frames. ], batch size: 98, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 21:01:29,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.66 vs. limit=6.0 2023-06-17 21:01:35,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=45360.0, ans=0.0 2023-06-17 21:01:49,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=45420.0, ans=0.125 2023-06-17 21:02:10,957 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-17 21:02:35,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=45540.0, ans=0.125 2023-06-17 21:02:56,481 INFO [train.py:996] (0/4) Epoch 1, batch 7600, loss[loss=0.3692, simple_loss=0.3899, pruned_loss=0.1743, over 21743.00 frames. 
], tot_loss[loss=0.383, simple_loss=0.4136, pruned_loss=0.1762, over 4266895.55 frames. ], batch size: 230, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 21:03:22,200 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.546e+02 4.998e+02 6.623e+02 1.459e+03, threshold=9.996e+02, percent-clipped=5.0 2023-06-17 21:04:38,391 INFO [train.py:996] (0/4) Epoch 1, batch 7650, loss[loss=0.4453, simple_loss=0.4645, pruned_loss=0.213, over 21866.00 frames. ], tot_loss[loss=0.3872, simple_loss=0.4136, pruned_loss=0.1804, over 4276655.04 frames. ], batch size: 107, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 21:04:42,965 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.34 vs. limit=10.0 2023-06-17 21:04:44,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=45900.0, ans=0.1 2023-06-17 21:05:02,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=45960.0, ans=0.125 2023-06-17 21:05:02,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=45960.0, ans=0.125 2023-06-17 21:05:31,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=46020.0, ans=0.125 2023-06-17 21:05:43,280 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:06:24,119 INFO [train.py:996] (0/4) Epoch 1, batch 7700, loss[loss=0.3893, simple_loss=0.4257, pruned_loss=0.1764, over 21641.00 frames. ], tot_loss[loss=0.3939, simple_loss=0.4193, pruned_loss=0.1843, over 4280969.59 frames. ], batch size: 263, lr: 3.76e-02, grad_scale: 32.0 2023-06-17 21:06:32,429 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.43 vs. limit=22.5 2023-06-17 21:06:50,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.909e+02 4.180e+02 5.539e+02 6.663e+02 1.200e+03, threshold=1.108e+03, percent-clipped=4.0 2023-06-17 21:08:08,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=46500.0, ans=0.2 2023-06-17 21:08:09,053 INFO [train.py:996] (0/4) Epoch 1, batch 7750, loss[loss=0.4212, simple_loss=0.4659, pruned_loss=0.1883, over 21624.00 frames. ], tot_loss[loss=0.3999, simple_loss=0.4275, pruned_loss=0.1861, over 4278622.59 frames. ], batch size: 230, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 21:08:09,646 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:08:26,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=46560.0, ans=0.125 2023-06-17 21:08:29,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=46560.0, ans=0.125 2023-06-17 21:09:52,881 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.70 vs. limit=22.5 2023-06-17 21:09:53,168 INFO [train.py:996] (0/4) Epoch 1, batch 7800, loss[loss=0.3413, simple_loss=0.3813, pruned_loss=0.1507, over 21677.00 frames. 
], tot_loss[loss=0.3975, simple_loss=0.426, pruned_loss=0.1845, over 4276620.41 frames. ], batch size: 263, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 21:10:06,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=46800.0, ans=0.125 2023-06-17 21:10:08,925 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-06-17 21:10:28,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=46860.0, ans=0.2 2023-06-17 21:10:29,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.732e+02 4.420e+02 5.608e+02 7.244e+02 1.529e+03, threshold=1.122e+03, percent-clipped=4.0 2023-06-17 21:10:59,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=46920.0, ans=0.125 2023-06-17 21:11:35,314 INFO [train.py:996] (0/4) Epoch 1, batch 7850, loss[loss=0.3664, simple_loss=0.3816, pruned_loss=0.1757, over 21671.00 frames. ], tot_loss[loss=0.3879, simple_loss=0.4148, pruned_loss=0.1805, over 4266425.86 frames. ], batch size: 283, lr: 3.74e-02, grad_scale: 32.0 2023-06-17 21:11:53,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=47160.0, ans=0.125 2023-06-17 21:13:19,408 INFO [train.py:996] (0/4) Epoch 1, batch 7900, loss[loss=0.3173, simple_loss=0.361, pruned_loss=0.1368, over 21612.00 frames. ], tot_loss[loss=0.3831, simple_loss=0.409, pruned_loss=0.1785, over 4273409.60 frames. ], batch size: 230, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 21:13:30,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=47400.0, ans=0.125 2023-06-17 21:13:56,676 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.667e+02 5.315e+02 6.476e+02 7.892e+02 1.492e+03, threshold=1.295e+03, percent-clipped=7.0 2023-06-17 21:14:26,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.88 vs. limit=6.0 2023-06-17 21:14:26,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.05 vs. limit=15.0 2023-06-17 21:14:58,006 INFO [train.py:996] (0/4) Epoch 1, batch 7950, loss[loss=0.4129, simple_loss=0.4374, pruned_loss=0.1942, over 21245.00 frames. ], tot_loss[loss=0.3884, simple_loss=0.4173, pruned_loss=0.1798, over 4270241.28 frames. ], batch size: 143, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 21:15:51,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. limit=10.0 2023-06-17 21:16:17,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47880.0, ans=0.1 2023-06-17 21:16:47,545 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-8000.pt 2023-06-17 21:16:50,558 INFO [train.py:996] (0/4) Epoch 1, batch 8000, loss[loss=0.4465, simple_loss=0.4928, pruned_loss=0.2001, over 21635.00 frames. ], tot_loss[loss=0.3958, simple_loss=0.4227, pruned_loss=0.1844, over 4268168.33 frames. 
], batch size: 414, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 21:17:28,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.757e+02 4.328e+02 5.465e+02 6.460e+02 1.072e+03, threshold=1.093e+03, percent-clipped=0.0 2023-06-17 21:17:35,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0 2023-06-17 21:17:48,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=48120.0, ans=0.125 2023-06-17 21:18:43,563 INFO [train.py:996] (0/4) Epoch 1, batch 8050, loss[loss=0.473, simple_loss=0.509, pruned_loss=0.2185, over 21521.00 frames. ], tot_loss[loss=0.3948, simple_loss=0.4225, pruned_loss=0.1835, over 4261479.47 frames. ], batch size: 471, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 21:18:52,038 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:18:53,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=48300.0, ans=0.0 2023-06-17 21:19:03,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=48360.0, ans=0.0003565217391304342 2023-06-17 21:20:27,382 INFO [train.py:996] (0/4) Epoch 1, batch 8100, loss[loss=0.3713, simple_loss=0.3871, pruned_loss=0.1777, over 21579.00 frames. ], tot_loss[loss=0.3952, simple_loss=0.4235, pruned_loss=0.1834, over 4267138.15 frames. ], batch size: 195, lr: 3.71e-02, grad_scale: 32.0 2023-06-17 21:20:54,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 4.589e+02 6.477e+02 8.462e+02 1.426e+03, threshold=1.295e+03, percent-clipped=5.0 2023-06-17 21:21:28,656 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:21:46,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=48780.0, ans=0.125 2023-06-17 21:21:51,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=15.0 2023-06-17 21:22:14,038 INFO [train.py:996] (0/4) Epoch 1, batch 8150, loss[loss=0.3072, simple_loss=0.3443, pruned_loss=0.1351, over 21201.00 frames. ], tot_loss[loss=0.3948, simple_loss=0.425, pruned_loss=0.1823, over 4265223.38 frames. ], batch size: 143, lr: 3.70e-02, grad_scale: 16.0 2023-06-17 21:22:17,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=48900.0, ans=0.125 2023-06-17 21:22:21,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=48900.0, ans=0.95 2023-06-17 21:22:24,876 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.58 vs. limit=6.0 2023-06-17 21:22:34,711 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.53 vs. 
limit=15.0 2023-06-17 21:23:01,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=49020.0, ans=0.05 2023-06-17 21:23:33,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=49080.0, ans=0.125 2023-06-17 21:23:41,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.55 vs. limit=10.0 2023-06-17 21:23:58,530 INFO [train.py:996] (0/4) Epoch 1, batch 8200, loss[loss=0.3315, simple_loss=0.3548, pruned_loss=0.1541, over 21406.00 frames. ], tot_loss[loss=0.3874, simple_loss=0.418, pruned_loss=0.1784, over 4268226.85 frames. ], batch size: 212, lr: 3.70e-02, grad_scale: 16.0 2023-06-17 21:24:18,892 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:24:29,836 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0 2023-06-17 21:24:36,656 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.823e+02 4.911e+02 6.054e+02 7.943e+02 1.649e+03, threshold=1.211e+03, percent-clipped=3.0 2023-06-17 21:25:18,500 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.26 vs. limit=15.0 2023-06-17 21:25:32,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=49440.0, ans=0.125 2023-06-17 21:25:42,202 INFO [train.py:996] (0/4) Epoch 1, batch 8250, loss[loss=0.5008, simple_loss=0.5139, pruned_loss=0.2439, over 21531.00 frames. ], tot_loss[loss=0.3897, simple_loss=0.419, pruned_loss=0.1802, over 4273817.24 frames. ], batch size: 471, lr: 3.69e-02, grad_scale: 16.0 2023-06-17 21:25:43,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-17 21:26:22,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=49560.0, ans=0.0 2023-06-17 21:27:18,689 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=15.0 2023-06-17 21:27:25,195 INFO [train.py:996] (0/4) Epoch 1, batch 8300, loss[loss=0.396, simple_loss=0.4357, pruned_loss=0.1782, over 21653.00 frames. ], tot_loss[loss=0.384, simple_loss=0.4163, pruned_loss=0.1758, over 4273020.82 frames. ], batch size: 414, lr: 3.68e-02, grad_scale: 16.0 2023-06-17 21:27:28,440 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.31 vs. limit=6.0 2023-06-17 21:27:39,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.83 vs. 
limit=15.0 2023-06-17 21:28:03,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 3.951e+02 4.948e+02 6.196e+02 1.080e+03, threshold=9.896e+02, percent-clipped=0.0 2023-06-17 21:28:30,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=49980.0, ans=0.2 2023-06-17 21:28:33,853 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=15.0 2023-06-17 21:28:54,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=50040.0, ans=0.125 2023-06-17 21:29:02,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=50100.0, ans=0.0 2023-06-17 21:29:09,844 INFO [train.py:996] (0/4) Epoch 1, batch 8350, loss[loss=0.4556, simple_loss=0.4664, pruned_loss=0.2224, over 20660.00 frames. ], tot_loss[loss=0.3792, simple_loss=0.4136, pruned_loss=0.1724, over 4266727.01 frames. ], batch size: 607, lr: 3.68e-02, grad_scale: 16.0 2023-06-17 21:29:30,241 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.636e-02 2023-06-17 21:30:19,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=50280.0, ans=0.2 2023-06-17 21:30:35,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=50340.0, ans=0.125 2023-06-17 21:30:43,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=50340.0, ans=0.2 2023-06-17 21:30:43,836 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.99 vs. limit=15.0 2023-06-17 21:30:53,935 INFO [train.py:996] (0/4) Epoch 1, batch 8400, loss[loss=0.3347, simple_loss=0.3941, pruned_loss=0.1376, over 21728.00 frames. ], tot_loss[loss=0.3709, simple_loss=0.4082, pruned_loss=0.1668, over 4265322.32 frames. ], batch size: 351, lr: 3.67e-02, grad_scale: 32.0 2023-06-17 21:30:54,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=50400.0, ans=0.0 2023-06-17 21:31:03,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=50400.0, ans=0.125 2023-06-17 21:31:06,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=50400.0, ans=0.125 2023-06-17 21:31:32,778 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.584e+02 3.530e+02 4.836e+02 6.875e+02 1.901e+03, threshold=9.672e+02, percent-clipped=8.0 2023-06-17 21:32:41,514 INFO [train.py:996] (0/4) Epoch 1, batch 8450, loss[loss=0.379, simple_loss=0.402, pruned_loss=0.178, over 21698.00 frames. ], tot_loss[loss=0.3739, simple_loss=0.4093, pruned_loss=0.1692, over 4274139.62 frames. 
], batch size: 263, lr: 3.67e-02, grad_scale: 16.0 2023-06-17 21:33:19,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=50820.0, ans=0.125 2023-06-17 21:33:45,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=50880.0, ans=0.5 2023-06-17 21:34:13,633 INFO [train.py:996] (0/4) Epoch 1, batch 8500, loss[loss=0.3379, simple_loss=0.3608, pruned_loss=0.1575, over 21264.00 frames. ], tot_loss[loss=0.3749, simple_loss=0.4064, pruned_loss=0.1717, over 4280167.84 frames. ], batch size: 548, lr: 3.66e-02, grad_scale: 16.0 2023-06-17 21:34:51,212 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-17 21:34:52,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=22.5 2023-06-17 21:35:00,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.023e+02 4.457e+02 5.550e+02 7.009e+02 1.801e+03, threshold=1.110e+03, percent-clipped=10.0 2023-06-17 21:35:59,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=51240.0, ans=0.125 2023-06-17 21:36:04,112 INFO [train.py:996] (0/4) Epoch 1, batch 8550, loss[loss=0.3433, simple_loss=0.3607, pruned_loss=0.163, over 21496.00 frames. ], tot_loss[loss=0.3845, simple_loss=0.4145, pruned_loss=0.1773, over 4268776.17 frames. ], batch size: 195, lr: 3.65e-02, grad_scale: 16.0 2023-06-17 21:37:56,919 INFO [train.py:996] (0/4) Epoch 1, batch 8600, loss[loss=0.4511, simple_loss=0.4887, pruned_loss=0.2067, over 20980.00 frames. ], tot_loss[loss=0.3936, simple_loss=0.4251, pruned_loss=0.1811, over 4267879.14 frames. ], batch size: 607, lr: 3.65e-02, grad_scale: 16.0 2023-06-17 21:38:24,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=51660.0, ans=15.0 2023-06-17 21:38:38,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.423e+02 4.881e+02 5.851e+02 7.697e+02 1.206e+03, threshold=1.170e+03, percent-clipped=2.0 2023-06-17 21:39:47,596 INFO [train.py:996] (0/4) Epoch 1, batch 8650, loss[loss=0.3414, simple_loss=0.3819, pruned_loss=0.1505, over 21765.00 frames. ], tot_loss[loss=0.3986, simple_loss=0.4317, pruned_loss=0.1827, over 4267254.53 frames. ], batch size: 124, lr: 3.64e-02, grad_scale: 16.0 2023-06-17 21:39:47,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=51900.0, ans=0.125 2023-06-17 21:39:49,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=51900.0, ans=0.0 2023-06-17 21:40:35,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=52020.0, ans=0.5 2023-06-17 21:41:12,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2023-06-17 21:41:18,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.55 vs. 
limit=10.0 2023-06-17 21:41:23,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=52200.0, ans=0.0 2023-06-17 21:41:24,320 INFO [train.py:996] (0/4) Epoch 1, batch 8700, loss[loss=0.3634, simple_loss=0.3777, pruned_loss=0.1746, over 21755.00 frames. ], tot_loss[loss=0.3863, simple_loss=0.4221, pruned_loss=0.1753, over 4262532.08 frames. ], batch size: 124, lr: 3.64e-02, grad_scale: 16.0 2023-06-17 21:41:57,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-06-17 21:42:03,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=52260.0, ans=0.125 2023-06-17 21:42:04,522 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.951e+02 4.948e+02 6.720e+02 1.137e+03, threshold=9.897e+02, percent-clipped=0.0 2023-06-17 21:42:21,963 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.21 vs. limit=12.0 2023-06-17 21:42:25,016 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0 2023-06-17 21:43:00,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=52440.0, ans=0.0 2023-06-17 21:43:00,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=52440.0, ans=0.125 2023-06-17 21:43:13,275 INFO [train.py:996] (0/4) Epoch 1, batch 8750, loss[loss=0.3956, simple_loss=0.4067, pruned_loss=0.1923, over 21643.00 frames. ], tot_loss[loss=0.3875, simple_loss=0.4189, pruned_loss=0.178, over 4262750.34 frames. ], batch size: 230, lr: 3.63e-02, grad_scale: 16.0 2023-06-17 21:43:13,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=52500.0, ans=0.5 2023-06-17 21:43:42,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-17 21:43:48,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.16 vs. limit=22.5 2023-06-17 21:44:42,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=52740.0, ans=0.2 2023-06-17 21:45:03,175 INFO [train.py:996] (0/4) Epoch 1, batch 8800, loss[loss=0.4896, simple_loss=0.5034, pruned_loss=0.2379, over 21593.00 frames. ], tot_loss[loss=0.3951, simple_loss=0.4261, pruned_loss=0.182, over 4264244.52 frames. 
], batch size: 414, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 21:45:32,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 5.221e+02 6.385e+02 9.121e+02 2.025e+03, threshold=1.277e+03, percent-clipped=20.0 2023-06-17 21:45:46,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=52920.0, ans=0.05 2023-06-17 21:46:04,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=52980.0, ans=0.125 2023-06-17 21:46:39,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=53040.0, ans=0.1 2023-06-17 21:46:49,110 INFO [train.py:996] (0/4) Epoch 1, batch 8850, loss[loss=0.5421, simple_loss=0.6281, pruned_loss=0.2281, over 20798.00 frames. ], tot_loss[loss=0.4041, simple_loss=0.4359, pruned_loss=0.1862, over 4265526.77 frames. ], batch size: 607, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 21:47:10,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=53160.0, ans=0.09899494936611666 2023-06-17 21:47:22,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=53220.0, ans=0.125 2023-06-17 21:48:29,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=53340.0, ans=0.125 2023-06-17 21:48:33,216 INFO [train.py:996] (0/4) Epoch 1, batch 8900, loss[loss=0.3602, simple_loss=0.4034, pruned_loss=0.1584, over 21521.00 frames. ], tot_loss[loss=0.3966, simple_loss=0.4271, pruned_loss=0.1831, over 4265693.11 frames. ], batch size: 389, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 21:49:10,048 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.740e+02 3.829e+02 5.153e+02 6.429e+02 1.062e+03, threshold=1.031e+03, percent-clipped=0.0 2023-06-17 21:49:33,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=53520.0, ans=0.09899494936611666 2023-06-17 21:49:50,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=53580.0, ans=0.1 2023-06-17 21:50:19,534 INFO [train.py:996] (0/4) Epoch 1, batch 8950, loss[loss=0.3131, simple_loss=0.3368, pruned_loss=0.1447, over 21292.00 frames. ], tot_loss[loss=0.3949, simple_loss=0.4286, pruned_loss=0.1806, over 4268069.88 frames. ], batch size: 176, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 21:50:28,712 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-17 21:50:45,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=53760.0, ans=0.0 2023-06-17 21:51:00,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. 
limit=15.0 2023-06-17 21:51:48,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=53940.0, ans=15.0 2023-06-17 21:52:03,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=54000.0, ans=15.0 2023-06-17 21:52:04,305 INFO [train.py:996] (0/4) Epoch 1, batch 9000, loss[loss=0.4482, simple_loss=0.5266, pruned_loss=0.1849, over 19739.00 frames. ], tot_loss[loss=0.3887, simple_loss=0.419, pruned_loss=0.1791, over 4265200.65 frames. ], batch size: 702, lr: 3.60e-02, grad_scale: 32.0 2023-06-17 21:52:04,311 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-17 21:52:23,529 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3404, simple_loss=0.4251, pruned_loss=0.1278, over 1796401.00 frames. 2023-06-17 21:52:23,529 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-17 21:53:06,932 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 4.196e+02 5.638e+02 6.877e+02 1.385e+03, threshold=1.128e+03, percent-clipped=3.0 2023-06-17 21:53:30,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54180.0, ans=0.1 2023-06-17 21:54:04,856 INFO [train.py:996] (0/4) Epoch 1, batch 9050, loss[loss=0.4551, simple_loss=0.4665, pruned_loss=0.2218, over 21443.00 frames. ], tot_loss[loss=0.3828, simple_loss=0.4167, pruned_loss=0.1745, over 4268918.70 frames. ], batch size: 471, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 21:54:56,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=54420.0, ans=0.125 2023-06-17 21:55:45,377 INFO [train.py:996] (0/4) Epoch 1, batch 9100, loss[loss=0.3629, simple_loss=0.4285, pruned_loss=0.1486, over 21642.00 frames. ], tot_loss[loss=0.3907, simple_loss=0.424, pruned_loss=0.1787, over 4274062.06 frames. ], batch size: 389, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 21:56:31,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.753e+02 4.940e+02 6.814e+02 2.174e+03, threshold=9.881e+02, percent-clipped=7.0 2023-06-17 21:56:48,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=54720.0, ans=0.125 2023-06-17 21:57:35,317 INFO [train.py:996] (0/4) Epoch 1, batch 9150, loss[loss=0.3537, simple_loss=0.4172, pruned_loss=0.145, over 21738.00 frames. ], tot_loss[loss=0.3849, simple_loss=0.423, pruned_loss=0.1734, over 4276937.40 frames. ], batch size: 351, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 21:58:16,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=54960.0, ans=0.0 2023-06-17 21:58:16,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=54960.0, ans=0.0 2023-06-17 21:59:22,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55140.0, ans=0.1 2023-06-17 21:59:29,990 INFO [train.py:996] (0/4) Epoch 1, batch 9200, loss[loss=0.4733, simple_loss=0.4889, pruned_loss=0.2289, over 21731.00 frames. ], tot_loss[loss=0.3849, simple_loss=0.4238, pruned_loss=0.173, over 4272676.94 frames. 
], batch size: 441, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 21:59:59,133 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.127e+02 5.458e+02 7.694e+02 1.391e+03, threshold=1.092e+03, percent-clipped=9.0 2023-06-17 22:00:29,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=55380.0, ans=0.2 2023-06-17 22:01:13,164 INFO [train.py:996] (0/4) Epoch 1, batch 9250, loss[loss=0.4686, simple_loss=0.458, pruned_loss=0.2396, over 21644.00 frames. ], tot_loss[loss=0.3956, simple_loss=0.4294, pruned_loss=0.1809, over 4274909.83 frames. ], batch size: 441, lr: 3.57e-02, grad_scale: 32.0 2023-06-17 22:02:58,754 INFO [train.py:996] (0/4) Epoch 1, batch 9300, loss[loss=0.3431, simple_loss=0.3796, pruned_loss=0.1533, over 21362.00 frames. ], tot_loss[loss=0.3929, simple_loss=0.4243, pruned_loss=0.1807, over 4270149.60 frames. ], batch size: 131, lr: 3.57e-02, grad_scale: 32.0 2023-06-17 22:03:01,489 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.23 vs. limit=6.0 2023-06-17 22:03:11,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=55800.0, ans=0.0 2023-06-17 22:03:15,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.64 vs. limit=15.0 2023-06-17 22:03:28,668 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.804e+02 4.090e+02 5.143e+02 6.278e+02 1.452e+03, threshold=1.029e+03, percent-clipped=2.0 2023-06-17 22:03:55,449 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-17 22:04:13,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=55980.0, ans=0.125 2023-06-17 22:04:35,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.01 vs. limit=6.0 2023-06-17 22:04:44,718 INFO [train.py:996] (0/4) Epoch 1, batch 9350, loss[loss=0.3965, simple_loss=0.4331, pruned_loss=0.1799, over 21661.00 frames. ], tot_loss[loss=0.3956, simple_loss=0.4297, pruned_loss=0.1808, over 4260387.88 frames. ], batch size: 230, lr: 3.56e-02, grad_scale: 32.0 2023-06-17 22:05:05,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.72 vs. limit=6.0 2023-06-17 22:05:22,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=56220.0, ans=0.125 2023-06-17 22:06:07,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=56280.0, ans=0.0 2023-06-17 22:06:17,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=56340.0, ans=10.0 2023-06-17 22:06:28,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=56400.0, ans=0.1 2023-06-17 22:06:29,240 INFO [train.py:996] (0/4) Epoch 1, batch 9400, loss[loss=0.3695, simple_loss=0.387, pruned_loss=0.176, over 15696.00 frames. 
], tot_loss[loss=0.3981, simple_loss=0.4318, pruned_loss=0.1822, over 4257724.27 frames. ], batch size: 60, lr: 3.55e-02, grad_scale: 32.0 2023-06-17 22:06:31,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=56400.0, ans=0.0 2023-06-17 22:06:55,110 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-17 22:07:03,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.872e+02 4.762e+02 5.781e+02 7.006e+02 1.289e+03, threshold=1.156e+03, percent-clipped=1.0 2023-06-17 22:08:11,255 INFO [train.py:996] (0/4) Epoch 1, batch 9450, loss[loss=0.3455, simple_loss=0.3776, pruned_loss=0.1567, over 21834.00 frames. ], tot_loss[loss=0.3887, simple_loss=0.4204, pruned_loss=0.1784, over 4268468.00 frames. ], batch size: 107, lr: 3.55e-02, grad_scale: 16.0 2023-06-17 22:08:27,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=56760.0, ans=0.0 2023-06-17 22:09:24,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=56880.0, ans=0.125 2023-06-17 22:09:40,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=56940.0, ans=0.2 2023-06-17 22:09:51,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=57000.0, ans=0.125 2023-06-17 22:09:52,859 INFO [train.py:996] (0/4) Epoch 1, batch 9500, loss[loss=0.3554, simple_loss=0.4049, pruned_loss=0.1529, over 21445.00 frames. ], tot_loss[loss=0.3789, simple_loss=0.41, pruned_loss=0.1739, over 4264533.66 frames. ], batch size: 471, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 22:10:00,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=57000.0, ans=0.125 2023-06-17 22:10:35,939 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.522e+02 3.981e+02 4.935e+02 6.509e+02 1.656e+03, threshold=9.871e+02, percent-clipped=4.0 2023-06-17 22:11:01,568 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-17 22:11:37,279 INFO [train.py:996] (0/4) Epoch 1, batch 9550, loss[loss=0.4182, simple_loss=0.4424, pruned_loss=0.197, over 21350.00 frames. ], tot_loss[loss=0.3869, simple_loss=0.4161, pruned_loss=0.1789, over 4272191.26 frames. 
], batch size: 548, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 22:12:21,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=57420.0, ans=0.125 2023-06-17 22:12:37,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=57420.0, ans=0.125 2023-06-17 22:12:53,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=57480.0, ans=0.0 2023-06-17 22:13:07,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=57540.0, ans=0.5 2023-06-17 22:13:11,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=57540.0, ans=0.125 2023-06-17 22:13:22,040 INFO [train.py:996] (0/4) Epoch 1, batch 9600, loss[loss=0.4452, simple_loss=0.5147, pruned_loss=0.1878, over 20774.00 frames. ], tot_loss[loss=0.39, simple_loss=0.4188, pruned_loss=0.1806, over 4278360.35 frames. ], batch size: 607, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 22:13:28,271 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.82 vs. limit=15.0 2023-06-17 22:13:32,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=57600.0, ans=0.2 2023-06-17 22:13:59,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=57660.0, ans=0.125 2023-06-17 22:14:01,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=57660.0, ans=0.125 2023-06-17 22:14:08,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.810e+02 4.156e+02 5.294e+02 7.045e+02 1.358e+03, threshold=1.059e+03, percent-clipped=6.0 2023-06-17 22:14:53,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=57840.0, ans=0.0 2023-06-17 22:15:01,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57840.0, ans=0.1 2023-06-17 22:15:06,112 INFO [train.py:996] (0/4) Epoch 1, batch 9650, loss[loss=0.4264, simple_loss=0.4484, pruned_loss=0.2022, over 21694.00 frames. ], tot_loss[loss=0.3884, simple_loss=0.4176, pruned_loss=0.1796, over 4282871.51 frames. 
], batch size: 351, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 22:15:08,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=57900.0, ans=0.125 2023-06-17 22:16:14,247 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:16:25,309 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:16:40,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=58140.0, ans=0.125 2023-06-17 22:16:47,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=58140.0, ans=0.2 2023-06-17 22:16:49,790 INFO [train.py:996] (0/4) Epoch 1, batch 9700, loss[loss=0.4014, simple_loss=0.4281, pruned_loss=0.1873, over 21416.00 frames. ], tot_loss[loss=0.3906, simple_loss=0.4212, pruned_loss=0.18, over 4281050.68 frames. ], batch size: 548, lr: 3.52e-02, grad_scale: 32.0 2023-06-17 22:17:21,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=58260.0, ans=0.2 2023-06-17 22:17:23,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=58260.0, ans=0.07 2023-06-17 22:17:37,780 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.787e+02 4.137e+02 5.402e+02 6.942e+02 1.239e+03, threshold=1.080e+03, percent-clipped=2.0 2023-06-17 22:17:55,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=58320.0, ans=0.2 2023-06-17 22:18:09,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=58380.0, ans=0.1 2023-06-17 22:18:27,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=58440.0, ans=0.125 2023-06-17 22:18:33,487 INFO [train.py:996] (0/4) Epoch 1, batch 9750, loss[loss=0.342, simple_loss=0.3707, pruned_loss=0.1567, over 21852.00 frames. ], tot_loss[loss=0.3837, simple_loss=0.4125, pruned_loss=0.1775, over 4281319.15 frames. ], batch size: 107, lr: 3.51e-02, grad_scale: 32.0 2023-06-17 22:18:48,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=58500.0, ans=0.125 2023-06-17 22:19:48,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=58680.0, ans=0.2 2023-06-17 22:20:00,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=58740.0, ans=22.5 2023-06-17 22:20:04,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=58740.0, ans=0.2 2023-06-17 22:20:07,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=58740.0, ans=0.09899494936611666 2023-06-17 22:20:15,545 INFO [train.py:996] (0/4) Epoch 1, batch 9800, loss[loss=0.3949, simple_loss=0.4092, pruned_loss=0.1903, over 21796.00 frames. ], tot_loss[loss=0.3832, simple_loss=0.4118, pruned_loss=0.1773, over 4285578.76 frames. 
], batch size: 441, lr: 3.51e-02, grad_scale: 16.0 2023-06-17 22:20:21,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=58800.0, ans=0.2 2023-06-17 22:21:03,545 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.967e+02 4.740e+02 5.847e+02 8.148e+02 2.070e+03, threshold=1.169e+03, percent-clipped=10.0 2023-06-17 22:21:08,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=58920.0, ans=0.125 2023-06-17 22:21:38,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=58980.0, ans=0.125 2023-06-17 22:21:56,352 INFO [train.py:996] (0/4) Epoch 1, batch 9850, loss[loss=0.4056, simple_loss=0.4079, pruned_loss=0.2016, over 21627.00 frames. ], tot_loss[loss=0.3817, simple_loss=0.4093, pruned_loss=0.1771, over 4286611.86 frames. ], batch size: 391, lr: 3.50e-02, grad_scale: 16.0 2023-06-17 22:22:18,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=59160.0, ans=0.0 2023-06-17 22:22:29,486 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-17 22:22:30,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=59160.0, ans=0.125 2023-06-17 22:22:41,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=59160.0, ans=0.07 2023-06-17 22:22:45,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=59220.0, ans=0.05 2023-06-17 22:23:20,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=59280.0, ans=0.125 2023-06-17 22:23:20,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=59280.0, ans=0.125 2023-06-17 22:23:23,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=59340.0, ans=0.0 2023-06-17 22:23:28,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=59340.0, ans=0.0 2023-06-17 22:23:34,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=59340.0, ans=0.015 2023-06-17 22:23:39,154 INFO [train.py:996] (0/4) Epoch 1, batch 9900, loss[loss=0.4497, simple_loss=0.4703, pruned_loss=0.2145, over 21571.00 frames. ], tot_loss[loss=0.3795, simple_loss=0.4063, pruned_loss=0.1763, over 4260729.01 frames. ], batch size: 389, lr: 3.50e-02, grad_scale: 16.0 2023-06-17 22:24:28,782 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.114e+02 4.325e+02 5.228e+02 6.725e+02 1.103e+03, threshold=1.046e+03, percent-clipped=0.0 2023-06-17 22:24:46,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=59520.0, ans=0.04949747468305833 2023-06-17 22:25:23,915 INFO [train.py:996] (0/4) Epoch 1, batch 9950, loss[loss=0.3857, simple_loss=0.3904, pruned_loss=0.1905, over 21546.00 frames. ], tot_loss[loss=0.3837, simple_loss=0.4087, pruned_loss=0.1793, over 4258977.40 frames. 
], batch size: 263, lr: 3.49e-02, grad_scale: 16.0 2023-06-17 22:26:18,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=59820.0, ans=0.125 2023-06-17 22:26:33,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=59880.0, ans=0.0 2023-06-17 22:26:40,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=59880.0, ans=0.125 2023-06-17 22:26:53,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=59940.0, ans=0.0 2023-06-17 22:26:56,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=59940.0, ans=0.2 2023-06-17 22:27:12,390 INFO [train.py:996] (0/4) Epoch 1, batch 10000, loss[loss=0.4909, simple_loss=0.4633, pruned_loss=0.2593, over 21275.00 frames. ], tot_loss[loss=0.3825, simple_loss=0.4072, pruned_loss=0.1789, over 4263810.15 frames. ], batch size: 471, lr: 3.49e-02, grad_scale: 32.0 2023-06-17 22:27:13,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=60000.0, ans=0.0 2023-06-17 22:27:23,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60000.0, ans=0.1 2023-06-17 22:27:40,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=60060.0, ans=0.125 2023-06-17 22:27:53,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=60060.0, ans=0.125 2023-06-17 22:28:02,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=60120.0, ans=0.125 2023-06-17 22:28:02,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=60120.0, ans=0.0 2023-06-17 22:28:03,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.788e+02 4.452e+02 5.196e+02 6.727e+02 1.360e+03, threshold=1.039e+03, percent-clipped=5.0 2023-06-17 22:28:42,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=60240.0, ans=0.125 2023-06-17 22:28:42,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=60240.0, ans=0.0 2023-06-17 22:28:47,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=60240.0, ans=0.125 2023-06-17 22:29:04,523 INFO [train.py:996] (0/4) Epoch 1, batch 10050, loss[loss=0.3425, simple_loss=0.3803, pruned_loss=0.1523, over 21266.00 frames. ], tot_loss[loss=0.3839, simple_loss=0.4088, pruned_loss=0.1795, over 4269634.68 frames. 
], batch size: 549, lr: 3.48e-02, grad_scale: 32.0 2023-06-17 22:29:10,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=60300.0, ans=0.0 2023-06-17 22:29:22,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=60300.0, ans=0.125 2023-06-17 22:29:30,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=60360.0, ans=0.125 2023-06-17 22:29:46,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=60360.0, ans=0.125 2023-06-17 22:29:56,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=60420.0, ans=0.125 2023-06-17 22:30:21,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.63 vs. limit=6.0 2023-06-17 22:30:22,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=60480.0, ans=0.2 2023-06-17 22:30:54,323 INFO [train.py:996] (0/4) Epoch 1, batch 10100, loss[loss=0.3988, simple_loss=0.4331, pruned_loss=0.1823, over 21747.00 frames. ], tot_loss[loss=0.3771, simple_loss=0.4051, pruned_loss=0.1746, over 4266248.16 frames. ], batch size: 351, lr: 3.47e-02, grad_scale: 32.0 2023-06-17 22:31:00,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=60600.0, ans=0.2 2023-06-17 22:31:18,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=60660.0, ans=0.2 2023-06-17 22:31:27,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=22.5 2023-06-17 22:31:32,509 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.854e+02 4.188e+02 5.288e+02 6.297e+02 1.348e+03, threshold=1.058e+03, percent-clipped=5.0 2023-06-17 22:31:41,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=60720.0, ans=0.2 2023-06-17 22:32:37,661 INFO [train.py:996] (0/4) Epoch 1, batch 10150, loss[loss=0.3872, simple_loss=0.4291, pruned_loss=0.1727, over 21361.00 frames. ], tot_loss[loss=0.3846, simple_loss=0.4121, pruned_loss=0.1786, over 4264219.17 frames. ], batch size: 131, lr: 3.47e-02, grad_scale: 32.0 2023-06-17 22:32:43,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=60900.0, ans=0.125 2023-06-17 22:33:11,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=60960.0, ans=0.125 2023-06-17 22:33:43,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.70 vs. limit=6.0 2023-06-17 22:34:22,098 INFO [train.py:996] (0/4) Epoch 1, batch 10200, loss[loss=0.414, simple_loss=0.4451, pruned_loss=0.1915, over 21534.00 frames. ], tot_loss[loss=0.3787, simple_loss=0.4093, pruned_loss=0.174, over 4257276.69 frames. 
], batch size: 441, lr: 3.46e-02, grad_scale: 32.0 2023-06-17 22:34:48,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=61260.0, ans=0.1 2023-06-17 22:35:01,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 3.806e+02 4.726e+02 6.535e+02 1.145e+03, threshold=9.453e+02, percent-clipped=1.0 2023-06-17 22:36:11,359 INFO [train.py:996] (0/4) Epoch 1, batch 10250, loss[loss=0.4361, simple_loss=0.4655, pruned_loss=0.2033, over 21481.00 frames. ], tot_loss[loss=0.3682, simple_loss=0.4053, pruned_loss=0.1656, over 4260480.52 frames. ], batch size: 131, lr: 3.46e-02, grad_scale: 32.0 2023-06-17 22:36:47,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61560.0, ans=0.1 2023-06-17 22:37:11,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=61680.0, ans=0.0 2023-06-17 22:37:13,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=61680.0, ans=0.2 2023-06-17 22:37:58,261 INFO [train.py:996] (0/4) Epoch 1, batch 10300, loss[loss=0.421, simple_loss=0.4423, pruned_loss=0.1999, over 20715.00 frames. ], tot_loss[loss=0.3693, simple_loss=0.4064, pruned_loss=0.1661, over 4259469.41 frames. ], batch size: 607, lr: 3.45e-02, grad_scale: 16.0 2023-06-17 22:38:03,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=61800.0, ans=0.125 2023-06-17 22:38:34,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=61860.0, ans=0.0 2023-06-17 22:38:36,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=61920.0, ans=0.0 2023-06-17 22:38:39,036 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 4.187e+02 5.738e+02 8.381e+02 2.086e+03, threshold=1.148e+03, percent-clipped=17.0 2023-06-17 22:39:08,461 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:39:43,549 INFO [train.py:996] (0/4) Epoch 1, batch 10350, loss[loss=0.4081, simple_loss=0.4393, pruned_loss=0.1885, over 21475.00 frames. ], tot_loss[loss=0.3673, simple_loss=0.4061, pruned_loss=0.1642, over 4260984.23 frames. ], batch size: 471, lr: 3.45e-02, grad_scale: 16.0 2023-06-17 22:39:46,159 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. 
limit=15.0 2023-06-17 22:40:11,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=62160.0, ans=0.0 2023-06-17 22:40:16,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62160.0, ans=0.1 2023-06-17 22:40:26,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=62220.0, ans=0.125 2023-06-17 22:40:43,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=62220.0, ans=0.025 2023-06-17 22:40:47,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=62220.0, ans=0.0 2023-06-17 22:40:50,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=62280.0, ans=0.0 2023-06-17 22:41:24,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=62340.0, ans=0.1 2023-06-17 22:41:29,021 INFO [train.py:996] (0/4) Epoch 1, batch 10400, loss[loss=0.3523, simple_loss=0.3971, pruned_loss=0.1538, over 21554.00 frames. ], tot_loss[loss=0.3559, simple_loss=0.394, pruned_loss=0.1589, over 4252693.59 frames. ], batch size: 441, lr: 3.44e-02, grad_scale: 32.0 2023-06-17 22:42:19,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 3.646e+02 4.942e+02 6.227e+02 1.303e+03, threshold=9.884e+02, percent-clipped=2.0 2023-06-17 22:42:29,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=62520.0, ans=0.0 2023-06-17 22:42:30,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-17 22:42:31,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=62520.0, ans=0.0 2023-06-17 22:42:33,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62580.0, ans=0.1 2023-06-17 22:43:17,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=62700.0, ans=0.125 2023-06-17 22:43:18,631 INFO [train.py:996] (0/4) Epoch 1, batch 10450, loss[loss=0.4024, simple_loss=0.43, pruned_loss=0.1874, over 21387.00 frames. ], tot_loss[loss=0.3636, simple_loss=0.3989, pruned_loss=0.1642, over 4252329.29 frames. ], batch size: 131, lr: 3.44e-02, grad_scale: 32.0 2023-06-17 22:43:46,797 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.83 vs. limit=12.0 2023-06-17 22:44:26,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=62880.0, ans=0.125 2023-06-17 22:44:59,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=62940.0, ans=0.125 2023-06-17 22:45:02,393 INFO [train.py:996] (0/4) Epoch 1, batch 10500, loss[loss=0.3932, simple_loss=0.4068, pruned_loss=0.1898, over 21463.00 frames. ], tot_loss[loss=0.3652, simple_loss=0.4018, pruned_loss=0.1643, over 4257237.57 frames. 
], batch size: 441, lr: 3.43e-02, grad_scale: 32.0 2023-06-17 22:45:41,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=63060.0, ans=0.2 2023-06-17 22:45:48,075 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 3.889e+02 5.007e+02 6.898e+02 1.631e+03, threshold=1.001e+03, percent-clipped=5.0 2023-06-17 22:46:02,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=63120.0, ans=0.0 2023-06-17 22:46:45,948 INFO [train.py:996] (0/4) Epoch 1, batch 10550, loss[loss=0.3774, simple_loss=0.3942, pruned_loss=0.1803, over 21743.00 frames. ], tot_loss[loss=0.3643, simple_loss=0.3974, pruned_loss=0.1656, over 4257344.55 frames. ], batch size: 351, lr: 3.43e-02, grad_scale: 32.0 2023-06-17 22:47:37,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=63420.0, ans=0.125 2023-06-17 22:47:50,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=63480.0, ans=0.125 2023-06-17 22:47:53,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=63480.0, ans=0.05 2023-06-17 22:47:57,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=63480.0, ans=0.025 2023-06-17 22:48:08,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=63540.0, ans=0.0 2023-06-17 22:48:29,761 INFO [train.py:996] (0/4) Epoch 1, batch 10600, loss[loss=0.2925, simple_loss=0.351, pruned_loss=0.117, over 21248.00 frames. ], tot_loss[loss=0.3588, simple_loss=0.3915, pruned_loss=0.1631, over 4247762.14 frames. ], batch size: 176, lr: 3.42e-02, grad_scale: 32.0 2023-06-17 22:49:22,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.854e+02 4.619e+02 6.310e+02 1.881e+03, threshold=9.238e+02, percent-clipped=9.0 2023-06-17 22:49:27,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.88 vs. limit=22.5 2023-06-17 22:49:35,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.74 vs. limit=22.5 2023-06-17 22:50:28,336 INFO [train.py:996] (0/4) Epoch 1, batch 10650, loss[loss=0.3292, simple_loss=0.3894, pruned_loss=0.1345, over 21577.00 frames. ], tot_loss[loss=0.3594, simple_loss=0.3956, pruned_loss=0.1616, over 4253328.65 frames. 
], batch size: 441, lr: 3.41e-02, grad_scale: 32.0 2023-06-17 22:50:55,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=63960.0, ans=0.125 2023-06-17 22:51:00,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=63960.0, ans=0.0 2023-06-17 22:51:03,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=63960.0, ans=0.0 2023-06-17 22:51:05,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=64020.0, ans=0.125 2023-06-17 22:51:12,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=64020.0, ans=0.125 2023-06-17 22:51:24,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=64080.0, ans=0.125 2023-06-17 22:51:38,008 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-17 22:52:14,311 INFO [train.py:996] (0/4) Epoch 1, batch 10700, loss[loss=0.385, simple_loss=0.3983, pruned_loss=0.1859, over 21321.00 frames. ], tot_loss[loss=0.3592, simple_loss=0.3946, pruned_loss=0.1619, over 4246120.33 frames. ], batch size: 471, lr: 3.41e-02, grad_scale: 32.0 2023-06-17 22:52:15,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=64200.0, ans=0.95 2023-06-17 22:52:23,856 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.69 vs. limit=6.0 2023-06-17 22:52:55,091 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.608e+02 4.131e+02 5.113e+02 6.555e+02 1.006e+03, threshold=1.023e+03, percent-clipped=2.0 2023-06-17 22:53:00,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=64320.0, ans=0.0 2023-06-17 22:53:56,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=64440.0, ans=0.125 2023-06-17 22:53:59,522 INFO [train.py:996] (0/4) Epoch 1, batch 10750, loss[loss=0.494, simple_loss=0.5271, pruned_loss=0.2304, over 21533.00 frames. ], tot_loss[loss=0.373, simple_loss=0.4073, pruned_loss=0.1694, over 4254040.81 frames. ], batch size: 471, lr: 3.40e-02, grad_scale: 32.0 2023-06-17 22:54:25,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=64560.0, ans=0.0 2023-06-17 22:54:40,246 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.514e-03 2023-06-17 22:55:24,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=64680.0, ans=0.125 2023-06-17 22:55:30,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-06-17 22:55:49,330 INFO [train.py:996] (0/4) Epoch 1, batch 10800, loss[loss=0.4043, simple_loss=0.4309, pruned_loss=0.1889, over 21744.00 frames. 
], tot_loss[loss=0.3755, simple_loss=0.4117, pruned_loss=0.1697, over 4256735.40 frames. ], batch size: 298, lr: 3.40e-02, grad_scale: 32.0 2023-06-17 22:56:06,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=64860.0, ans=0.0 2023-06-17 22:56:11,821 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.63 vs. limit=6.0 2023-06-17 22:56:30,677 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.758e+02 4.502e+02 5.308e+02 7.377e+02 1.430e+03, threshold=1.062e+03, percent-clipped=5.0 2023-06-17 22:56:36,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=64920.0, ans=0.0 2023-06-17 22:57:23,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=65040.0, ans=0.125 2023-06-17 22:57:26,851 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.15 vs. limit=6.0 2023-06-17 22:57:33,930 INFO [train.py:996] (0/4) Epoch 1, batch 10850, loss[loss=0.434, simple_loss=0.4287, pruned_loss=0.2196, over 21285.00 frames. ], tot_loss[loss=0.3776, simple_loss=0.4134, pruned_loss=0.1709, over 4260583.55 frames. ], batch size: 471, lr: 3.39e-02, grad_scale: 32.0 2023-06-17 22:57:53,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=65160.0, ans=0.125 2023-06-17 22:58:05,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=65160.0, ans=0.125 2023-06-17 22:58:27,515 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:58:49,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2023-06-17 22:59:01,003 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-17 22:59:05,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=65340.0, ans=0.2 2023-06-17 22:59:17,711 INFO [train.py:996] (0/4) Epoch 1, batch 10900, loss[loss=0.3353, simple_loss=0.3992, pruned_loss=0.1357, over 21409.00 frames. ], tot_loss[loss=0.3739, simple_loss=0.4087, pruned_loss=0.1696, over 4254552.47 frames. ], batch size: 211, lr: 3.39e-02, grad_scale: 32.0 2023-06-17 22:59:44,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=65460.0, ans=0.125 2023-06-17 22:59:59,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.489e+02 3.764e+02 4.430e+02 5.513e+02 1.224e+03, threshold=8.861e+02, percent-clipped=2.0 2023-06-17 23:00:43,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=65580.0, ans=0.07 2023-06-17 23:01:01,593 INFO [train.py:996] (0/4) Epoch 1, batch 10950, loss[loss=0.4084, simple_loss=0.4348, pruned_loss=0.191, over 20640.00 frames. ], tot_loss[loss=0.3688, simple_loss=0.4037, pruned_loss=0.167, over 4255784.93 frames. 
], batch size: 607, lr: 3.38e-02, grad_scale: 32.0 2023-06-17 23:01:16,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=65760.0, ans=0.2 2023-06-17 23:01:48,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=65820.0, ans=0.04949747468305833 2023-06-17 23:01:58,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=65820.0, ans=0.125 2023-06-17 23:02:44,535 INFO [train.py:996] (0/4) Epoch 1, batch 11000, loss[loss=0.4219, simple_loss=0.4464, pruned_loss=0.1987, over 21722.00 frames. ], tot_loss[loss=0.3684, simple_loss=0.4018, pruned_loss=0.1675, over 4258852.43 frames. ], batch size: 112, lr: 3.38e-02, grad_scale: 32.0 2023-06-17 23:02:56,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=66000.0, ans=0.0 2023-06-17 23:03:18,146 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:03:24,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=66120.0, ans=0.125 2023-06-17 23:03:25,868 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.455e+02 4.369e+02 5.427e+02 7.022e+02 1.248e+03, threshold=1.085e+03, percent-clipped=10.0 2023-06-17 23:03:42,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=66120.0, ans=0.125 2023-06-17 23:04:21,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=66240.0, ans=0.125 2023-06-17 23:04:25,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=66240.0, ans=0.125 2023-06-17 23:04:27,724 INFO [train.py:996] (0/4) Epoch 1, batch 11050, loss[loss=0.3938, simple_loss=0.4038, pruned_loss=0.192, over 21961.00 frames. ], tot_loss[loss=0.3679, simple_loss=0.3987, pruned_loss=0.1686, over 4269646.70 frames. ], batch size: 103, lr: 3.37e-02, grad_scale: 32.0 2023-06-17 23:04:48,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=66360.0, ans=0.125 2023-06-17 23:05:03,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66360.0, ans=0.1 2023-06-17 23:05:27,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.96 vs. 
limit=6.0 2023-06-17 23:05:36,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=66480.0, ans=0.0 2023-06-17 23:05:36,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=66480.0, ans=0.04949747468305833 2023-06-17 23:05:44,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=66480.0, ans=0.125 2023-06-17 23:05:55,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66540.0, ans=0.1 2023-06-17 23:06:11,024 INFO [train.py:996] (0/4) Epoch 1, batch 11100, loss[loss=0.3633, simple_loss=0.385, pruned_loss=0.1708, over 21322.00 frames. ], tot_loss[loss=0.3641, simple_loss=0.3937, pruned_loss=0.1673, over 4272427.44 frames. ], batch size: 194, lr: 3.37e-02, grad_scale: 32.0 2023-06-17 23:06:15,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=66600.0, ans=0.125 2023-06-17 23:06:31,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=66660.0, ans=0.125 2023-06-17 23:06:47,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=66660.0, ans=0.2 2023-06-17 23:06:58,054 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.675e+02 3.924e+02 4.981e+02 6.262e+02 1.185e+03, threshold=9.963e+02, percent-clipped=1.0 2023-06-17 23:07:15,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=66780.0, ans=0.125 2023-06-17 23:07:54,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66900.0, ans=0.1 2023-06-17 23:07:55,936 INFO [train.py:996] (0/4) Epoch 1, batch 11150, loss[loss=0.3074, simple_loss=0.346, pruned_loss=0.1344, over 21259.00 frames. ], tot_loss[loss=0.3621, simple_loss=0.3916, pruned_loss=0.1663, over 4277023.78 frames. ], batch size: 176, lr: 3.36e-02, grad_scale: 32.0 2023-06-17 23:08:11,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=66900.0, ans=0.125 2023-06-17 23:08:13,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66900.0, ans=0.1 2023-06-17 23:08:24,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=66960.0, ans=0.125 2023-06-17 23:08:52,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=67020.0, ans=0.125 2023-06-17 23:09:03,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=67080.0, ans=0.2 2023-06-17 23:09:38,668 INFO [train.py:996] (0/4) Epoch 1, batch 11200, loss[loss=0.3321, simple_loss=0.352, pruned_loss=0.1561, over 21627.00 frames. ], tot_loss[loss=0.3599, simple_loss=0.3889, pruned_loss=0.1654, over 4273008.63 frames. 
], batch size: 282, lr: 3.36e-02, grad_scale: 32.0 2023-06-17 23:10:03,965 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:10:12,421 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=5.180e-03 2023-06-17 23:10:13,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=67260.0, ans=0.125 2023-06-17 23:10:25,345 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 3.969e+02 4.814e+02 6.139e+02 9.199e+02, threshold=9.628e+02, percent-clipped=0.0 2023-06-17 23:11:21,151 INFO [train.py:996] (0/4) Epoch 1, batch 11250, loss[loss=0.3548, simple_loss=0.3714, pruned_loss=0.1691, over 21730.00 frames. ], tot_loss[loss=0.3584, simple_loss=0.388, pruned_loss=0.1643, over 4270913.42 frames. ], batch size: 300, lr: 3.35e-02, grad_scale: 32.0 2023-06-17 23:11:52,068 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=22.5 2023-06-17 23:11:54,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=67560.0, ans=0.125 2023-06-17 23:12:25,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=67680.0, ans=0.025 2023-06-17 23:12:53,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=67740.0, ans=0.125 2023-06-17 23:12:58,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67740.0, ans=0.1 2023-06-17 23:13:04,122 INFO [train.py:996] (0/4) Epoch 1, batch 11300, loss[loss=0.3247, simple_loss=0.3645, pruned_loss=0.1425, over 21264.00 frames. ], tot_loss[loss=0.3591, simple_loss=0.3891, pruned_loss=0.1645, over 4267472.31 frames. ], batch size: 143, lr: 3.35e-02, grad_scale: 32.0 2023-06-17 23:13:04,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=67800.0, ans=0.0 2023-06-17 23:13:11,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=67800.0, ans=0.1 2023-06-17 23:13:21,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=67800.0, ans=0.2 2023-06-17 23:13:48,319 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:13:51,115 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.668e+02 4.732e+02 6.264e+02 1.219e+03, threshold=9.465e+02, percent-clipped=6.0 2023-06-17 23:14:01,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=67920.0, ans=0.125 2023-06-17 23:14:09,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=67980.0, ans=0.125 2023-06-17 23:14:36,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=68040.0, ans=0.1 2023-06-17 23:14:49,568 INFO [train.py:996] (0/4) Epoch 1, batch 11350, loss[loss=0.5207, simple_loss=0.5071, pruned_loss=0.2671, over 21336.00 frames. 
], tot_loss[loss=0.3602, simple_loss=0.3916, pruned_loss=0.1644, over 4271939.62 frames. ], batch size: 507, lr: 3.34e-02, grad_scale: 32.0 2023-06-17 23:16:41,432 INFO [train.py:996] (0/4) Epoch 1, batch 11400, loss[loss=0.3756, simple_loss=0.4074, pruned_loss=0.1719, over 21228.00 frames. ], tot_loss[loss=0.3692, simple_loss=0.401, pruned_loss=0.1687, over 4275947.85 frames. ], batch size: 143, lr: 3.34e-02, grad_scale: 32.0 2023-06-17 23:17:20,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.78 vs. limit=22.5 2023-06-17 23:17:28,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 4.138e+02 5.254e+02 6.973e+02 1.408e+03, threshold=1.051e+03, percent-clipped=10.0 2023-06-17 23:17:55,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=68580.0, ans=0.125 2023-06-17 23:18:02,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=68580.0, ans=0.1 2023-06-17 23:18:22,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=68640.0, ans=0.0 2023-06-17 23:18:22,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=68640.0, ans=0.1 2023-06-17 23:18:27,550 INFO [train.py:996] (0/4) Epoch 1, batch 11450, loss[loss=0.4423, simple_loss=0.4607, pruned_loss=0.2119, over 21827.00 frames. ], tot_loss[loss=0.3694, simple_loss=0.4035, pruned_loss=0.1676, over 4272281.16 frames. ], batch size: 124, lr: 3.33e-02, grad_scale: 32.0 2023-06-17 23:18:33,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=68700.0, ans=0.125 2023-06-17 23:19:20,079 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.84 vs. limit=6.0 2023-06-17 23:19:31,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=68820.0, ans=10.0 2023-06-17 23:19:36,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=68880.0, ans=0.0 2023-06-17 23:19:39,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=68880.0, ans=0.125 2023-06-17 23:19:43,955 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-17 23:20:13,534 INFO [train.py:996] (0/4) Epoch 1, batch 11500, loss[loss=0.3318, simple_loss=0.3988, pruned_loss=0.1324, over 21898.00 frames. ], tot_loss[loss=0.3713, simple_loss=0.4065, pruned_loss=0.1681, over 4275373.95 frames. ], batch size: 316, lr: 3.33e-02, grad_scale: 32.0 2023-06-17 23:20:38,525 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-17 23:20:38,594 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.14 vs. 
limit=15.0 2023-06-17 23:21:00,431 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 4.282e+02 5.552e+02 6.865e+02 1.531e+03, threshold=1.110e+03, percent-clipped=3.0 2023-06-17 23:21:16,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=69120.0, ans=0.125 2023-06-17 23:21:45,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=69240.0, ans=0.125 2023-06-17 23:22:09,467 INFO [train.py:996] (0/4) Epoch 1, batch 11550, loss[loss=0.4891, simple_loss=0.5434, pruned_loss=0.2174, over 21852.00 frames. ], tot_loss[loss=0.3724, simple_loss=0.411, pruned_loss=0.1669, over 4271978.97 frames. ], batch size: 371, lr: 3.32e-02, grad_scale: 32.0 2023-06-17 23:22:43,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=69360.0, ans=0.0 2023-06-17 23:22:47,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.95 vs. limit=15.0 2023-06-17 23:22:54,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=69420.0, ans=0.0 2023-06-17 23:23:49,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.21 vs. limit=15.0 2023-06-17 23:23:54,989 INFO [train.py:996] (0/4) Epoch 1, batch 11600, loss[loss=0.4145, simple_loss=0.4978, pruned_loss=0.1656, over 21197.00 frames. ], tot_loss[loss=0.3815, simple_loss=0.4255, pruned_loss=0.1688, over 4267521.43 frames. ], batch size: 548, lr: 3.32e-02, grad_scale: 32.0 2023-06-17 23:24:02,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=69600.0, ans=15.0 2023-06-17 23:24:39,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.776e+02 4.538e+02 6.004e+02 8.984e+02 1.767e+03, threshold=1.201e+03, percent-clipped=15.0 2023-06-17 23:25:04,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=69780.0, ans=0.05 2023-06-17 23:25:13,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=69780.0, ans=0.125 2023-06-17 23:25:16,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=69840.0, ans=0.0 2023-06-17 23:25:32,120 INFO [train.py:996] (0/4) Epoch 1, batch 11650, loss[loss=0.3542, simple_loss=0.4305, pruned_loss=0.139, over 21620.00 frames. ], tot_loss[loss=0.3816, simple_loss=0.4285, pruned_loss=0.1673, over 4271594.16 frames. 
], batch size: 230, lr: 3.31e-02, grad_scale: 16.0 2023-06-17 23:26:04,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=69960.0, ans=0.0 2023-06-17 23:26:14,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=70020.0, ans=0.125 2023-06-17 23:26:17,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=70020.0, ans=0.125 2023-06-17 23:26:55,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=70080.0, ans=0.125 2023-06-17 23:27:06,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=70140.0, ans=0.1 2023-06-17 23:27:06,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=70140.0, ans=0.0 2023-06-17 23:27:15,730 INFO [train.py:996] (0/4) Epoch 1, batch 11700, loss[loss=0.3517, simple_loss=0.3728, pruned_loss=0.1653, over 21675.00 frames. ], tot_loss[loss=0.3789, simple_loss=0.4207, pruned_loss=0.1686, over 4270625.02 frames. ], batch size: 282, lr: 3.31e-02, grad_scale: 16.0 2023-06-17 23:27:43,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=70260.0, ans=0.125 2023-06-17 23:27:48,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.15 vs. limit=22.5 2023-06-17 23:27:49,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=70260.0, ans=0.125 2023-06-17 23:28:00,021 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.900e+02 4.129e+02 5.507e+02 7.167e+02 1.590e+03, threshold=1.101e+03, percent-clipped=1.0 2023-06-17 23:28:08,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70320.0, ans=0.1 2023-06-17 23:28:18,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=70380.0, ans=0.125 2023-06-17 23:28:52,887 INFO [train.py:996] (0/4) Epoch 1, batch 11750, loss[loss=0.4191, simple_loss=0.4278, pruned_loss=0.2051, over 21636.00 frames. ], tot_loss[loss=0.374, simple_loss=0.4106, pruned_loss=0.1688, over 4269713.19 frames. ], batch size: 263, lr: 3.30e-02, grad_scale: 16.0 2023-06-17 23:29:18,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=70560.0, ans=0.125 2023-06-17 23:29:32,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=70620.0, ans=0.125 2023-06-17 23:29:38,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=70620.0, ans=0.0 2023-06-17 23:30:18,052 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-17 23:30:38,201 INFO [train.py:996] (0/4) Epoch 1, batch 11800, loss[loss=0.3399, simple_loss=0.4091, pruned_loss=0.1354, over 21575.00 frames. 
], tot_loss[loss=0.3803, simple_loss=0.414, pruned_loss=0.1733, over 4276520.16 frames. ], batch size: 230, lr: 3.30e-02, grad_scale: 16.0 2023-06-17 23:30:54,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=70800.0, ans=0.125 2023-06-17 23:31:09,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=70860.0, ans=0.035 2023-06-17 23:31:29,239 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.727e+02 4.871e+02 6.879e+02 1.447e+03, threshold=9.741e+02, percent-clipped=5.0 2023-06-17 23:32:22,089 INFO [train.py:996] (0/4) Epoch 1, batch 11850, loss[loss=0.3281, simple_loss=0.3932, pruned_loss=0.1316, over 21749.00 frames. ], tot_loss[loss=0.3782, simple_loss=0.4137, pruned_loss=0.1713, over 4280240.42 frames. ], batch size: 298, lr: 3.29e-02, grad_scale: 16.0 2023-06-17 23:32:53,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=71160.0, ans=0.035 2023-06-17 23:32:53,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=71160.0, ans=0.1 2023-06-17 23:33:49,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.39 vs. limit=12.0 2023-06-17 23:33:53,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71340.0, ans=0.1 2023-06-17 23:34:12,223 INFO [train.py:996] (0/4) Epoch 1, batch 11900, loss[loss=0.3157, simple_loss=0.3764, pruned_loss=0.1275, over 21696.00 frames. ], tot_loss[loss=0.3767, simple_loss=0.4158, pruned_loss=0.1688, over 4279279.18 frames. ], batch size: 247, lr: 3.29e-02, grad_scale: 16.0 2023-06-17 23:35:08,163 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 3.592e+02 4.787e+02 5.877e+02 1.275e+03, threshold=9.575e+02, percent-clipped=4.0 2023-06-17 23:35:10,948 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.75 vs. limit=10.0 2023-06-17 23:35:15,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=71520.0, ans=0.125 2023-06-17 23:35:28,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=71580.0, ans=0.125 2023-06-17 23:35:56,086 INFO [train.py:996] (0/4) Epoch 1, batch 11950, loss[loss=0.2803, simple_loss=0.3456, pruned_loss=0.1075, over 21205.00 frames. ], tot_loss[loss=0.3695, simple_loss=0.4139, pruned_loss=0.1625, over 4277147.48 frames. ], batch size: 176, lr: 3.28e-02, grad_scale: 16.0 2023-06-17 23:35:58,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.92 vs. 
limit=10.0 2023-06-17 23:36:01,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71700.0, ans=0.1 2023-06-17 23:36:16,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=71760.0, ans=0.0 2023-06-17 23:37:04,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71880.0, ans=0.1 2023-06-17 23:37:26,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71940.0, ans=0.1 2023-06-17 23:37:36,954 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-12000.pt 2023-06-17 23:37:39,785 INFO [train.py:996] (0/4) Epoch 1, batch 12000, loss[loss=0.3012, simple_loss=0.335, pruned_loss=0.1337, over 21516.00 frames. ], tot_loss[loss=0.3636, simple_loss=0.4067, pruned_loss=0.1602, over 4273755.81 frames. ], batch size: 230, lr: 3.28e-02, grad_scale: 32.0 2023-06-17 23:37:39,786 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-17 23:37:57,342 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3348, simple_loss=0.4196, pruned_loss=0.125, over 1796401.00 frames. 2023-06-17 23:37:57,343 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-17 23:38:52,944 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 3.693e+02 4.861e+02 6.052e+02 1.192e+03, threshold=9.721e+02, percent-clipped=3.0 2023-06-17 23:39:13,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=72180.0, ans=0.0 2023-06-17 23:39:35,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=72240.0, ans=0.2 2023-06-17 23:39:41,340 INFO [train.py:996] (0/4) Epoch 1, batch 12050, loss[loss=0.39, simple_loss=0.4123, pruned_loss=0.1838, over 21867.00 frames. ], tot_loss[loss=0.3672, simple_loss=0.406, pruned_loss=0.1642, over 4280368.09 frames. ], batch size: 371, lr: 3.27e-02, grad_scale: 32.0 2023-06-17 23:40:32,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=22.5 2023-06-17 23:40:52,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=72480.0, ans=0.0 2023-06-17 23:41:32,433 INFO [train.py:996] (0/4) Epoch 1, batch 12100, loss[loss=0.4079, simple_loss=0.429, pruned_loss=0.1934, over 21384.00 frames. ], tot_loss[loss=0.3828, simple_loss=0.4182, pruned_loss=0.1737, over 4282348.78 frames. ], batch size: 548, lr: 3.27e-02, grad_scale: 16.0 2023-06-17 23:41:50,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=72600.0, ans=0.015 2023-06-17 23:42:26,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.143e+02 4.439e+02 6.434e+02 8.417e+02 1.460e+03, threshold=1.287e+03, percent-clipped=16.0 2023-06-17 23:43:17,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=72840.0, ans=0.0 2023-06-17 23:43:23,521 INFO [train.py:996] (0/4) Epoch 1, batch 12150, loss[loss=0.3965, simple_loss=0.4209, pruned_loss=0.186, over 21805.00 frames. 
], tot_loss[loss=0.3845, simple_loss=0.4217, pruned_loss=0.1737, over 4280157.93 frames. ], batch size: 124, lr: 3.26e-02, grad_scale: 16.0 2023-06-17 23:43:44,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.76 vs. limit=22.5 2023-06-17 23:43:51,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=72960.0, ans=0.2 2023-06-17 23:43:55,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=72960.0, ans=0.0 2023-06-17 23:44:03,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=73020.0, ans=0.125 2023-06-17 23:44:16,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=73080.0, ans=0.125 2023-06-17 23:44:22,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=73080.0, ans=0.125 2023-06-17 23:44:48,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=73140.0, ans=0.95 2023-06-17 23:44:48,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73140.0, ans=0.1 2023-06-17 23:45:00,511 INFO [train.py:996] (0/4) Epoch 1, batch 12200, loss[loss=0.321, simple_loss=0.347, pruned_loss=0.1475, over 21588.00 frames. ], tot_loss[loss=0.3799, simple_loss=0.4164, pruned_loss=0.1717, over 4275317.64 frames. ], batch size: 247, lr: 3.26e-02, grad_scale: 16.0 2023-06-17 23:45:02,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=73200.0, ans=0.125 2023-06-17 23:45:45,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.575e+02 3.853e+02 4.664e+02 5.869e+02 1.070e+03, threshold=9.327e+02, percent-clipped=0.0 2023-06-17 23:45:57,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=73380.0, ans=0.125 2023-06-17 23:46:28,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73440.0, ans=0.1 2023-06-17 23:46:29,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=73440.0, ans=0.125 2023-06-17 23:46:42,527 INFO [train.py:996] (0/4) Epoch 1, batch 12250, loss[loss=0.2737, simple_loss=0.3367, pruned_loss=0.1054, over 21650.00 frames. ], tot_loss[loss=0.3661, simple_loss=0.4043, pruned_loss=0.164, over 4270384.57 frames. ], batch size: 247, lr: 3.25e-02, grad_scale: 16.0 2023-06-17 23:47:31,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=73620.0, ans=0.5 2023-06-17 23:47:31,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=73620.0, ans=0.2 2023-06-17 23:48:15,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.06 vs. 
limit=15.0 2023-06-17 23:48:25,678 INFO [train.py:996] (0/4) Epoch 1, batch 12300, loss[loss=0.3459, simple_loss=0.402, pruned_loss=0.1449, over 21746.00 frames. ], tot_loss[loss=0.3475, simple_loss=0.391, pruned_loss=0.152, over 4273433.06 frames. ], batch size: 332, lr: 3.25e-02, grad_scale: 16.0 2023-06-17 23:48:36,783 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.03 vs. limit=10.0 2023-06-17 23:49:01,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.21 vs. limit=6.0 2023-06-17 23:49:12,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.740e+02 4.870e+02 6.587e+02 1.091e+03, threshold=9.740e+02, percent-clipped=4.0 2023-06-17 23:50:08,223 INFO [train.py:996] (0/4) Epoch 1, batch 12350, loss[loss=0.4274, simple_loss=0.4517, pruned_loss=0.2016, over 21732.00 frames. ], tot_loss[loss=0.3494, simple_loss=0.3941, pruned_loss=0.1524, over 4279059.34 frames. ], batch size: 441, lr: 3.24e-02, grad_scale: 16.0 2023-06-17 23:50:21,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=74100.0, ans=0.0 2023-06-17 23:50:45,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=74220.0, ans=0.1 2023-06-17 23:50:56,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=22.5 2023-06-17 23:51:03,745 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:51:49,160 INFO [train.py:996] (0/4) Epoch 1, batch 12400, loss[loss=0.3743, simple_loss=0.3982, pruned_loss=0.1752, over 21401.00 frames. ], tot_loss[loss=0.3548, simple_loss=0.3959, pruned_loss=0.1569, over 4276520.48 frames. ], batch size: 144, lr: 3.24e-02, grad_scale: 32.0 2023-06-17 23:51:52,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=74400.0, ans=0.0 2023-06-17 23:52:24,406 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-17 23:52:34,933 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 4.026e+02 5.096e+02 6.661e+02 1.103e+03, threshold=1.019e+03, percent-clipped=2.0 2023-06-17 23:52:47,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=74580.0, ans=0.0 2023-06-17 23:52:54,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=74580.0, ans=0.125 2023-06-17 23:53:18,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=74640.0, ans=0.125 2023-06-17 23:53:31,330 INFO [train.py:996] (0/4) Epoch 1, batch 12450, loss[loss=0.4443, simple_loss=0.4671, pruned_loss=0.2107, over 21379.00 frames. ], tot_loss[loss=0.3625, simple_loss=0.4001, pruned_loss=0.1624, over 4276586.35 frames. 
], batch size: 131, lr: 3.23e-02, grad_scale: 32.0 2023-06-17 23:53:55,659 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. limit=10.0 2023-06-17 23:54:00,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=74760.0, ans=0.2 2023-06-17 23:54:13,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74820.0, ans=0.1 2023-06-17 23:54:46,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=74880.0, ans=0.125 2023-06-17 23:55:16,043 INFO [train.py:996] (0/4) Epoch 1, batch 12500, loss[loss=0.3625, simple_loss=0.4269, pruned_loss=0.149, over 20797.00 frames. ], tot_loss[loss=0.3773, simple_loss=0.4146, pruned_loss=0.17, over 4274220.52 frames. ], batch size: 607, lr: 3.23e-02, grad_scale: 32.0 2023-06-17 23:55:23,988 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=22.5 2023-06-17 23:55:39,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-17 23:56:14,398 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.070e+02 4.603e+02 5.505e+02 7.191e+02 1.270e+03, threshold=1.101e+03, percent-clipped=4.0 2023-06-17 23:56:15,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=75120.0, ans=0.2 2023-06-17 23:57:02,628 INFO [train.py:996] (0/4) Epoch 1, batch 12550, loss[loss=0.3599, simple_loss=0.4101, pruned_loss=0.1549, over 21975.00 frames. ], tot_loss[loss=0.3881, simple_loss=0.4245, pruned_loss=0.1759, over 4275316.26 frames. ], batch size: 317, lr: 3.22e-02, grad_scale: 32.0 2023-06-17 23:57:39,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=75360.0, ans=0.125 2023-06-17 23:57:48,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=75360.0, ans=0.2 2023-06-17 23:58:10,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-06-17 23:58:17,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-06-17 23:58:44,745 INFO [train.py:996] (0/4) Epoch 1, batch 12600, loss[loss=0.3645, simple_loss=0.4468, pruned_loss=0.1411, over 21309.00 frames. ], tot_loss[loss=0.3819, simple_loss=0.4215, pruned_loss=0.1711, over 4272677.87 frames. 
], batch size: 549, lr: 3.22e-02, grad_scale: 32.0 2023-06-17 23:58:45,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=75600.0, ans=0.125 2023-06-17 23:59:00,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=75600.0, ans=0.0 2023-06-17 23:59:41,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.279e+02 3.697e+02 4.571e+02 5.714e+02 1.241e+03, threshold=9.141e+02, percent-clipped=1.0 2023-06-18 00:00:19,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=75840.0, ans=0.125 2023-06-18 00:00:21,828 INFO [train.py:996] (0/4) Epoch 1, batch 12650, loss[loss=0.3465, simple_loss=0.4013, pruned_loss=0.1459, over 21306.00 frames. ], tot_loss[loss=0.3678, simple_loss=0.4104, pruned_loss=0.1626, over 4277022.27 frames. ], batch size: 548, lr: 3.21e-02, grad_scale: 32.0 2023-06-18 00:00:40,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=75900.0, ans=0.0 2023-06-18 00:01:53,888 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:02:03,348 INFO [train.py:996] (0/4) Epoch 1, batch 12700, loss[loss=0.3909, simple_loss=0.4189, pruned_loss=0.1815, over 21929.00 frames. ], tot_loss[loss=0.3725, simple_loss=0.4111, pruned_loss=0.167, over 4281630.01 frames. ], batch size: 372, lr: 3.21e-02, grad_scale: 32.0 2023-06-18 00:02:35,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=76260.0, ans=0.125 2023-06-18 00:02:44,421 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:02:49,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=76320.0, ans=0.125 2023-06-18 00:02:54,857 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.709e+02 4.032e+02 5.068e+02 7.064e+02 1.461e+03, threshold=1.014e+03, percent-clipped=9.0 2023-06-18 00:03:05,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=76380.0, ans=0.125 2023-06-18 00:03:16,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=76380.0, ans=0.125 2023-06-18 00:03:29,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0 2023-06-18 00:03:30,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=76440.0, ans=0.125 2023-06-18 00:03:40,096 INFO [train.py:996] (0/4) Epoch 1, batch 12750, loss[loss=0.3255, simple_loss=0.3757, pruned_loss=0.1376, over 21632.00 frames. ], tot_loss[loss=0.3748, simple_loss=0.4134, pruned_loss=0.1681, over 4282409.44 frames. 
], batch size: 263, lr: 3.20e-02, grad_scale: 32.0 2023-06-18 00:04:58,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=76680.0, ans=0.2 2023-06-18 00:05:16,844 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.38 vs. limit=6.0 2023-06-18 00:05:32,675 INFO [train.py:996] (0/4) Epoch 1, batch 12800, loss[loss=0.3895, simple_loss=0.4154, pruned_loss=0.1818, over 21846.00 frames. ], tot_loss[loss=0.3771, simple_loss=0.4137, pruned_loss=0.1703, over 4292429.99 frames. ], batch size: 371, lr: 3.20e-02, grad_scale: 32.0 2023-06-18 00:05:40,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=76800.0, ans=0.125 2023-06-18 00:05:49,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76860.0, ans=0.1 2023-06-18 00:05:49,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=76860.0, ans=0.0 2023-06-18 00:06:20,116 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.94 vs. limit=15.0 2023-06-18 00:06:20,448 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.836e+02 3.969e+02 4.961e+02 6.426e+02 1.503e+03, threshold=9.923e+02, percent-clipped=9.0 2023-06-18 00:06:30,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.33 vs. limit=6.0 2023-06-18 00:06:48,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=76980.0, ans=0.125 2023-06-18 00:07:09,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=77040.0, ans=0.125 2023-06-18 00:07:12,472 INFO [train.py:996] (0/4) Epoch 1, batch 12850, loss[loss=0.4349, simple_loss=0.4862, pruned_loss=0.1918, over 19823.00 frames. ], tot_loss[loss=0.3834, simple_loss=0.4197, pruned_loss=0.1735, over 4282539.52 frames. ], batch size: 704, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 00:07:37,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=77160.0, ans=0.125 2023-06-18 00:07:51,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=77220.0, ans=0.0 2023-06-18 00:08:47,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=77340.0, ans=0.2 2023-06-18 00:08:58,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=77340.0, ans=0.125 2023-06-18 00:08:59,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=77400.0, ans=0.125 2023-06-18 00:09:00,989 INFO [train.py:996] (0/4) Epoch 1, batch 12900, loss[loss=0.253, simple_loss=0.3088, pruned_loss=0.09857, over 21785.00 frames. ], tot_loss[loss=0.3757, simple_loss=0.4161, pruned_loss=0.1676, over 4282987.15 frames. 
], batch size: 118, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 00:09:09,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=77400.0, ans=0.125 2023-06-18 00:09:13,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-06-18 00:09:19,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=77460.0, ans=0.05 2023-06-18 00:09:46,056 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 3.853e+02 4.882e+02 6.013e+02 9.581e+02, threshold=9.764e+02, percent-clipped=0.0 2023-06-18 00:09:47,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=15.0 2023-06-18 00:10:29,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=77640.0, ans=0.2 2023-06-18 00:10:36,899 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-18 00:10:43,817 INFO [train.py:996] (0/4) Epoch 1, batch 12950, loss[loss=0.4121, simple_loss=0.4444, pruned_loss=0.1899, over 21326.00 frames. ], tot_loss[loss=0.3692, simple_loss=0.4117, pruned_loss=0.1633, over 4280819.10 frames. ], batch size: 549, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 00:11:57,259 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=7.580e-03 2023-06-18 00:12:24,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=77940.0, ans=0.125 2023-06-18 00:12:28,721 INFO [train.py:996] (0/4) Epoch 1, batch 13000, loss[loss=0.3349, simple_loss=0.3897, pruned_loss=0.1401, over 21631.00 frames. ], tot_loss[loss=0.3728, simple_loss=0.4139, pruned_loss=0.1659, over 4277285.54 frames. ], batch size: 441, lr: 3.18e-02, grad_scale: 16.0 2023-06-18 00:12:32,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78000.0, ans=0.1 2023-06-18 00:12:48,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=78060.0, ans=0.2 2023-06-18 00:13:20,466 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 4.234e+02 5.570e+02 6.916e+02 1.204e+03, threshold=1.114e+03, percent-clipped=4.0 2023-06-18 00:13:40,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=78180.0, ans=0.0 2023-06-18 00:14:09,925 INFO [train.py:996] (0/4) Epoch 1, batch 13050, loss[loss=0.2916, simple_loss=0.3481, pruned_loss=0.1176, over 17292.00 frames. ], tot_loss[loss=0.3622, simple_loss=0.406, pruned_loss=0.1593, over 4276719.59 frames. 
], batch size: 60, lr: 3.18e-02, grad_scale: 16.0 2023-06-18 00:14:19,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=78300.0, ans=0.2 2023-06-18 00:14:21,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=78300.0, ans=0.0 2023-06-18 00:14:51,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=78420.0, ans=0.125 2023-06-18 00:15:18,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=78480.0, ans=0.2 2023-06-18 00:15:55,022 INFO [train.py:996] (0/4) Epoch 1, batch 13100, loss[loss=0.3455, simple_loss=0.3912, pruned_loss=0.1499, over 21485.00 frames. ], tot_loss[loss=0.3629, simple_loss=0.4073, pruned_loss=0.1592, over 4275801.45 frames. ], batch size: 194, lr: 3.17e-02, grad_scale: 16.0 2023-06-18 00:16:54,428 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 4.696e+02 5.724e+02 7.991e+02 1.405e+03, threshold=1.145e+03, percent-clipped=4.0 2023-06-18 00:17:38,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=78840.0, ans=0.0 2023-06-18 00:17:45,926 INFO [train.py:996] (0/4) Epoch 1, batch 13150, loss[loss=0.3159, simple_loss=0.3149, pruned_loss=0.1584, over 20164.00 frames. ], tot_loss[loss=0.3701, simple_loss=0.4103, pruned_loss=0.1649, over 4274334.70 frames. ], batch size: 710, lr: 3.17e-02, grad_scale: 16.0 2023-06-18 00:18:08,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=78960.0, ans=0.2 2023-06-18 00:18:32,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=78960.0, ans=0.0 2023-06-18 00:18:34,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.45 vs. limit=6.0 2023-06-18 00:19:26,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=79140.0, ans=0.0 2023-06-18 00:19:30,330 INFO [train.py:996] (0/4) Epoch 1, batch 13200, loss[loss=0.3882, simple_loss=0.4573, pruned_loss=0.1596, over 20833.00 frames. ], tot_loss[loss=0.366, simple_loss=0.406, pruned_loss=0.163, over 4274826.67 frames. ], batch size: 608, lr: 3.16e-02, grad_scale: 32.0 2023-06-18 00:20:00,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=79260.0, ans=0.125 2023-06-18 00:20:00,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.23 vs. 
limit=22.5 2023-06-18 00:20:08,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=79260.0, ans=0.125 2023-06-18 00:20:27,588 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.628e+02 3.872e+02 4.776e+02 6.394e+02 8.489e+02, threshold=9.552e+02, percent-clipped=0.0 2023-06-18 00:20:49,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=79440.0, ans=0.0 2023-06-18 00:20:51,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=79440.0, ans=0.2 2023-06-18 00:21:12,432 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-18 00:21:18,112 INFO [train.py:996] (0/4) Epoch 1, batch 13250, loss[loss=0.3846, simple_loss=0.4347, pruned_loss=0.1673, over 21673.00 frames. ], tot_loss[loss=0.3703, simple_loss=0.4074, pruned_loss=0.1666, over 4282401.19 frames. ], batch size: 414, lr: 3.16e-02, grad_scale: 32.0 2023-06-18 00:21:35,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=79500.0, ans=0.0 2023-06-18 00:21:43,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=79560.0, ans=0.95 2023-06-18 00:22:05,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=79620.0, ans=0.0 2023-06-18 00:22:19,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=79680.0, ans=0.2 2023-06-18 00:23:07,139 INFO [train.py:996] (0/4) Epoch 1, batch 13300, loss[loss=0.4637, simple_loss=0.4816, pruned_loss=0.2229, over 21737.00 frames. ], tot_loss[loss=0.3715, simple_loss=0.4109, pruned_loss=0.1661, over 4278782.69 frames. ], batch size: 441, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 00:23:11,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=79800.0, ans=10.0 2023-06-18 00:23:24,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=79800.0, ans=0.0 2023-06-18 00:23:55,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.809e+02 3.934e+02 5.014e+02 6.811e+02 1.186e+03, threshold=1.003e+03, percent-clipped=5.0 2023-06-18 00:23:56,788 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=15.0 2023-06-18 00:23:59,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=79920.0, ans=0.0 2023-06-18 00:24:02,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=79980.0, ans=0.2 2023-06-18 00:24:51,654 INFO [train.py:996] (0/4) Epoch 1, batch 13350, loss[loss=0.4966, simple_loss=0.5043, pruned_loss=0.2444, over 21398.00 frames. ], tot_loss[loss=0.379, simple_loss=0.4169, pruned_loss=0.1706, over 4279957.85 frames. 
], batch size: 507, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 00:24:53,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=80100.0, ans=0.0 2023-06-18 00:25:10,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=80100.0, ans=0.0 2023-06-18 00:25:50,863 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.67 vs. limit=12.0 2023-06-18 00:25:53,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=80280.0, ans=0.125 2023-06-18 00:26:18,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80340.0, ans=0.1 2023-06-18 00:26:30,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=80340.0, ans=0.125 2023-06-18 00:26:39,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80400.0, ans=0.1 2023-06-18 00:26:40,435 INFO [train.py:996] (0/4) Epoch 1, batch 13400, loss[loss=0.3821, simple_loss=0.4157, pruned_loss=0.1742, over 21736.00 frames. ], tot_loss[loss=0.3813, simple_loss=0.418, pruned_loss=0.1723, over 4287632.97 frames. ], batch size: 298, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 00:27:27,344 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.979e+02 4.393e+02 5.548e+02 7.060e+02 1.249e+03, threshold=1.110e+03, percent-clipped=4.0 2023-06-18 00:28:02,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=80640.0, ans=0.125 2023-06-18 00:28:23,698 INFO [train.py:996] (0/4) Epoch 1, batch 13450, loss[loss=0.3798, simple_loss=0.4013, pruned_loss=0.1791, over 21361.00 frames. ], tot_loss[loss=0.3865, simple_loss=0.4202, pruned_loss=0.1764, over 4288559.49 frames. ], batch size: 194, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 00:28:43,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=80760.0, ans=0.0 2023-06-18 00:28:43,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=80760.0, ans=0.125 2023-06-18 00:28:47,372 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0 2023-06-18 00:29:46,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=80940.0, ans=0.2 2023-06-18 00:30:08,373 INFO [train.py:996] (0/4) Epoch 1, batch 13500, loss[loss=0.3917, simple_loss=0.4243, pruned_loss=0.1796, over 21693.00 frames. ], tot_loss[loss=0.3704, simple_loss=0.4044, pruned_loss=0.1682, over 4284053.22 frames. 
], batch size: 391, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 00:30:41,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=81060.0, ans=0.0 2023-06-18 00:31:05,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81120.0, ans=0.1 2023-06-18 00:31:07,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.719e+02 4.002e+02 4.680e+02 6.090e+02 1.151e+03, threshold=9.360e+02, percent-clipped=1.0 2023-06-18 00:31:19,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81180.0, ans=0.1 2023-06-18 00:31:41,848 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.36 vs. limit=22.5 2023-06-18 00:31:43,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=81240.0, ans=0.125 2023-06-18 00:31:52,226 INFO [train.py:996] (0/4) Epoch 1, batch 13550, loss[loss=0.4275, simple_loss=0.4839, pruned_loss=0.1855, over 21697.00 frames. ], tot_loss[loss=0.3715, simple_loss=0.409, pruned_loss=0.1669, over 4280158.90 frames. ], batch size: 389, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 00:32:24,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.66 vs. limit=22.5 2023-06-18 00:32:52,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81420.0, ans=0.1 2023-06-18 00:32:58,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=81480.0, ans=0.125 2023-06-18 00:32:59,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=81480.0, ans=0.125 2023-06-18 00:33:34,654 INFO [train.py:996] (0/4) Epoch 1, batch 13600, loss[loss=0.3342, simple_loss=0.3842, pruned_loss=0.1421, over 16376.00 frames. ], tot_loss[loss=0.3744, simple_loss=0.4123, pruned_loss=0.1683, over 4277781.36 frames. ], batch size: 61, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 00:33:59,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=81660.0, ans=0.05 2023-06-18 00:34:27,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 4.484e+02 6.125e+02 7.575e+02 1.688e+03, threshold=1.225e+03, percent-clipped=13.0 2023-06-18 00:34:37,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=81780.0, ans=0.0 2023-06-18 00:35:04,187 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.89 vs. limit=12.0 2023-06-18 00:35:10,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81900.0, ans=0.1 2023-06-18 00:35:11,064 INFO [train.py:996] (0/4) Epoch 1, batch 13650, loss[loss=0.3204, simple_loss=0.3581, pruned_loss=0.1414, over 21637.00 frames. ], tot_loss[loss=0.3675, simple_loss=0.407, pruned_loss=0.164, over 4270516.10 frames. 
], batch size: 332, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 00:35:11,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=81900.0, ans=0.125 2023-06-18 00:35:36,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=81960.0, ans=0.125 2023-06-18 00:36:03,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=82020.0, ans=0.0 2023-06-18 00:36:03,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=82020.0, ans=0.125 2023-06-18 00:36:18,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=82080.0, ans=0.0 2023-06-18 00:36:22,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.03 vs. limit=10.0 2023-06-18 00:36:35,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=82080.0, ans=0.2 2023-06-18 00:36:35,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=82080.0, ans=0.125 2023-06-18 00:36:59,446 INFO [train.py:996] (0/4) Epoch 1, batch 13700, loss[loss=0.3665, simple_loss=0.4031, pruned_loss=0.165, over 21775.00 frames. ], tot_loss[loss=0.364, simple_loss=0.4011, pruned_loss=0.1635, over 4266259.94 frames. ], batch size: 332, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 00:37:10,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=82200.0, ans=0.125 2023-06-18 00:37:46,411 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-18 00:37:50,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=82320.0, ans=0.125 2023-06-18 00:37:53,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 3.850e+02 5.196e+02 6.756e+02 1.127e+03, threshold=1.039e+03, percent-clipped=0.0 2023-06-18 00:38:13,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=82380.0, ans=0.125 2023-06-18 00:38:42,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=82500.0, ans=0.0 2023-06-18 00:38:43,669 INFO [train.py:996] (0/4) Epoch 1, batch 13750, loss[loss=0.3779, simple_loss=0.424, pruned_loss=0.1659, over 21584.00 frames. ], tot_loss[loss=0.358, simple_loss=0.3965, pruned_loss=0.1597, over 4265385.33 frames. 
], batch size: 441, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 00:39:32,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=82620.0, ans=0.125 2023-06-18 00:39:36,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=82620.0, ans=0.125 2023-06-18 00:39:51,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=82680.0, ans=0.2 2023-06-18 00:40:06,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=82680.0, ans=0.125 2023-06-18 00:40:31,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=82740.0, ans=0.0 2023-06-18 00:40:40,346 INFO [train.py:996] (0/4) Epoch 1, batch 13800, loss[loss=0.2372, simple_loss=0.276, pruned_loss=0.09924, over 16390.00 frames. ], tot_loss[loss=0.3594, simple_loss=0.4016, pruned_loss=0.1586, over 4261458.28 frames. ], batch size: 61, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 00:41:25,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=82920.0, ans=0.0 2023-06-18 00:41:28,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.821e+02 3.900e+02 5.256e+02 6.721e+02 1.169e+03, threshold=1.051e+03, percent-clipped=1.0 2023-06-18 00:41:41,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=82980.0, ans=0.0 2023-06-18 00:42:22,650 INFO [train.py:996] (0/4) Epoch 1, batch 13850, loss[loss=0.376, simple_loss=0.4229, pruned_loss=0.1646, over 21787.00 frames. ], tot_loss[loss=0.3635, simple_loss=0.4073, pruned_loss=0.1599, over 4266722.85 frames. ], batch size: 282, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 00:42:29,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=83100.0, ans=0.1 2023-06-18 00:42:50,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=83160.0, ans=0.125 2023-06-18 00:43:37,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=83280.0, ans=0.0 2023-06-18 00:43:42,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=83280.0, ans=0.0 2023-06-18 00:44:06,040 INFO [train.py:996] (0/4) Epoch 1, batch 13900, loss[loss=0.3528, simple_loss=0.3835, pruned_loss=0.1611, over 21683.00 frames. ], tot_loss[loss=0.3744, simple_loss=0.4139, pruned_loss=0.1674, over 4267638.92 frames. 
], batch size: 263, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 00:44:58,294 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.781e+02 4.127e+02 5.100e+02 6.768e+02 1.105e+03, threshold=1.020e+03, percent-clipped=2.0 2023-06-18 00:45:02,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=83520.0, ans=0.0 2023-06-18 00:45:10,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=83580.0, ans=0.125 2023-06-18 00:45:18,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=83580.0, ans=0.0 2023-06-18 00:45:20,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=83580.0, ans=0.1 2023-06-18 00:45:28,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=83640.0, ans=0.1 2023-06-18 00:45:48,136 INFO [train.py:996] (0/4) Epoch 1, batch 13950, loss[loss=0.3788, simple_loss=0.4096, pruned_loss=0.174, over 21845.00 frames. ], tot_loss[loss=0.3787, simple_loss=0.4157, pruned_loss=0.1709, over 4277171.84 frames. ], batch size: 351, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 00:46:07,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.83 vs. limit=6.0 2023-06-18 00:46:13,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=83760.0, ans=0.025 2023-06-18 00:46:21,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=83760.0, ans=0.125 2023-06-18 00:47:17,803 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:47:30,525 INFO [train.py:996] (0/4) Epoch 1, batch 14000, loss[loss=0.3801, simple_loss=0.4832, pruned_loss=0.1385, over 20872.00 frames. ], tot_loss[loss=0.3676, simple_loss=0.4065, pruned_loss=0.1643, over 4266603.93 frames. ], batch size: 607, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 00:47:40,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=12.0 2023-06-18 00:48:11,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=15.0 2023-06-18 00:48:12,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=84120.0, ans=0.125 2023-06-18 00:48:28,677 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 3.708e+02 4.933e+02 6.099e+02 9.890e+02, threshold=9.866e+02, percent-clipped=0.0 2023-06-18 00:48:45,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=84180.0, ans=0.0 2023-06-18 00:49:18,922 INFO [train.py:996] (0/4) Epoch 1, batch 14050, loss[loss=0.3165, simple_loss=0.3509, pruned_loss=0.141, over 21942.00 frames. ], tot_loss[loss=0.3601, simple_loss=0.4019, pruned_loss=0.1591, over 4265194.94 frames. 
], batch size: 113, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 00:49:34,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.01 vs. limit=22.5 2023-06-18 00:49:56,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=84420.0, ans=0.0 2023-06-18 00:50:01,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=84420.0, ans=0.0 2023-06-18 00:51:01,829 INFO [train.py:996] (0/4) Epoch 1, batch 14100, loss[loss=0.3651, simple_loss=0.3889, pruned_loss=0.1706, over 21240.00 frames. ], tot_loss[loss=0.3577, simple_loss=0.3966, pruned_loss=0.1594, over 4248820.19 frames. ], batch size: 176, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 00:51:53,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84720.0, ans=0.1 2023-06-18 00:51:54,172 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 4.143e+02 4.965e+02 6.574e+02 1.166e+03, threshold=9.930e+02, percent-clipped=2.0 2023-06-18 00:52:04,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=84780.0, ans=0.2 2023-06-18 00:52:37,626 INFO [train.py:996] (0/4) Epoch 1, batch 14150, loss[loss=0.2904, simple_loss=0.3565, pruned_loss=0.1122, over 21772.00 frames. ], tot_loss[loss=0.3569, simple_loss=0.3969, pruned_loss=0.1585, over 4254110.88 frames. ], batch size: 102, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 00:52:59,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=84960.0, ans=0.1 2023-06-18 00:53:40,301 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.11 vs. limit=6.0 2023-06-18 00:54:03,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=85140.0, ans=0.125 2023-06-18 00:54:13,114 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.03 vs. limit=22.5 2023-06-18 00:54:18,331 INFO [train.py:996] (0/4) Epoch 1, batch 14200, loss[loss=0.3341, simple_loss=0.3644, pruned_loss=0.1519, over 21784.00 frames. ], tot_loss[loss=0.3506, simple_loss=0.3925, pruned_loss=0.1543, over 4263293.40 frames. ], batch size: 98, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 00:54:46,626 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-06-18 00:55:09,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 4.023e+02 4.862e+02 6.439e+02 1.166e+03, threshold=9.724e+02, percent-clipped=3.0 2023-06-18 00:55:37,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=85440.0, ans=0.0 2023-06-18 00:55:52,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.40 vs. limit=15.0 2023-06-18 00:55:59,276 INFO [train.py:996] (0/4) Epoch 1, batch 14250, loss[loss=0.3601, simple_loss=0.3848, pruned_loss=0.1677, over 21964.00 frames. 
], tot_loss[loss=0.3479, simple_loss=0.388, pruned_loss=0.154, over 4258685.36 frames. ], batch size: 103, lr: 3.07e-02, grad_scale: 32.0 2023-06-18 00:56:06,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.28 vs. limit=22.5 2023-06-18 00:56:44,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=85620.0, ans=0.0 2023-06-18 00:57:43,794 INFO [train.py:996] (0/4) Epoch 1, batch 14300, loss[loss=0.3922, simple_loss=0.4433, pruned_loss=0.1706, over 21424.00 frames. ], tot_loss[loss=0.3499, simple_loss=0.3908, pruned_loss=0.1545, over 4262622.36 frames. ], batch size: 211, lr: 3.07e-02, grad_scale: 32.0 2023-06-18 00:58:03,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=85800.0, ans=0.07 2023-06-18 00:58:15,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-18 00:58:22,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85920.0, ans=0.1 2023-06-18 00:58:38,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.854e+02 5.533e+02 8.207e+02 1.409e+03, threshold=1.107e+03, percent-clipped=13.0 2023-06-18 00:58:49,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=85980.0, ans=0.0 2023-06-18 00:59:26,788 INFO [train.py:996] (0/4) Epoch 1, batch 14350, loss[loss=0.1816, simple_loss=0.1974, pruned_loss=0.0829, over 17155.00 frames. ], tot_loss[loss=0.3524, simple_loss=0.3947, pruned_loss=0.1551, over 4267268.53 frames. ], batch size: 61, lr: 3.06e-02, grad_scale: 16.0 2023-06-18 01:00:51,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=86340.0, ans=0.1 2023-06-18 01:01:08,529 INFO [train.py:996] (0/4) Epoch 1, batch 14400, loss[loss=0.3271, simple_loss=0.3577, pruned_loss=0.1483, over 21433.00 frames. ], tot_loss[loss=0.3533, simple_loss=0.3931, pruned_loss=0.1567, over 4275318.11 frames. ], batch size: 211, lr: 3.06e-02, grad_scale: 32.0 2023-06-18 01:01:09,779 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.82 vs. limit=6.0 2023-06-18 01:02:05,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=86520.0, ans=0.125 2023-06-18 01:02:08,120 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.669e+02 3.871e+02 4.703e+02 5.738e+02 1.217e+03, threshold=9.407e+02, percent-clipped=2.0 2023-06-18 01:02:21,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.39 vs. 
limit=15.0 2023-06-18 01:02:24,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=86580.0, ans=0.0 2023-06-18 01:02:44,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=86640.0, ans=0.2 2023-06-18 01:02:50,421 INFO [train.py:996] (0/4) Epoch 1, batch 14450, loss[loss=0.3964, simple_loss=0.406, pruned_loss=0.1934, over 21748.00 frames. ], tot_loss[loss=0.3511, simple_loss=0.3885, pruned_loss=0.1569, over 4269174.37 frames. ], batch size: 333, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 01:03:10,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=86760.0, ans=0.125 2023-06-18 01:03:15,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=86760.0, ans=0.2 2023-06-18 01:03:20,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=86760.0, ans=0.0 2023-06-18 01:03:35,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=86820.0, ans=0.07 2023-06-18 01:04:12,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=86940.0, ans=0.0 2023-06-18 01:04:29,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=86940.0, ans=0.125 2023-06-18 01:04:33,729 INFO [train.py:996] (0/4) Epoch 1, batch 14500, loss[loss=0.3213, simple_loss=0.3518, pruned_loss=0.1455, over 21244.00 frames. ], tot_loss[loss=0.35, simple_loss=0.3867, pruned_loss=0.1566, over 4266763.10 frames. ], batch size: 159, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 01:05:33,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=87120.0, ans=0.0 2023-06-18 01:05:36,052 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 3.944e+02 5.039e+02 7.491e+02 1.788e+03, threshold=1.008e+03, percent-clipped=13.0 2023-06-18 01:05:42,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.76 vs. limit=15.0 2023-06-18 01:06:01,016 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-18 01:06:14,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=87240.0, ans=0.0 2023-06-18 01:06:17,605 INFO [train.py:996] (0/4) Epoch 1, batch 14550, loss[loss=0.3559, simple_loss=0.3967, pruned_loss=0.1576, over 21350.00 frames. ], tot_loss[loss=0.3588, simple_loss=0.3956, pruned_loss=0.161, over 4271645.55 frames. ], batch size: 159, lr: 3.05e-02, grad_scale: 16.0 2023-06-18 01:07:15,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=87420.0, ans=0.125 2023-06-18 01:07:40,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=40.79 vs. limit=15.0 2023-06-18 01:07:40,974 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.94 vs. 
limit=15.0 2023-06-18 01:08:01,705 INFO [train.py:996] (0/4) Epoch 1, batch 14600, loss[loss=0.3442, simple_loss=0.4018, pruned_loss=0.1433, over 21323.00 frames. ], tot_loss[loss=0.371, simple_loss=0.4059, pruned_loss=0.1681, over 4278994.06 frames. ], batch size: 176, lr: 3.04e-02, grad_scale: 16.0 2023-06-18 01:08:19,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=87600.0, ans=0.125 2023-06-18 01:08:55,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=87720.0, ans=0.1 2023-06-18 01:09:02,467 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.680e+02 4.017e+02 5.030e+02 6.430e+02 1.157e+03, threshold=1.006e+03, percent-clipped=2.0 2023-06-18 01:09:16,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=87780.0, ans=0.125 2023-06-18 01:09:43,536 INFO [train.py:996] (0/4) Epoch 1, batch 14650, loss[loss=0.283, simple_loss=0.3534, pruned_loss=0.1063, over 21842.00 frames. ], tot_loss[loss=0.3703, simple_loss=0.4082, pruned_loss=0.1662, over 4282108.88 frames. ], batch size: 316, lr: 3.04e-02, grad_scale: 16.0 2023-06-18 01:09:44,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=87900.0, ans=0.1 2023-06-18 01:10:11,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=87960.0, ans=0.125 2023-06-18 01:10:11,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=87960.0, ans=0.125 2023-06-18 01:10:23,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=87960.0, ans=0.2 2023-06-18 01:10:23,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=87960.0, ans=0.0 2023-06-18 01:10:27,117 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.71 vs. limit=6.0 2023-06-18 01:10:37,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.87 vs. limit=22.5 2023-06-18 01:11:01,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=88080.0, ans=0.0 2023-06-18 01:11:01,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-06-18 01:11:30,888 INFO [train.py:996] (0/4) Epoch 1, batch 14700, loss[loss=0.2923, simple_loss=0.347, pruned_loss=0.1188, over 21808.00 frames. ], tot_loss[loss=0.3556, simple_loss=0.3985, pruned_loss=0.1563, over 4281425.92 frames. 
], batch size: 124, lr: 3.03e-02, grad_scale: 16.0 2023-06-18 01:11:42,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=88200.0, ans=0.0 2023-06-18 01:11:50,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88260.0, ans=0.1 2023-06-18 01:12:00,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=88260.0, ans=0.125 2023-06-18 01:12:32,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 3.580e+02 4.552e+02 5.267e+02 1.016e+03, threshold=9.103e+02, percent-clipped=1.0 2023-06-18 01:12:33,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=88320.0, ans=0.0 2023-06-18 01:12:46,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=88380.0, ans=0.05 2023-06-18 01:13:13,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=88500.0, ans=0.125 2023-06-18 01:13:14,656 INFO [train.py:996] (0/4) Epoch 1, batch 14750, loss[loss=0.3518, simple_loss=0.3737, pruned_loss=0.165, over 16596.00 frames. ], tot_loss[loss=0.3638, simple_loss=0.4055, pruned_loss=0.1611, over 4279476.36 frames. ], batch size: 60, lr: 3.03e-02, grad_scale: 16.0 2023-06-18 01:13:15,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=88500.0, ans=0.0 2023-06-18 01:13:15,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88500.0, ans=0.1 2023-06-18 01:14:07,081 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.41 vs. limit=22.5 2023-06-18 01:14:59,905 INFO [train.py:996] (0/4) Epoch 1, batch 14800, loss[loss=0.3624, simple_loss=0.3937, pruned_loss=0.1655, over 21508.00 frames. ], tot_loss[loss=0.3805, simple_loss=0.42, pruned_loss=0.1705, over 4279187.37 frames. ], batch size: 230, lr: 3.03e-02, grad_scale: 32.0 2023-06-18 01:15:23,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=88800.0, ans=0.0 2023-06-18 01:16:01,204 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:16:02,071 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.486e+02 4.489e+02 5.229e+02 7.110e+02 1.407e+03, threshold=1.046e+03, percent-clipped=11.0 2023-06-18 01:16:06,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=12.0 2023-06-18 01:16:28,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=89040.0, ans=0.0 2023-06-18 01:16:47,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=89040.0, ans=0.125 2023-06-18 01:16:55,423 INFO [train.py:996] (0/4) Epoch 1, batch 14850, loss[loss=0.4601, simple_loss=0.4838, pruned_loss=0.2182, over 21645.00 frames. ], tot_loss[loss=0.3743, simple_loss=0.4115, pruned_loss=0.1685, over 4277742.07 frames. 
], batch size: 414, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 01:17:37,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=89220.0, ans=0.1 2023-06-18 01:18:19,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=89340.0, ans=0.0 2023-06-18 01:18:25,306 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.86 vs. limit=6.0 2023-06-18 01:18:41,328 INFO [train.py:996] (0/4) Epoch 1, batch 14900, loss[loss=0.3948, simple_loss=0.4304, pruned_loss=0.1796, over 21707.00 frames. ], tot_loss[loss=0.3785, simple_loss=0.4159, pruned_loss=0.1705, over 4274646.51 frames. ], batch size: 351, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 01:18:41,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=89400.0, ans=0.125 2023-06-18 01:18:45,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=89400.0, ans=0.125 2023-06-18 01:18:45,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=89400.0, ans=0.125 2023-06-18 01:18:58,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=89460.0, ans=0.125 2023-06-18 01:19:34,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.655e+02 4.060e+02 5.286e+02 6.323e+02 1.154e+03, threshold=1.057e+03, percent-clipped=2.0 2023-06-18 01:20:11,769 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-18 01:20:18,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=89640.0, ans=15.0 2023-06-18 01:20:21,544 INFO [train.py:996] (0/4) Epoch 1, batch 14950, loss[loss=0.2954, simple_loss=0.3572, pruned_loss=0.1168, over 21202.00 frames. ], tot_loss[loss=0.378, simple_loss=0.4172, pruned_loss=0.1694, over 4271364.45 frames. ], batch size: 143, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 01:20:25,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=89700.0, ans=0.05 2023-06-18 01:20:55,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=89760.0, ans=0.0 2023-06-18 01:21:36,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-18 01:22:04,801 INFO [train.py:996] (0/4) Epoch 1, batch 15000, loss[loss=0.3334, simple_loss=0.372, pruned_loss=0.1474, over 21871.00 frames. ], tot_loss[loss=0.3811, simple_loss=0.4188, pruned_loss=0.1717, over 4272665.21 frames. ], batch size: 98, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 01:22:04,803 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 01:22:23,156 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3215, simple_loss=0.4085, pruned_loss=0.1173, over 1796401.00 frames. 
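Note on the logged quantities: across this section the reported loss is consistent with a weighted sum of the two other loss fields, loss ≈ 0.5 * simple_loss + pruned_loss (for the validation entry just above, 0.5 * 0.4085 + 0.1173 ≈ 0.3215), and each "Clipping_scale=2.0, grad-norm quartiles ..." entry reports a threshold equal to 2.0 times the middle of the five quartile values. The sketch below only illustrates those two numeric relations as inferred from the entries themselves; the function names, the fixed 1.0 scale on the pruned loss, and the reading of the five quartiles as min/25%/50%/75%/max are assumptions, not taken from train.py or optim.py.

    # Illustrative sketch (assumed names and scales), checked against values
    # printed in the surrounding log entries.

    def combined_loss(simple_loss, pruned_loss, simple_scale=0.5, pruned_scale=1.0):
        # Assumed weighting inferred from the logged loss/simple_loss/pruned_loss triples.
        return simple_scale * simple_loss + pruned_scale * pruned_loss

    def clip_threshold(grad_norm_quartiles, clipping_scale=2.0):
        # Assumed rule: threshold = clipping_scale * middle of the five logged
        # grad-norm quartiles (read here as min/25%/median/75%/max).
        return clipping_scale * grad_norm_quartiles[2]

    # Validation entry above: loss=0.3215, simple_loss=0.4085, pruned_loss=0.1173
    assert abs(combined_loss(0.4085, 0.1173) - 0.3215) < 1e-3
    # Epoch 1, batch 15000 entry: loss=0.3334, simple_loss=0.372, pruned_loss=0.1474
    assert abs(combined_loss(0.372, 0.1474) - 0.3334) < 1e-3
    # Clipping entry at 01:19:34: quartiles 2.655e+02 ... 1.154e+03, threshold=1.057e+03
    assert abs(clip_threshold([265.5, 406.0, 528.6, 632.3, 1154.0]) - 1057.0) < 1.0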
2023-06-18 01:22:23,157 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 01:22:25,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=90000.0, ans=0.125 2023-06-18 01:22:26,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-06-18 01:22:32,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=90000.0, ans=10.0 2023-06-18 01:22:50,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90060.0, ans=0.1 2023-06-18 01:23:25,159 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.727e+02 3.992e+02 4.836e+02 5.829e+02 8.010e+02, threshold=9.672e+02, percent-clipped=0.0 2023-06-18 01:23:25,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90120.0, ans=0.1 2023-06-18 01:23:59,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90240.0, ans=0.1 2023-06-18 01:24:12,196 INFO [train.py:996] (0/4) Epoch 1, batch 15050, loss[loss=0.3979, simple_loss=0.434, pruned_loss=0.1809, over 21656.00 frames. ], tot_loss[loss=0.3833, simple_loss=0.4206, pruned_loss=0.173, over 4260154.20 frames. ], batch size: 389, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 01:24:44,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=90360.0, ans=0.125 2023-06-18 01:25:55,096 INFO [train.py:996] (0/4) Epoch 1, batch 15100, loss[loss=0.4956, simple_loss=0.5002, pruned_loss=0.2455, over 21463.00 frames. ], tot_loss[loss=0.3826, simple_loss=0.4212, pruned_loss=0.172, over 4265722.28 frames. ], batch size: 471, lr: 3.00e-02, grad_scale: 32.0 2023-06-18 01:26:15,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.87 vs. limit=22.5 2023-06-18 01:26:50,651 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 4.036e+02 5.408e+02 6.449e+02 1.241e+03, threshold=1.082e+03, percent-clipped=5.0 2023-06-18 01:27:00,667 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.05 vs. limit=22.5 2023-06-18 01:27:22,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-18 01:27:32,777 INFO [train.py:996] (0/4) Epoch 1, batch 15150, loss[loss=0.3013, simple_loss=0.337, pruned_loss=0.1328, over 21586.00 frames. ], tot_loss[loss=0.3817, simple_loss=0.4176, pruned_loss=0.1729, over 4268655.68 frames. ], batch size: 231, lr: 3.00e-02, grad_scale: 32.0 2023-06-18 01:27:51,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90900.0, ans=0.1 2023-06-18 01:28:00,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.51 vs. 
limit=22.5 2023-06-18 01:28:22,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.84 vs. limit=6.0 2023-06-18 01:29:15,460 INFO [train.py:996] (0/4) Epoch 1, batch 15200, loss[loss=0.3461, simple_loss=0.3887, pruned_loss=0.1517, over 21597.00 frames. ], tot_loss[loss=0.369, simple_loss=0.4054, pruned_loss=0.1663, over 4271914.66 frames. ], batch size: 263, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 01:29:50,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=91260.0, ans=0.0 2023-06-18 01:30:15,296 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.443e+02 3.952e+02 4.981e+02 6.119e+02 1.167e+03, threshold=9.963e+02, percent-clipped=1.0 2023-06-18 01:30:17,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=91380.0, ans=0.0 2023-06-18 01:30:56,069 INFO [train.py:996] (0/4) Epoch 1, batch 15250, loss[loss=0.3533, simple_loss=0.3743, pruned_loss=0.1661, over 21242.00 frames. ], tot_loss[loss=0.3652, simple_loss=0.4009, pruned_loss=0.1648, over 4269382.42 frames. ], batch size: 176, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 01:31:29,931 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=22.5 2023-06-18 01:31:39,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=91560.0, ans=0.125 2023-06-18 01:31:45,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=91620.0, ans=0.2 2023-06-18 01:32:16,042 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-18 01:32:22,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=91740.0, ans=0.2 2023-06-18 01:32:38,846 INFO [train.py:996] (0/4) Epoch 1, batch 15300, loss[loss=0.397, simple_loss=0.4314, pruned_loss=0.1812, over 21284.00 frames. ], tot_loss[loss=0.3709, simple_loss=0.4039, pruned_loss=0.169, over 4266637.54 frames. ], batch size: 143, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 01:32:48,826 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.63 vs. limit=15.0 2023-06-18 01:33:17,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=24.02 vs. limit=15.0 2023-06-18 01:33:22,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=91860.0, ans=0.125 2023-06-18 01:33:46,560 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.623e+02 4.254e+02 5.015e+02 5.905e+02 1.167e+03, threshold=1.003e+03, percent-clipped=1.0 2023-06-18 01:33:57,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=91980.0, ans=0.95 2023-06-18 01:34:12,574 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.45 vs. 
limit=15.0 2023-06-18 01:34:16,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.85 vs. limit=15.0 2023-06-18 01:34:28,069 INFO [train.py:996] (0/4) Epoch 1, batch 15350, loss[loss=0.3533, simple_loss=0.4068, pruned_loss=0.1499, over 21444.00 frames. ], tot_loss[loss=0.3764, simple_loss=0.409, pruned_loss=0.1719, over 4269343.60 frames. ], batch size: 211, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 01:34:30,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.23 vs. limit=15.0 2023-06-18 01:35:36,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=92280.0, ans=0.1 2023-06-18 01:35:49,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=92340.0, ans=0.0 2023-06-18 01:35:55,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=92340.0, ans=0.125 2023-06-18 01:36:04,744 INFO [train.py:996] (0/4) Epoch 1, batch 15400, loss[loss=0.3037, simple_loss=0.3723, pruned_loss=0.1175, over 21834.00 frames. ], tot_loss[loss=0.3754, simple_loss=0.4102, pruned_loss=0.1703, over 4267224.91 frames. ], batch size: 102, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 01:36:20,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=92400.0, ans=0.125 2023-06-18 01:36:43,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=92460.0, ans=0.0 2023-06-18 01:37:10,692 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.478e+02 3.998e+02 4.934e+02 5.907e+02 9.449e+02, threshold=9.868e+02, percent-clipped=0.0 2023-06-18 01:37:46,504 INFO [train.py:996] (0/4) Epoch 1, batch 15450, loss[loss=0.3587, simple_loss=0.4033, pruned_loss=0.157, over 21386.00 frames. ], tot_loss[loss=0.3732, simple_loss=0.408, pruned_loss=0.1692, over 4261200.87 frames. ], batch size: 211, lr: 2.97e-02, grad_scale: 32.0 2023-06-18 01:38:58,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=92880.0, ans=0.2 2023-06-18 01:39:35,854 INFO [train.py:996] (0/4) Epoch 1, batch 15500, loss[loss=0.3943, simple_loss=0.4228, pruned_loss=0.1829, over 21917.00 frames. ], tot_loss[loss=0.3734, simple_loss=0.4102, pruned_loss=0.1683, over 4257828.69 frames. ], batch size: 316, lr: 2.97e-02, grad_scale: 32.0 2023-06-18 01:39:43,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=93000.0, ans=0.125 2023-06-18 01:39:47,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=22.5 2023-06-18 01:40:02,574 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=15.0 2023-06-18 01:40:39,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.584e+02 4.777e+02 6.158e+02 1.272e+03, threshold=9.553e+02, percent-clipped=7.0 2023-06-18 01:41:30,363 INFO [train.py:996] (0/4) Epoch 1, batch 15550, loss[loss=0.3316, simple_loss=0.3847, pruned_loss=0.1392, over 21726.00 frames. 
], tot_loss[loss=0.3702, simple_loss=0.4105, pruned_loss=0.1649, over 4260546.72 frames. ], batch size: 332, lr: 2.97e-02, grad_scale: 16.0 2023-06-18 01:41:39,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=93300.0, ans=0.125 2023-06-18 01:41:42,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93300.0, ans=0.1 2023-06-18 01:41:54,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=93360.0, ans=0.0 2023-06-18 01:41:57,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=93360.0, ans=0.125 2023-06-18 01:42:33,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93480.0, ans=0.1 2023-06-18 01:42:37,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=93480.0, ans=0.0 2023-06-18 01:43:13,888 INFO [train.py:996] (0/4) Epoch 1, batch 15600, loss[loss=0.3501, simple_loss=0.3712, pruned_loss=0.1645, over 21586.00 frames. ], tot_loss[loss=0.364, simple_loss=0.4037, pruned_loss=0.1621, over 4254404.45 frames. ], batch size: 332, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 01:43:39,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=93660.0, ans=0.0 2023-06-18 01:43:49,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93660.0, ans=0.1 2023-06-18 01:43:49,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=93660.0, ans=0.1 2023-06-18 01:44:06,395 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.574e+02 3.769e+02 4.800e+02 6.009e+02 1.224e+03, threshold=9.599e+02, percent-clipped=5.0 2023-06-18 01:44:09,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.06 vs. limit=10.0 2023-06-18 01:44:43,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=93840.0, ans=0.07 2023-06-18 01:44:56,285 INFO [train.py:996] (0/4) Epoch 1, batch 15650, loss[loss=0.3656, simple_loss=0.3902, pruned_loss=0.1705, over 21750.00 frames. ], tot_loss[loss=0.3605, simple_loss=0.3998, pruned_loss=0.1606, over 4264265.51 frames. ], batch size: 351, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 01:45:15,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=93900.0, ans=0.1 2023-06-18 01:46:16,484 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.19 vs. limit=22.5 2023-06-18 01:46:38,889 INFO [train.py:996] (0/4) Epoch 1, batch 15700, loss[loss=0.3344, simple_loss=0.3628, pruned_loss=0.153, over 21839.00 frames. ], tot_loss[loss=0.3552, simple_loss=0.3937, pruned_loss=0.1583, over 4266954.06 frames. 
], batch size: 98, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 01:46:39,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=94200.0, ans=0.0 2023-06-18 01:47:06,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=94260.0, ans=0.2 2023-06-18 01:47:08,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-18 01:47:31,863 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.759e+02 5.241e+02 6.627e+02 1.144e+03, threshold=1.048e+03, percent-clipped=4.0 2023-06-18 01:48:02,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=94440.0, ans=0.125 2023-06-18 01:48:09,049 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=15.0 2023-06-18 01:48:18,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=94440.0, ans=0.2 2023-06-18 01:48:21,249 INFO [train.py:996] (0/4) Epoch 1, batch 15750, loss[loss=0.3636, simple_loss=0.3849, pruned_loss=0.1712, over 21778.00 frames. ], tot_loss[loss=0.3514, simple_loss=0.3884, pruned_loss=0.1572, over 4261754.52 frames. ], batch size: 112, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 01:48:37,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.83 vs. limit=15.0 2023-06-18 01:48:42,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-18 01:49:10,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=94620.0, ans=0.125 2023-06-18 01:49:53,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=94740.0, ans=0.125 2023-06-18 01:50:03,653 INFO [train.py:996] (0/4) Epoch 1, batch 15800, loss[loss=0.335, simple_loss=0.3602, pruned_loss=0.1549, over 21449.00 frames. ], tot_loss[loss=0.3471, simple_loss=0.3826, pruned_loss=0.1558, over 4259563.40 frames. ], batch size: 212, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 01:51:07,134 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 3.489e+02 4.310e+02 5.461e+02 1.002e+03, threshold=8.621e+02, percent-clipped=0.0 2023-06-18 01:51:14,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=94980.0, ans=0.125 2023-06-18 01:51:17,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=94980.0, ans=0.125 2023-06-18 01:51:26,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.50 vs. 
limit=15.0 2023-06-18 01:51:29,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=95040.0, ans=0.125 2023-06-18 01:51:47,045 INFO [train.py:996] (0/4) Epoch 1, batch 15850, loss[loss=0.3255, simple_loss=0.3466, pruned_loss=0.1522, over 21798.00 frames. ], tot_loss[loss=0.3492, simple_loss=0.3829, pruned_loss=0.1577, over 4270472.59 frames. ], batch size: 107, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 01:52:08,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.61 vs. limit=15.0 2023-06-18 01:53:06,938 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:53:27,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=95340.0, ans=0.0 2023-06-18 01:53:31,123 INFO [train.py:996] (0/4) Epoch 1, batch 15900, loss[loss=0.3616, simple_loss=0.3994, pruned_loss=0.1619, over 21688.00 frames. ], tot_loss[loss=0.3521, simple_loss=0.3845, pruned_loss=0.1598, over 4268218.78 frames. ], batch size: 351, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 01:54:17,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=95520.0, ans=0.5 2023-06-18 01:54:28,633 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.552e+02 4.278e+02 5.211e+02 7.153e+02 1.346e+03, threshold=1.042e+03, percent-clipped=13.0 2023-06-18 01:54:29,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=95580.0, ans=0.125 2023-06-18 01:54:47,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-18 01:55:03,829 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.08 vs. limit=22.5 2023-06-18 01:55:07,522 INFO [train.py:996] (0/4) Epoch 1, batch 15950, loss[loss=0.2953, simple_loss=0.3805, pruned_loss=0.105, over 21639.00 frames. ], tot_loss[loss=0.3482, simple_loss=0.3836, pruned_loss=0.1564, over 4258990.21 frames. ], batch size: 389, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 01:55:33,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=22.5 2023-06-18 01:56:18,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95880.0, ans=0.1 2023-06-18 01:56:51,186 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-16000.pt 2023-06-18 01:56:54,410 INFO [train.py:996] (0/4) Epoch 1, batch 16000, loss[loss=0.3554, simple_loss=0.4184, pruned_loss=0.1462, over 21869.00 frames. ], tot_loss[loss=0.3445, simple_loss=0.3841, pruned_loss=0.1525, over 4268001.03 frames. 
], batch size: 371, lr: 2.93e-02, grad_scale: 32.0 2023-06-18 01:56:58,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=96000.0, ans=0.125 2023-06-18 01:57:52,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 3.630e+02 4.249e+02 5.497e+02 1.232e+03, threshold=8.498e+02, percent-clipped=2.0 2023-06-18 01:58:06,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=96180.0, ans=0.2 2023-06-18 01:58:12,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=96180.0, ans=0.07 2023-06-18 01:58:25,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=96240.0, ans=0.125 2023-06-18 01:58:30,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=96240.0, ans=0.0 2023-06-18 01:58:35,953 INFO [train.py:996] (0/4) Epoch 1, batch 16050, loss[loss=0.35, simple_loss=0.4108, pruned_loss=0.1446, over 21620.00 frames. ], tot_loss[loss=0.342, simple_loss=0.387, pruned_loss=0.1485, over 4268284.29 frames. ], batch size: 263, lr: 2.93e-02, grad_scale: 32.0 2023-06-18 01:58:39,763 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:58:51,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=96300.0, ans=0.1 2023-06-18 02:00:01,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=96540.0, ans=0.125 2023-06-18 02:00:17,297 INFO [train.py:996] (0/4) Epoch 1, batch 16100, loss[loss=0.3889, simple_loss=0.4143, pruned_loss=0.1817, over 21985.00 frames. ], tot_loss[loss=0.3477, simple_loss=0.393, pruned_loss=0.1512, over 4274554.99 frames. ], batch size: 113, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 02:00:59,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=96720.0, ans=0.125 2023-06-18 02:01:13,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=96720.0, ans=0.125 2023-06-18 02:01:14,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 3.670e+02 4.719e+02 5.843e+02 1.104e+03, threshold=9.438e+02, percent-clipped=5.0 2023-06-18 02:01:58,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=96900.0, ans=0.0 2023-06-18 02:01:59,169 INFO [train.py:996] (0/4) Epoch 1, batch 16150, loss[loss=0.3949, simple_loss=0.4265, pruned_loss=0.1817, over 21906.00 frames. ], tot_loss[loss=0.3538, simple_loss=0.3964, pruned_loss=0.1556, over 4286507.71 frames. ], batch size: 371, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 02:02:54,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-18 02:03:45,804 INFO [train.py:996] (0/4) Epoch 1, batch 16200, loss[loss=0.3587, simple_loss=0.4171, pruned_loss=0.1502, over 21720.00 frames. ], tot_loss[loss=0.3586, simple_loss=0.4018, pruned_loss=0.1577, over 4287413.59 frames. 
], batch size: 298, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 02:04:00,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=97200.0, ans=0.125 2023-06-18 02:04:10,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=97200.0, ans=0.2 2023-06-18 02:04:26,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=97260.0, ans=0.07 2023-06-18 02:04:49,175 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.675e+02 4.045e+02 5.128e+02 6.271e+02 1.195e+03, threshold=1.026e+03, percent-clipped=3.0 2023-06-18 02:04:56,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=97380.0, ans=0.125 2023-06-18 02:05:06,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=97380.0, ans=0.1 2023-06-18 02:05:11,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=97440.0, ans=0.025 2023-06-18 02:05:16,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=97440.0, ans=0.125 2023-06-18 02:05:34,474 INFO [train.py:996] (0/4) Epoch 1, batch 16250, loss[loss=0.2378, simple_loss=0.2969, pruned_loss=0.08934, over 21452.00 frames. ], tot_loss[loss=0.3586, simple_loss=0.4004, pruned_loss=0.1584, over 4285430.16 frames. ], batch size: 195, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 02:05:42,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=97500.0, ans=0.125 2023-06-18 02:06:38,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=97680.0, ans=0.2 2023-06-18 02:06:56,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=97740.0, ans=0.125 2023-06-18 02:07:23,966 INFO [train.py:996] (0/4) Epoch 1, batch 16300, loss[loss=0.3747, simple_loss=0.4069, pruned_loss=0.1713, over 20062.00 frames. ], tot_loss[loss=0.3512, simple_loss=0.3946, pruned_loss=0.1539, over 4282049.75 frames. ], batch size: 702, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 02:07:49,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=97860.0, ans=0.0 2023-06-18 02:07:50,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97860.0, ans=0.1 2023-06-18 02:08:17,070 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 3.435e+02 4.309e+02 5.263e+02 1.274e+03, threshold=8.618e+02, percent-clipped=4.0 2023-06-18 02:09:08,888 INFO [train.py:996] (0/4) Epoch 1, batch 16350, loss[loss=0.4447, simple_loss=0.4651, pruned_loss=0.2121, over 21790.00 frames. ], tot_loss[loss=0.3554, simple_loss=0.3966, pruned_loss=0.1571, over 4277972.14 frames. 
], batch size: 441, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 02:09:32,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=98160.0, ans=0.125 2023-06-18 02:10:51,214 INFO [train.py:996] (0/4) Epoch 1, batch 16400, loss[loss=0.4018, simple_loss=0.4289, pruned_loss=0.1873, over 21859.00 frames. ], tot_loss[loss=0.3584, simple_loss=0.4, pruned_loss=0.1584, over 4284590.23 frames. ], batch size: 414, lr: 2.90e-02, grad_scale: 32.0 2023-06-18 02:11:01,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=98400.0, ans=0.0 2023-06-18 02:11:12,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=98460.0, ans=0.125 2023-06-18 02:11:47,766 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.785e+02 3.743e+02 5.331e+02 6.637e+02 1.239e+03, threshold=1.066e+03, percent-clipped=10.0 2023-06-18 02:12:07,462 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-18 02:12:33,638 INFO [train.py:996] (0/4) Epoch 1, batch 16450, loss[loss=0.3749, simple_loss=0.4054, pruned_loss=0.1722, over 21878.00 frames. ], tot_loss[loss=0.3589, simple_loss=0.3991, pruned_loss=0.1593, over 4289785.32 frames. ], batch size: 351, lr: 2.90e-02, grad_scale: 32.0 2023-06-18 02:12:54,298 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0 2023-06-18 02:13:15,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=98820.0, ans=0.2 2023-06-18 02:14:17,030 INFO [train.py:996] (0/4) Epoch 1, batch 16500, loss[loss=0.3487, simple_loss=0.3861, pruned_loss=0.1556, over 21115.00 frames. ], tot_loss[loss=0.3551, simple_loss=0.3948, pruned_loss=0.1577, over 4289985.45 frames. ], batch size: 607, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 02:14:38,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=99060.0, ans=15.0 2023-06-18 02:15:16,397 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.870e+02 4.027e+02 4.822e+02 5.863e+02 1.078e+03, threshold=9.645e+02, percent-clipped=1.0 2023-06-18 02:15:43,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=99240.0, ans=0.125 2023-06-18 02:15:43,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=99240.0, ans=0.125 2023-06-18 02:15:50,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=99240.0, ans=0.1 2023-06-18 02:15:52,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.06 vs. limit=8.0 2023-06-18 02:16:01,049 INFO [train.py:996] (0/4) Epoch 1, batch 16550, loss[loss=0.4053, simple_loss=0.4399, pruned_loss=0.1854, over 21720.00 frames. ], tot_loss[loss=0.3547, simple_loss=0.3961, pruned_loss=0.1566, over 4278500.56 frames. 
], batch size: 351, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 02:16:06,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=99300.0, ans=0.0 2023-06-18 02:16:45,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=99420.0, ans=0.2 2023-06-18 02:17:08,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=99420.0, ans=0.125 2023-06-18 02:17:16,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=99480.0, ans=0.125 2023-06-18 02:17:45,787 INFO [train.py:996] (0/4) Epoch 1, batch 16600, loss[loss=0.3848, simple_loss=0.4353, pruned_loss=0.1671, over 21637.00 frames. ], tot_loss[loss=0.3646, simple_loss=0.406, pruned_loss=0.1616, over 4282130.04 frames. ], batch size: 263, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 02:17:59,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=99600.0, ans=0.0 2023-06-18 02:18:11,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=99660.0, ans=0.0 2023-06-18 02:18:51,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.72 vs. limit=15.0 2023-06-18 02:18:55,033 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.830e+02 4.483e+02 5.513e+02 7.810e+02 1.353e+03, threshold=1.103e+03, percent-clipped=10.0 2023-06-18 02:19:16,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=99840.0, ans=0.0 2023-06-18 02:19:40,962 INFO [train.py:996] (0/4) Epoch 1, batch 16650, loss[loss=0.3937, simple_loss=0.4235, pruned_loss=0.182, over 21786.00 frames. ], tot_loss[loss=0.3738, simple_loss=0.4175, pruned_loss=0.1651, over 4274895.11 frames. ], batch size: 298, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 02:19:43,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=99900.0, ans=0.1 2023-06-18 02:19:54,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=99900.0, ans=0.1 2023-06-18 02:20:22,214 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:21:01,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=100080.0, ans=0.125 2023-06-18 02:21:27,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=100200.0, ans=0.125 2023-06-18 02:21:28,232 INFO [train.py:996] (0/4) Epoch 1, batch 16700, loss[loss=0.3681, simple_loss=0.4195, pruned_loss=0.1584, over 21659.00 frames. ], tot_loss[loss=0.3731, simple_loss=0.4164, pruned_loss=0.1648, over 4274725.23 frames. 
], batch size: 414, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 02:21:35,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=100200.0, ans=0.125 2023-06-18 02:22:34,504 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.534e+02 4.065e+02 5.046e+02 6.706e+02 1.129e+03, threshold=1.009e+03, percent-clipped=1.0 2023-06-18 02:22:57,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=100440.0, ans=0.125 2023-06-18 02:23:27,874 INFO [train.py:996] (0/4) Epoch 1, batch 16750, loss[loss=0.5016, simple_loss=0.5274, pruned_loss=0.2379, over 21404.00 frames. ], tot_loss[loss=0.3767, simple_loss=0.4187, pruned_loss=0.1674, over 4271944.92 frames. ], batch size: 507, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 02:23:48,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=100560.0, ans=0.125 2023-06-18 02:23:54,993 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:24:04,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=100560.0, ans=0.125 2023-06-18 02:24:11,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=100620.0, ans=0.125 2023-06-18 02:24:20,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=12.0 2023-06-18 02:24:20,546 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.79 vs. limit=12.0 2023-06-18 02:25:01,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=100740.0, ans=0.125 2023-06-18 02:25:11,929 INFO [train.py:996] (0/4) Epoch 1, batch 16800, loss[loss=0.2243, simple_loss=0.2472, pruned_loss=0.1007, over 16800.00 frames. ], tot_loss[loss=0.378, simple_loss=0.4215, pruned_loss=0.1673, over 4268988.65 frames. ], batch size: 61, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 02:25:58,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-18 02:26:08,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.594e+02 4.332e+02 5.464e+02 7.061e+02 1.204e+03, threshold=1.093e+03, percent-clipped=8.0 2023-06-18 02:26:26,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=100980.0, ans=0.125 2023-06-18 02:26:32,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=100980.0, ans=0.125 2023-06-18 02:26:54,566 INFO [train.py:996] (0/4) Epoch 1, batch 16850, loss[loss=0.3486, simple_loss=0.3935, pruned_loss=0.1519, over 21906.00 frames. ], tot_loss[loss=0.3768, simple_loss=0.4186, pruned_loss=0.1675, over 4279494.66 frames. 
], batch size: 118, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 02:27:06,724 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:27:42,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=101220.0, ans=0.05 2023-06-18 02:28:06,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=101280.0, ans=0.125 2023-06-18 02:28:23,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=101340.0, ans=0.2 2023-06-18 02:28:34,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=101340.0, ans=0.2 2023-06-18 02:28:37,473 INFO [train.py:996] (0/4) Epoch 1, batch 16900, loss[loss=0.3807, simple_loss=0.4035, pruned_loss=0.179, over 21177.00 frames. ], tot_loss[loss=0.37, simple_loss=0.4108, pruned_loss=0.1646, over 4285179.99 frames. ], batch size: 607, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 02:28:49,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=101400.0, ans=0.2 2023-06-18 02:29:39,441 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.564e+02 4.278e+02 5.386e+02 1.254e+03, threshold=8.556e+02, percent-clipped=1.0 2023-06-18 02:30:17,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-18 02:30:19,207 INFO [train.py:996] (0/4) Epoch 1, batch 16950, loss[loss=0.396, simple_loss=0.4133, pruned_loss=0.1893, over 21597.00 frames. ], tot_loss[loss=0.3627, simple_loss=0.4025, pruned_loss=0.1615, over 4287081.08 frames. ], batch size: 471, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 02:30:32,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=101700.0, ans=0.125 2023-06-18 02:32:00,707 INFO [train.py:996] (0/4) Epoch 1, batch 17000, loss[loss=0.3347, simple_loss=0.3715, pruned_loss=0.1489, over 21857.00 frames. ], tot_loss[loss=0.3601, simple_loss=0.398, pruned_loss=0.161, over 4287059.63 frames. ], batch size: 298, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 02:32:48,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102120.0, ans=0.1 2023-06-18 02:33:09,874 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 3.976e+02 5.576e+02 8.538e+02 1.340e+03, threshold=1.115e+03, percent-clipped=23.0 2023-06-18 02:33:10,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=102180.0, ans=0.125 2023-06-18 02:33:34,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=102240.0, ans=0.125 2023-06-18 02:33:44,522 INFO [train.py:996] (0/4) Epoch 1, batch 17050, loss[loss=0.4444, simple_loss=0.4788, pruned_loss=0.205, over 21549.00 frames. ], tot_loss[loss=0.3681, simple_loss=0.4059, pruned_loss=0.1651, over 4290459.34 frames. 
], batch size: 471, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 02:34:12,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=102360.0, ans=0.125 2023-06-18 02:34:39,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=102420.0, ans=0.0 2023-06-18 02:35:26,150 INFO [train.py:996] (0/4) Epoch 1, batch 17100, loss[loss=0.3667, simple_loss=0.3952, pruned_loss=0.1691, over 21932.00 frames. ], tot_loss[loss=0.3669, simple_loss=0.4045, pruned_loss=0.1646, over 4296080.40 frames. ], batch size: 333, lr: 2.85e-02, grad_scale: 32.0 2023-06-18 02:35:33,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=102600.0, ans=0.2 2023-06-18 02:35:51,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=102660.0, ans=0.2 2023-06-18 02:36:02,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=102660.0, ans=0.2 2023-06-18 02:36:09,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=102720.0, ans=0.015 2023-06-18 02:36:26,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=102720.0, ans=0.05 2023-06-18 02:36:28,705 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.900e+02 4.752e+02 6.955e+02 1.664e+03, threshold=9.503e+02, percent-clipped=6.0 2023-06-18 02:36:44,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=102780.0, ans=0.125 2023-06-18 02:36:55,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102840.0, ans=0.1 2023-06-18 02:37:07,593 INFO [train.py:996] (0/4) Epoch 1, batch 17150, loss[loss=0.4287, simple_loss=0.4313, pruned_loss=0.2131, over 21719.00 frames. ], tot_loss[loss=0.3634, simple_loss=0.3995, pruned_loss=0.1637, over 4300030.36 frames. ], batch size: 508, lr: 2.85e-02, grad_scale: 32.0 2023-06-18 02:37:35,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=102960.0, ans=0.2 2023-06-18 02:38:08,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=103080.0, ans=0.2 2023-06-18 02:38:22,731 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.50 vs. limit=15.0 2023-06-18 02:38:34,783 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.31 vs. limit=15.0 2023-06-18 02:38:52,559 INFO [train.py:996] (0/4) Epoch 1, batch 17200, loss[loss=0.3683, simple_loss=0.406, pruned_loss=0.1653, over 19998.00 frames. ], tot_loss[loss=0.3634, simple_loss=0.4, pruned_loss=0.1634, over 4301193.55 frames. 
], batch size: 702, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 02:39:28,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=103260.0, ans=0.0 2023-06-18 02:39:31,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=103320.0, ans=0.015 2023-06-18 02:39:45,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 3.944e+02 4.838e+02 6.225e+02 9.968e+02, threshold=9.676e+02, percent-clipped=1.0 2023-06-18 02:39:57,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=103380.0, ans=0.125 2023-06-18 02:40:31,384 INFO [train.py:996] (0/4) Epoch 1, batch 17250, loss[loss=0.3423, simple_loss=0.3932, pruned_loss=0.1456, over 21640.00 frames. ], tot_loss[loss=0.369, simple_loss=0.405, pruned_loss=0.1665, over 4301164.11 frames. ], batch size: 263, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 02:40:36,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=103500.0, ans=0.125 2023-06-18 02:41:01,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=103560.0, ans=10.0 2023-06-18 02:41:54,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=103740.0, ans=0.035 2023-06-18 02:42:03,788 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-18 02:42:10,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-18 02:42:10,835 INFO [train.py:996] (0/4) Epoch 1, batch 17300, loss[loss=0.361, simple_loss=0.4024, pruned_loss=0.1598, over 21785.00 frames. ], tot_loss[loss=0.3786, simple_loss=0.4152, pruned_loss=0.1711, over 4293310.13 frames. ], batch size: 282, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 02:42:49,515 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:42:59,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=103920.0, ans=0.125 2023-06-18 02:43:17,411 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.798e+02 4.894e+02 6.344e+02 1.044e+03, threshold=9.789e+02, percent-clipped=2.0 2023-06-18 02:43:34,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=104040.0, ans=0.125 2023-06-18 02:44:02,184 INFO [train.py:996] (0/4) Epoch 1, batch 17350, loss[loss=0.3853, simple_loss=0.4375, pruned_loss=0.1666, over 20636.00 frames. ], tot_loss[loss=0.3788, simple_loss=0.4163, pruned_loss=0.1706, over 4289333.46 frames. 
], batch size: 607, lr: 2.83e-02, grad_scale: 16.0 2023-06-18 02:44:34,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=104160.0, ans=0.0 2023-06-18 02:44:59,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=104220.0, ans=0.0 2023-06-18 02:45:38,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=104340.0, ans=0.0 2023-06-18 02:45:47,333 INFO [train.py:996] (0/4) Epoch 1, batch 17400, loss[loss=0.2324, simple_loss=0.2433, pruned_loss=0.1107, over 16359.00 frames. ], tot_loss[loss=0.3709, simple_loss=0.4112, pruned_loss=0.1653, over 4280000.07 frames. ], batch size: 60, lr: 2.83e-02, grad_scale: 16.0 2023-06-18 02:46:13,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=104460.0, ans=0.07 2023-06-18 02:46:47,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 3.665e+02 5.087e+02 7.278e+02 1.204e+03, threshold=1.017e+03, percent-clipped=4.0 2023-06-18 02:47:29,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=104700.0, ans=0.125 2023-06-18 02:47:30,815 INFO [train.py:996] (0/4) Epoch 1, batch 17450, loss[loss=0.2567, simple_loss=0.3327, pruned_loss=0.09033, over 21808.00 frames. ], tot_loss[loss=0.3626, simple_loss=0.4046, pruned_loss=0.1603, over 4275038.15 frames. ], batch size: 316, lr: 2.83e-02, grad_scale: 16.0 2023-06-18 02:47:42,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=104700.0, ans=0.1 2023-06-18 02:48:08,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=104820.0, ans=0.0 2023-06-18 02:48:17,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-18 02:48:54,518 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:49:07,404 INFO [train.py:996] (0/4) Epoch 1, batch 17500, loss[loss=0.3375, simple_loss=0.3715, pruned_loss=0.1517, over 21864.00 frames. ], tot_loss[loss=0.3519, simple_loss=0.3962, pruned_loss=0.1538, over 4282811.77 frames. 
], batch size: 124, lr: 2.82e-02, grad_scale: 16.0 2023-06-18 02:49:26,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=105000.0, ans=0.125 2023-06-18 02:49:29,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=105060.0, ans=0.125 2023-06-18 02:50:10,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=105120.0, ans=0.125 2023-06-18 02:50:12,882 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 3.016e+02 3.973e+02 5.519e+02 1.327e+03, threshold=7.947e+02, percent-clipped=4.0 2023-06-18 02:50:45,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=105240.0, ans=0.125 2023-06-18 02:50:49,658 INFO [train.py:996] (0/4) Epoch 1, batch 17550, loss[loss=0.2969, simple_loss=0.3668, pruned_loss=0.1135, over 21261.00 frames. ], tot_loss[loss=0.351, simple_loss=0.3966, pruned_loss=0.1527, over 4288802.84 frames. ], batch size: 143, lr: 2.82e-02, grad_scale: 16.0 2023-06-18 02:51:27,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=105420.0, ans=0.125 2023-06-18 02:51:40,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=105420.0, ans=0.04949747468305833 2023-06-18 02:52:32,340 INFO [train.py:996] (0/4) Epoch 1, batch 17600, loss[loss=0.4015, simple_loss=0.4284, pruned_loss=0.1873, over 21693.00 frames. ], tot_loss[loss=0.3502, simple_loss=0.3976, pruned_loss=0.1514, over 4278305.57 frames. ], batch size: 351, lr: 2.82e-02, grad_scale: 32.0 2023-06-18 02:53:04,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=105660.0, ans=0.1 2023-06-18 02:53:10,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=105720.0, ans=0.125 2023-06-18 02:53:29,800 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.74 vs. limit=22.5 2023-06-18 02:53:35,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=105780.0, ans=0.0 2023-06-18 02:53:36,618 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 3.619e+02 4.865e+02 6.950e+02 1.496e+03, threshold=9.730e+02, percent-clipped=22.0 2023-06-18 02:53:45,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=105780.0, ans=0.05 2023-06-18 02:54:19,867 INFO [train.py:996] (0/4) Epoch 1, batch 17650, loss[loss=0.2813, simple_loss=0.336, pruned_loss=0.1133, over 21839.00 frames. ], tot_loss[loss=0.3474, simple_loss=0.3935, pruned_loss=0.1507, over 4279194.78 frames. ], batch size: 317, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 02:56:01,847 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:56:02,790 INFO [train.py:996] (0/4) Epoch 1, batch 17700, loss[loss=0.4454, simple_loss=0.48, pruned_loss=0.2054, over 21437.00 frames. ], tot_loss[loss=0.3416, simple_loss=0.3889, pruned_loss=0.1471, over 4270891.38 frames. 
], batch size: 471, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 02:56:23,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=106260.0, ans=0.0 2023-06-18 02:56:25,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106260.0, ans=0.1 2023-06-18 02:57:07,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.749e+02 4.413e+02 5.536e+02 1.027e+03, threshold=8.827e+02, percent-clipped=1.0 2023-06-18 02:57:45,709 INFO [train.py:996] (0/4) Epoch 1, batch 17750, loss[loss=0.3575, simple_loss=0.4062, pruned_loss=0.1544, over 20644.00 frames. ], tot_loss[loss=0.3537, simple_loss=0.3993, pruned_loss=0.1541, over 4271935.08 frames. ], batch size: 607, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 02:57:49,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=106500.0, ans=0.2 2023-06-18 02:58:50,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=106680.0, ans=0.09899494936611666 2023-06-18 02:59:12,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=106740.0, ans=0.125 2023-06-18 02:59:15,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=106740.0, ans=0.035 2023-06-18 02:59:21,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-18 02:59:30,624 INFO [train.py:996] (0/4) Epoch 1, batch 17800, loss[loss=0.4164, simple_loss=0.4383, pruned_loss=0.1972, over 21292.00 frames. ], tot_loss[loss=0.3555, simple_loss=0.4008, pruned_loss=0.1551, over 4276053.25 frames. ], batch size: 159, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 02:59:38,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=106800.0, ans=0.07 2023-06-18 02:59:41,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=106800.0, ans=0.1 2023-06-18 03:00:08,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=106860.0, ans=0.125 2023-06-18 03:00:41,773 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.616e+02 4.998e+02 5.902e+02 1.082e+03, threshold=9.996e+02, percent-clipped=5.0 2023-06-18 03:00:42,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=106980.0, ans=0.025 2023-06-18 03:01:25,761 INFO [train.py:996] (0/4) Epoch 1, batch 17850, loss[loss=0.3235, simple_loss=0.3711, pruned_loss=0.138, over 21325.00 frames. ], tot_loss[loss=0.3543, simple_loss=0.4006, pruned_loss=0.1541, over 4267054.26 frames. 
], batch size: 176, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 03:01:45,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=107160.0, ans=0.125 2023-06-18 03:01:48,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=107160.0, ans=0.035 2023-06-18 03:02:02,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107160.0, ans=0.1 2023-06-18 03:02:08,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=107220.0, ans=0.0 2023-06-18 03:02:36,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.20 vs. limit=10.0 2023-06-18 03:03:05,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=107340.0, ans=0.02 2023-06-18 03:03:06,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107340.0, ans=0.1 2023-06-18 03:03:11,363 INFO [train.py:996] (0/4) Epoch 1, batch 17900, loss[loss=0.3721, simple_loss=0.4321, pruned_loss=0.1561, over 21631.00 frames. ], tot_loss[loss=0.3612, simple_loss=0.4065, pruned_loss=0.1579, over 4256123.89 frames. ], batch size: 230, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 03:03:38,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=107460.0, ans=0.125 2023-06-18 03:04:10,906 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.685e+02 4.154e+02 5.194e+02 6.786e+02 1.159e+03, threshold=1.039e+03, percent-clipped=5.0 2023-06-18 03:04:32,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107580.0, ans=0.1 2023-06-18 03:04:39,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107640.0, ans=0.1 2023-06-18 03:04:40,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=107640.0, ans=15.0 2023-06-18 03:04:55,234 INFO [train.py:996] (0/4) Epoch 1, batch 17950, loss[loss=0.3199, simple_loss=0.3853, pruned_loss=0.1272, over 21777.00 frames. ], tot_loss[loss=0.3545, simple_loss=0.404, pruned_loss=0.1525, over 4258524.13 frames. ], batch size: 282, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 03:05:07,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=107700.0, ans=0.125 2023-06-18 03:05:16,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=107760.0, ans=0.0 2023-06-18 03:05:16,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=107760.0, ans=0.2 2023-06-18 03:06:38,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.13 vs. limit=15.0 2023-06-18 03:06:38,451 INFO [train.py:996] (0/4) Epoch 1, batch 18000, loss[loss=0.3041, simple_loss=0.338, pruned_loss=0.1351, over 21609.00 frames. ], tot_loss[loss=0.3479, simple_loss=0.3949, pruned_loss=0.1505, over 4260484.58 frames. 
], batch size: 332, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 03:06:38,452 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 03:06:48,132 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.4986, 3.2959, 2.3898, 3.5676], device='cuda:0') 2023-06-18 03:06:57,887 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3324, simple_loss=0.4216, pruned_loss=0.1216, over 1796401.00 frames. 2023-06-18 03:06:57,888 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 03:07:07,686 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:07:45,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=108120.0, ans=0.0 2023-06-18 03:08:03,612 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 3.501e+02 4.751e+02 6.240e+02 1.819e+03, threshold=9.502e+02, percent-clipped=6.0 2023-06-18 03:08:17,979 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=16.61 vs. limit=15.0 2023-06-18 03:08:36,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=108240.0, ans=0.125 2023-06-18 03:08:41,382 INFO [train.py:996] (0/4) Epoch 1, batch 18050, loss[loss=0.3302, simple_loss=0.3667, pruned_loss=0.1468, over 21405.00 frames. ], tot_loss[loss=0.3438, simple_loss=0.3885, pruned_loss=0.1496, over 4268867.99 frames. ], batch size: 194, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 03:08:56,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-18 03:09:14,129 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:09:26,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=108420.0, ans=0.2 2023-06-18 03:09:26,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=108420.0, ans=0.125 2023-06-18 03:09:37,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=108420.0, ans=0.0 2023-06-18 03:09:52,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108480.0, ans=0.1 2023-06-18 03:09:57,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=108480.0, ans=0.125 2023-06-18 03:09:59,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108480.0, ans=0.1 2023-06-18 03:10:03,996 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0 2023-06-18 03:10:20,567 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. 
limit=15.0 2023-06-18 03:10:30,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108540.0, ans=0.1 2023-06-18 03:10:33,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.35 vs. limit=15.0 2023-06-18 03:10:33,512 INFO [train.py:996] (0/4) Epoch 1, batch 18100, loss[loss=0.3994, simple_loss=0.4355, pruned_loss=0.1816, over 21383.00 frames. ], tot_loss[loss=0.3528, simple_loss=0.3967, pruned_loss=0.1544, over 4265895.15 frames. ], batch size: 549, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 03:10:55,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=108660.0, ans=0.125 2023-06-18 03:11:33,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.544e+02 4.148e+02 5.231e+02 6.853e+02 1.250e+03, threshold=1.046e+03, percent-clipped=5.0 2023-06-18 03:11:44,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=108780.0, ans=0.125 2023-06-18 03:12:07,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=108840.0, ans=0.07 2023-06-18 03:12:17,540 INFO [train.py:996] (0/4) Epoch 1, batch 18150, loss[loss=0.4329, simple_loss=0.4903, pruned_loss=0.1878, over 19888.00 frames. ], tot_loss[loss=0.3504, simple_loss=0.3966, pruned_loss=0.1521, over 4265561.91 frames. ], batch size: 702, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 03:12:53,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.26 vs. limit=15.0 2023-06-18 03:13:01,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=109020.0, ans=0.1 2023-06-18 03:13:06,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-18 03:13:29,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=109080.0, ans=0.05 2023-06-18 03:13:32,824 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.52 vs. limit=15.0 2023-06-18 03:13:58,933 INFO [train.py:996] (0/4) Epoch 1, batch 18200, loss[loss=0.2796, simple_loss=0.3334, pruned_loss=0.1129, over 21616.00 frames. ], tot_loss[loss=0.3477, simple_loss=0.3907, pruned_loss=0.1524, over 4262359.31 frames. 
], batch size: 132, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 03:14:00,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=109200.0, ans=0.125 2023-06-18 03:14:52,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=109320.0, ans=0.125 2023-06-18 03:14:56,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 3.703e+02 5.000e+02 6.238e+02 9.945e+02, threshold=1.000e+03, percent-clipped=0.0 2023-06-18 03:15:10,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=109380.0, ans=0.2 2023-06-18 03:15:10,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=109380.0, ans=0.125 2023-06-18 03:15:33,213 INFO [train.py:996] (0/4) Epoch 1, batch 18250, loss[loss=0.289, simple_loss=0.3359, pruned_loss=0.121, over 21484.00 frames. ], tot_loss[loss=0.3354, simple_loss=0.3788, pruned_loss=0.146, over 4258361.00 frames. ], batch size: 212, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 03:15:51,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=109560.0, ans=0.125 2023-06-18 03:16:13,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=109620.0, ans=0.0 2023-06-18 03:17:12,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=109740.0, ans=0.125 2023-06-18 03:17:17,200 INFO [train.py:996] (0/4) Epoch 1, batch 18300, loss[loss=0.3582, simple_loss=0.4299, pruned_loss=0.1432, over 21414.00 frames. ], tot_loss[loss=0.3356, simple_loss=0.3784, pruned_loss=0.1464, over 4246259.12 frames. ], batch size: 211, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 03:17:57,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=109920.0, ans=0.125 2023-06-18 03:18:07,154 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:18:21,688 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.295e+02 3.385e+02 4.237e+02 5.760e+02 9.388e+02, threshold=8.473e+02, percent-clipped=0.0 2023-06-18 03:18:31,394 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.09 vs. limit=15.0 2023-06-18 03:19:00,061 INFO [train.py:996] (0/4) Epoch 1, batch 18350, loss[loss=0.2961, simple_loss=0.3544, pruned_loss=0.1189, over 21394.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.3874, pruned_loss=0.1481, over 4243381.45 frames. ], batch size: 211, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 03:20:28,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=110340.0, ans=0.125 2023-06-18 03:20:44,049 INFO [train.py:996] (0/4) Epoch 1, batch 18400, loss[loss=0.3777, simple_loss=0.4006, pruned_loss=0.1774, over 19979.00 frames. ], tot_loss[loss=0.3379, simple_loss=0.3829, pruned_loss=0.1465, over 4239906.16 frames. 
], batch size: 702, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:20:53,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=22.5 2023-06-18 03:21:39,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=110520.0, ans=0.125 2023-06-18 03:21:48,647 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.132e+02 3.866e+02 5.099e+02 1.496e+03, threshold=7.733e+02, percent-clipped=6.0 2023-06-18 03:21:53,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=110580.0, ans=0.125 2023-06-18 03:22:31,361 INFO [train.py:996] (0/4) Epoch 1, batch 18450, loss[loss=0.4071, simple_loss=0.5055, pruned_loss=0.1544, over 19670.00 frames. ], tot_loss[loss=0.328, simple_loss=0.3766, pruned_loss=0.1397, over 4237728.47 frames. ], batch size: 702, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:22:31,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=110700.0, ans=0.125 2023-06-18 03:22:32,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=110700.0, ans=0.1 2023-06-18 03:23:43,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=110880.0, ans=0.125 2023-06-18 03:24:10,279 INFO [train.py:996] (0/4) Epoch 1, batch 18500, loss[loss=0.2992, simple_loss=0.3845, pruned_loss=0.1069, over 20811.00 frames. ], tot_loss[loss=0.3252, simple_loss=0.3737, pruned_loss=0.1383, over 4248063.29 frames. ], batch size: 608, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:24:17,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=111000.0, ans=0.1 2023-06-18 03:24:40,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=111060.0, ans=0.05 2023-06-18 03:25:14,311 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 3.536e+02 4.866e+02 6.375e+02 1.291e+03, threshold=9.732e+02, percent-clipped=16.0 2023-06-18 03:25:21,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=111180.0, ans=0.04949747468305833 2023-06-18 03:25:31,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=111180.0, ans=0.07 2023-06-18 03:25:48,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=111240.0, ans=0.125 2023-06-18 03:25:52,422 INFO [train.py:996] (0/4) Epoch 1, batch 18550, loss[loss=0.3645, simple_loss=0.4264, pruned_loss=0.1513, over 21478.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3721, pruned_loss=0.1381, over 4248724.35 frames. 
], batch size: 471, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:26:21,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=111360.0, ans=0.05 2023-06-18 03:26:53,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=111420.0, ans=0.125 2023-06-18 03:27:00,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=111480.0, ans=0.125 2023-06-18 03:27:44,745 INFO [train.py:996] (0/4) Epoch 1, batch 18600, loss[loss=0.3401, simple_loss=0.3701, pruned_loss=0.1551, over 20136.00 frames. ], tot_loss[loss=0.3258, simple_loss=0.3715, pruned_loss=0.14, over 4252415.20 frames. ], batch size: 702, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 03:27:48,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=111600.0, ans=0.125 2023-06-18 03:28:36,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=12.0 2023-06-18 03:28:44,876 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.656e+02 4.245e+02 5.529e+02 8.990e+02, threshold=8.491e+02, percent-clipped=0.0 2023-06-18 03:28:45,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=111780.0, ans=0.04949747468305833 2023-06-18 03:29:15,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=22.5 2023-06-18 03:29:27,690 INFO [train.py:996] (0/4) Epoch 1, batch 18650, loss[loss=0.2823, simple_loss=0.3221, pruned_loss=0.1212, over 21764.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3694, pruned_loss=0.1392, over 4260847.05 frames. ], batch size: 124, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 03:29:34,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=111900.0, ans=0.1 2023-06-18 03:29:36,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=111900.0, ans=0.0 2023-06-18 03:31:04,857 INFO [train.py:996] (0/4) Epoch 1, batch 18700, loss[loss=0.3187, simple_loss=0.3437, pruned_loss=0.1469, over 21429.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.366, pruned_loss=0.1405, over 4257452.37 frames. ], batch size: 194, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 03:31:30,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=21.08 vs. limit=15.0 2023-06-18 03:31:33,056 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:31:59,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.50 vs. limit=15.0 2023-06-18 03:32:03,403 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.596e+02 4.888e+02 6.142e+02 1.184e+03, threshold=9.776e+02, percent-clipped=7.0 2023-06-18 03:32:16,784 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.10 vs. 
limit=15.0 2023-06-18 03:32:39,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=112440.0, ans=0.125 2023-06-18 03:32:47,390 INFO [train.py:996] (0/4) Epoch 1, batch 18750, loss[loss=0.3433, simple_loss=0.3806, pruned_loss=0.153, over 21834.00 frames. ], tot_loss[loss=0.3281, simple_loss=0.3683, pruned_loss=0.144, over 4259685.00 frames. ], batch size: 124, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 03:33:22,523 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.37 vs. limit=22.5 2023-06-18 03:33:40,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=112620.0, ans=0.0 2023-06-18 03:33:52,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=112680.0, ans=0.125 2023-06-18 03:34:20,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=112740.0, ans=0.0 2023-06-18 03:34:31,356 INFO [train.py:996] (0/4) Epoch 1, batch 18800, loss[loss=0.245, simple_loss=0.3062, pruned_loss=0.09192, over 21192.00 frames. ], tot_loss[loss=0.3315, simple_loss=0.3735, pruned_loss=0.1448, over 4254410.73 frames. ], batch size: 143, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 03:35:05,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=112860.0, ans=0.125 2023-06-18 03:35:38,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 3.151e+02 4.208e+02 5.876e+02 1.169e+03, threshold=8.416e+02, percent-clipped=1.0 2023-06-18 03:35:38,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=112980.0, ans=0.0 2023-06-18 03:36:04,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=113040.0, ans=0.125 2023-06-18 03:36:15,769 INFO [train.py:996] (0/4) Epoch 1, batch 18850, loss[loss=0.2972, simple_loss=0.3426, pruned_loss=0.1259, over 21694.00 frames. ], tot_loss[loss=0.322, simple_loss=0.3684, pruned_loss=0.1378, over 4265263.67 frames. ], batch size: 333, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 03:36:28,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=113100.0, ans=0.0 2023-06-18 03:36:39,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=113160.0, ans=0.125 2023-06-18 03:37:18,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-18 03:37:57,707 INFO [train.py:996] (0/4) Epoch 1, batch 18900, loss[loss=0.317, simple_loss=0.3591, pruned_loss=0.1374, over 21729.00 frames. ], tot_loss[loss=0.3212, simple_loss=0.3656, pruned_loss=0.1384, over 4266284.74 frames. 
], batch size: 112, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 03:38:18,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=113460.0, ans=0.125 2023-06-18 03:38:24,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=113460.0, ans=0.125 2023-06-18 03:38:57,661 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.397e+02 4.704e+02 6.198e+02 1.365e+03, threshold=9.409e+02, percent-clipped=10.0 2023-06-18 03:39:10,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=113580.0, ans=0.125 2023-06-18 03:39:48,148 INFO [train.py:996] (0/4) Epoch 1, batch 18950, loss[loss=0.3939, simple_loss=0.4388, pruned_loss=0.1745, over 21812.00 frames. ], tot_loss[loss=0.3283, simple_loss=0.3697, pruned_loss=0.1435, over 4270793.09 frames. ], batch size: 351, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 03:41:07,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=113880.0, ans=0.1 2023-06-18 03:41:31,613 INFO [train.py:996] (0/4) Epoch 1, batch 19000, loss[loss=0.3758, simple_loss=0.4464, pruned_loss=0.1526, over 21504.00 frames. ], tot_loss[loss=0.3364, simple_loss=0.3808, pruned_loss=0.146, over 4272934.24 frames. ], batch size: 473, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 03:41:48,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=114060.0, ans=0.02 2023-06-18 03:42:36,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=114180.0, ans=0.0 2023-06-18 03:42:37,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.785e+02 4.163e+02 4.936e+02 6.528e+02 1.667e+03, threshold=9.873e+02, percent-clipped=8.0 2023-06-18 03:42:37,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=114180.0, ans=0.95 2023-06-18 03:42:37,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=114180.0, ans=0.0 2023-06-18 03:42:56,030 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.91 vs. limit=6.0 2023-06-18 03:43:15,067 INFO [train.py:996] (0/4) Epoch 1, batch 19050, loss[loss=0.3342, simple_loss=0.3666, pruned_loss=0.1509, over 21304.00 frames. ], tot_loss[loss=0.3461, simple_loss=0.3876, pruned_loss=0.1523, over 4279673.15 frames. 
], batch size: 159, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 03:43:19,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=114300.0, ans=0.2 2023-06-18 03:43:20,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=114300.0, ans=0.2 2023-06-18 03:44:31,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=114480.0, ans=0.125 2023-06-18 03:44:31,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=114480.0, ans=0.025 2023-06-18 03:44:41,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.02 vs. limit=10.0 2023-06-18 03:44:58,940 INFO [train.py:996] (0/4) Epoch 1, batch 19100, loss[loss=0.3579, simple_loss=0.3785, pruned_loss=0.1687, over 21727.00 frames. ], tot_loss[loss=0.3478, simple_loss=0.3865, pruned_loss=0.1546, over 4282187.47 frames. ], batch size: 316, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 03:45:55,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=114720.0, ans=0.125 2023-06-18 03:46:04,827 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.557e+02 3.849e+02 4.992e+02 6.577e+02 2.048e+03, threshold=9.985e+02, percent-clipped=3.0 2023-06-18 03:46:09,201 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=7.900e-03 2023-06-18 03:46:16,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=114780.0, ans=0.125 2023-06-18 03:46:31,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=114840.0, ans=0.1 2023-06-18 03:46:43,853 INFO [train.py:996] (0/4) Epoch 1, batch 19150, loss[loss=0.3866, simple_loss=0.4445, pruned_loss=0.1644, over 21787.00 frames. ], tot_loss[loss=0.3512, simple_loss=0.3904, pruned_loss=0.156, over 4285681.04 frames. ], batch size: 282, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 03:47:17,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=114960.0, ans=0.125 2023-06-18 03:47:28,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=114960.0, ans=0.125 2023-06-18 03:47:34,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=115020.0, ans=0.2 2023-06-18 03:48:05,719 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:48:29,969 INFO [train.py:996] (0/4) Epoch 1, batch 19200, loss[loss=0.4234, simple_loss=0.4397, pruned_loss=0.2036, over 20058.00 frames. ], tot_loss[loss=0.3557, simple_loss=0.3995, pruned_loss=0.156, over 4275459.71 frames. ], batch size: 702, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 03:49:01,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=115260.0, ans=0.125 2023-06-18 03:49:11,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.06 vs. 
limit=22.5 2023-06-18 03:49:14,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=115320.0, ans=0.0 2023-06-18 03:49:16,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=115320.0, ans=0.1 2023-06-18 03:49:20,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115320.0, ans=0.1 2023-06-18 03:49:29,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=115380.0, ans=0.2 2023-06-18 03:49:30,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.474e+02 4.215e+02 5.397e+02 9.229e+02, threshold=8.431e+02, percent-clipped=0.0 2023-06-18 03:50:08,871 INFO [train.py:996] (0/4) Epoch 1, batch 19250, loss[loss=0.3131, simple_loss=0.3831, pruned_loss=0.1215, over 21660.00 frames. ], tot_loss[loss=0.3471, simple_loss=0.3981, pruned_loss=0.148, over 4272423.55 frames. ], batch size: 441, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 03:51:14,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=115680.0, ans=0.2 2023-06-18 03:51:33,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=115680.0, ans=0.2 2023-06-18 03:51:52,358 INFO [train.py:996] (0/4) Epoch 1, batch 19300, loss[loss=0.4276, simple_loss=0.4452, pruned_loss=0.205, over 21611.00 frames. ], tot_loss[loss=0.347, simple_loss=0.3958, pruned_loss=0.1491, over 4276360.16 frames. ], batch size: 508, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 03:52:31,143 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.29 vs. limit=15.0 2023-06-18 03:53:02,612 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 3.240e+02 4.224e+02 5.313e+02 1.250e+03, threshold=8.447e+02, percent-clipped=7.0 2023-06-18 03:53:13,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=115980.0, ans=0.1 2023-06-18 03:53:37,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=116040.0, ans=10.0 2023-06-18 03:53:41,294 INFO [train.py:996] (0/4) Epoch 1, batch 19350, loss[loss=0.2625, simple_loss=0.3081, pruned_loss=0.1085, over 21838.00 frames. ], tot_loss[loss=0.3353, simple_loss=0.3863, pruned_loss=0.1422, over 4276606.85 frames. ], batch size: 107, lr: 2.71e-02, grad_scale: 64.0 2023-06-18 03:54:04,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=116100.0, ans=15.0 2023-06-18 03:54:45,976 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.24 vs. limit=22.5 2023-06-18 03:55:04,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=116340.0, ans=0.125 2023-06-18 03:55:24,582 INFO [train.py:996] (0/4) Epoch 1, batch 19400, loss[loss=0.2886, simple_loss=0.3374, pruned_loss=0.1199, over 21202.00 frames. ], tot_loss[loss=0.3324, simple_loss=0.3837, pruned_loss=0.1405, over 4272998.54 frames. 
], batch size: 159, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 03:55:54,044 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.78 vs. limit=22.5 2023-06-18 03:56:29,458 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.749e+02 4.636e+02 5.829e+02 1.066e+03, threshold=9.272e+02, percent-clipped=6.0 2023-06-18 03:56:31,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=116580.0, ans=0.125 2023-06-18 03:56:41,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116580.0, ans=0.1 2023-06-18 03:57:05,456 INFO [train.py:996] (0/4) Epoch 1, batch 19450, loss[loss=0.3302, simple_loss=0.3618, pruned_loss=0.1493, over 21795.00 frames. ], tot_loss[loss=0.3364, simple_loss=0.383, pruned_loss=0.1449, over 4278609.91 frames. ], batch size: 351, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 03:58:42,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=116940.0, ans=0.025 2023-06-18 03:58:56,299 INFO [train.py:996] (0/4) Epoch 1, batch 19500, loss[loss=0.2876, simple_loss=0.3291, pruned_loss=0.1231, over 21201.00 frames. ], tot_loss[loss=0.3358, simple_loss=0.3782, pruned_loss=0.1467, over 4276777.90 frames. ], batch size: 176, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 03:59:13,399 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:59:57,085 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 3.820e+02 4.726e+02 6.793e+02 1.461e+03, threshold=9.451e+02, percent-clipped=7.0 2023-06-18 04:00:00,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=117180.0, ans=0.0 2023-06-18 04:00:15,862 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-18 04:00:32,889 INFO [train.py:996] (0/4) Epoch 1, batch 19550, loss[loss=0.3019, simple_loss=0.37, pruned_loss=0.1169, over 21569.00 frames. ], tot_loss[loss=0.3293, simple_loss=0.3725, pruned_loss=0.143, over 4273925.55 frames. ], batch size: 230, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 04:01:09,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=117360.0, ans=0.125 2023-06-18 04:02:09,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=117540.0, ans=0.2 2023-06-18 04:02:13,867 INFO [train.py:996] (0/4) Epoch 1, batch 19600, loss[loss=0.356, simple_loss=0.3938, pruned_loss=0.159, over 21876.00 frames. ], tot_loss[loss=0.3338, simple_loss=0.3758, pruned_loss=0.146, over 4285163.00 frames. 
], batch size: 107, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 04:02:23,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=117600.0, ans=0.0 2023-06-18 04:03:14,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=117720.0, ans=0.05 2023-06-18 04:03:20,212 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.478e+02 4.292e+02 5.648e+02 1.125e+03, threshold=8.585e+02, percent-clipped=2.0 2023-06-18 04:04:03,646 INFO [train.py:996] (0/4) Epoch 1, batch 19650, loss[loss=0.3362, simple_loss=0.3772, pruned_loss=0.1476, over 21859.00 frames. ], tot_loss[loss=0.3455, simple_loss=0.3846, pruned_loss=0.1532, over 4288659.72 frames. ], batch size: 282, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 04:04:30,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=117960.0, ans=0.125 2023-06-18 04:04:44,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=117960.0, ans=0.2 2023-06-18 04:05:36,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=118140.0, ans=0.0 2023-06-18 04:05:52,072 INFO [train.py:996] (0/4) Epoch 1, batch 19700, loss[loss=0.3387, simple_loss=0.3994, pruned_loss=0.139, over 21784.00 frames. ], tot_loss[loss=0.3473, simple_loss=0.3881, pruned_loss=0.1532, over 4281498.42 frames. ], batch size: 316, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:06:08,304 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-18 04:06:59,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.705e+02 3.771e+02 4.552e+02 5.763e+02 1.165e+03, threshold=9.104e+02, percent-clipped=3.0 2023-06-18 04:07:03,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118380.0, ans=0.1 2023-06-18 04:07:09,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=118380.0, ans=0.125 2023-06-18 04:07:30,623 INFO [train.py:996] (0/4) Epoch 1, batch 19750, loss[loss=0.3532, simple_loss=0.4078, pruned_loss=0.1492, over 21486.00 frames. ], tot_loss[loss=0.3526, simple_loss=0.397, pruned_loss=0.1541, over 4274977.53 frames. ], batch size: 211, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:07:59,081 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=15.0 2023-06-18 04:08:05,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-18 04:08:40,808 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:08:46,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118680.0, ans=0.1 2023-06-18 04:09:17,506 INFO [train.py:996] (0/4) Epoch 1, batch 19800, loss[loss=0.2644, simple_loss=0.3134, pruned_loss=0.1077, over 21280.00 frames. ], tot_loss[loss=0.3554, simple_loss=0.3982, pruned_loss=0.1562, over 4274683.55 frames. 
], batch size: 176, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:09:18,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.61 vs. limit=22.5 2023-06-18 04:09:21,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=118800.0, ans=0.0 2023-06-18 04:10:11,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118920.0, ans=0.1 2023-06-18 04:10:23,932 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.634e+02 4.450e+02 5.874e+02 9.997e+02, threshold=8.899e+02, percent-clipped=2.0 2023-06-18 04:10:41,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=119040.0, ans=0.0 2023-06-18 04:10:51,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=119040.0, ans=0.0 2023-06-18 04:10:51,947 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=15.0 2023-06-18 04:10:55,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=119040.0, ans=0.0 2023-06-18 04:11:00,161 INFO [train.py:996] (0/4) Epoch 1, batch 19850, loss[loss=0.2499, simple_loss=0.3272, pruned_loss=0.08629, over 21710.00 frames. ], tot_loss[loss=0.3398, simple_loss=0.386, pruned_loss=0.1468, over 4271238.50 frames. ], batch size: 332, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:12:13,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.78 vs. limit=15.0 2023-06-18 04:12:41,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=119340.0, ans=0.125 2023-06-18 04:12:45,782 INFO [train.py:996] (0/4) Epoch 1, batch 19900, loss[loss=0.3139, simple_loss=0.3688, pruned_loss=0.1294, over 21761.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3866, pruned_loss=0.1439, over 4269543.94 frames. ], batch size: 316, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 04:13:09,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=119460.0, ans=0.0 2023-06-18 04:13:51,582 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.203e+02 3.596e+02 4.410e+02 6.393e+02 1.239e+03, threshold=8.821e+02, percent-clipped=7.0 2023-06-18 04:13:59,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=119580.0, ans=15.0 2023-06-18 04:14:27,874 INFO [train.py:996] (0/4) Epoch 1, batch 19950, loss[loss=0.3364, simple_loss=0.3712, pruned_loss=0.1508, over 21404.00 frames. ], tot_loss[loss=0.3335, simple_loss=0.3799, pruned_loss=0.1435, over 4261663.16 frames. ], batch size: 194, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 04:14:30,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=119700.0, ans=0.0 2023-06-18 04:14:47,844 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.78 vs. 
limit=10.0 2023-06-18 04:15:14,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=119760.0, ans=0.0 2023-06-18 04:15:53,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=119940.0, ans=0.95 2023-06-18 04:15:56,339 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=12.0 2023-06-18 04:16:09,645 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-20000.pt 2023-06-18 04:16:12,232 INFO [train.py:996] (0/4) Epoch 1, batch 20000, loss[loss=0.363, simple_loss=0.4205, pruned_loss=0.1527, over 21593.00 frames. ], tot_loss[loss=0.3331, simple_loss=0.3801, pruned_loss=0.1431, over 4251801.07 frames. ], batch size: 389, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 04:16:47,595 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:17:17,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=120180.0, ans=0.07 2023-06-18 04:17:18,792 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.616e+02 4.426e+02 6.098e+02 1.164e+03, threshold=8.852e+02, percent-clipped=3.0 2023-06-18 04:17:40,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=120240.0, ans=0.1 2023-06-18 04:17:43,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=120240.0, ans=0.125 2023-06-18 04:17:52,859 INFO [train.py:996] (0/4) Epoch 1, batch 20050, loss[loss=0.3693, simple_loss=0.406, pruned_loss=0.1663, over 21840.00 frames. ], tot_loss[loss=0.3387, simple_loss=0.3834, pruned_loss=0.147, over 4266784.49 frames. ], batch size: 282, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 04:17:53,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=120300.0, ans=0.125 2023-06-18 04:18:02,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=120300.0, ans=0.0 2023-06-18 04:18:29,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.64 vs. limit=6.0 2023-06-18 04:18:42,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0 2023-06-18 04:18:52,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=120420.0, ans=0.2 2023-06-18 04:19:15,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=120480.0, ans=0.0 2023-06-18 04:19:29,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=120540.0, ans=0.2 2023-06-18 04:19:37,720 INFO [train.py:996] (0/4) Epoch 1, batch 20100, loss[loss=0.3304, simple_loss=0.385, pruned_loss=0.138, over 21356.00 frames. ], tot_loss[loss=0.3452, simple_loss=0.3878, pruned_loss=0.1513, over 4275151.08 frames. 
], batch size: 211, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 04:20:09,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=120660.0, ans=0.015 2023-06-18 04:20:16,461 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-18 04:20:51,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.978e+02 4.839e+02 6.470e+02 1.176e+03, threshold=9.678e+02, percent-clipped=4.0 2023-06-18 04:21:03,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=120840.0, ans=0.125 2023-06-18 04:21:14,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=120840.0, ans=0.125 2023-06-18 04:21:29,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=120840.0, ans=0.0 2023-06-18 04:21:30,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=120840.0, ans=10.0 2023-06-18 04:21:32,586 INFO [train.py:996] (0/4) Epoch 1, batch 20150, loss[loss=0.4628, simple_loss=0.4775, pruned_loss=0.224, over 21778.00 frames. ], tot_loss[loss=0.358, simple_loss=0.4014, pruned_loss=0.1573, over 4272332.95 frames. ], batch size: 441, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 04:22:14,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=120960.0, ans=0.2 2023-06-18 04:22:16,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=121020.0, ans=0.125 2023-06-18 04:22:34,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=121080.0, ans=0.04949747468305833 2023-06-18 04:22:35,157 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.14 vs. limit=6.0 2023-06-18 04:22:44,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=121080.0, ans=0.125 2023-06-18 04:23:23,954 INFO [train.py:996] (0/4) Epoch 1, batch 20200, loss[loss=0.3616, simple_loss=0.4451, pruned_loss=0.139, over 20765.00 frames. ], tot_loss[loss=0.3635, simple_loss=0.4057, pruned_loss=0.1606, over 4265968.05 frames. ], batch size: 607, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:24:26,096 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.426e+02 4.003e+02 5.201e+02 6.811e+02 1.420e+03, threshold=1.040e+03, percent-clipped=11.0 2023-06-18 04:25:01,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=121440.0, ans=0.125 2023-06-18 04:25:05,815 INFO [train.py:996] (0/4) Epoch 1, batch 20250, loss[loss=0.3315, simple_loss=0.39, pruned_loss=0.1366, over 21768.00 frames. ], tot_loss[loss=0.3599, simple_loss=0.4048, pruned_loss=0.1575, over 4266217.40 frames. 
], batch size: 247, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:25:22,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=121500.0, ans=0.1 2023-06-18 04:25:28,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=121560.0, ans=0.2 2023-06-18 04:26:33,078 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.84 vs. limit=22.5 2023-06-18 04:26:40,873 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.64 vs. limit=6.0 2023-06-18 04:26:47,852 INFO [train.py:996] (0/4) Epoch 1, batch 20300, loss[loss=0.3252, simple_loss=0.3897, pruned_loss=0.1303, over 21756.00 frames. ], tot_loss[loss=0.3512, simple_loss=0.3992, pruned_loss=0.1516, over 4251757.39 frames. ], batch size: 316, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:27:05,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-18 04:27:08,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-18 04:27:24,709 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=12.0 2023-06-18 04:27:55,662 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 3.171e+02 3.718e+02 4.802e+02 8.828e+02, threshold=7.436e+02, percent-clipped=0.0 2023-06-18 04:27:56,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=121980.0, ans=0.125 2023-06-18 04:28:04,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=121980.0, ans=0.0 2023-06-18 04:28:21,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=122040.0, ans=0.1 2023-06-18 04:28:28,665 INFO [train.py:996] (0/4) Epoch 1, batch 20350, loss[loss=0.4184, simple_loss=0.4374, pruned_loss=0.1998, over 21519.00 frames. ], tot_loss[loss=0.3512, simple_loss=0.3983, pruned_loss=0.152, over 4243816.15 frames. 
], batch size: 548, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:28:40,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=122100.0, ans=0.125 2023-06-18 04:28:46,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122100.0, ans=0.1 2023-06-18 04:28:52,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=122160.0, ans=0.0 2023-06-18 04:29:31,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=122280.0, ans=0.125 2023-06-18 04:29:49,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=122340.0, ans=0.125 2023-06-18 04:30:16,439 INFO [train.py:996] (0/4) Epoch 1, batch 20400, loss[loss=0.4214, simple_loss=0.4396, pruned_loss=0.2016, over 21281.00 frames. ], tot_loss[loss=0.3582, simple_loss=0.4031, pruned_loss=0.1566, over 4241268.18 frames. ], batch size: 143, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 04:30:46,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.91 vs. limit=22.5 2023-06-18 04:30:57,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=122520.0, ans=0.0 2023-06-18 04:31:10,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=122580.0, ans=0.125 2023-06-18 04:31:12,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=122580.0, ans=0.125 2023-06-18 04:31:13,484 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 4.014e+02 4.909e+02 5.768e+02 1.154e+03, threshold=9.817e+02, percent-clipped=10.0 2023-06-18 04:31:15,811 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:31:30,819 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:31:53,073 INFO [train.py:996] (0/4) Epoch 1, batch 20450, loss[loss=0.3536, simple_loss=0.3807, pruned_loss=0.1632, over 20809.00 frames. ], tot_loss[loss=0.3622, simple_loss=0.4048, pruned_loss=0.1599, over 4226657.46 frames. ], batch size: 608, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 04:32:16,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=122760.0, ans=0.0 2023-06-18 04:32:28,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=122820.0, ans=0.95 2023-06-18 04:32:36,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=122820.0, ans=0.0 2023-06-18 04:32:50,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=122880.0, ans=0.0 2023-06-18 04:33:34,863 INFO [train.py:996] (0/4) Epoch 1, batch 20500, loss[loss=0.3351, simple_loss=0.36, pruned_loss=0.1551, over 21406.00 frames. ], tot_loss[loss=0.3604, simple_loss=0.4004, pruned_loss=0.1602, over 4229082.17 frames. 
], batch size: 548, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 04:33:36,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.74 vs. limit=22.5 2023-06-18 04:33:51,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=123000.0, ans=0.2 2023-06-18 04:34:07,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=123060.0, ans=0.125 2023-06-18 04:34:43,267 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.920e+02 3.898e+02 4.731e+02 5.915e+02 1.084e+03, threshold=9.462e+02, percent-clipped=4.0 2023-06-18 04:35:23,691 INFO [train.py:996] (0/4) Epoch 1, batch 20550, loss[loss=0.3545, simple_loss=0.4098, pruned_loss=0.1496, over 21860.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3906, pruned_loss=0.157, over 4225425.62 frames. ], batch size: 372, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:35:32,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=123300.0, ans=0.125 2023-06-18 04:35:34,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=123300.0, ans=0.0 2023-06-18 04:35:34,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=123300.0, ans=0.07 2023-06-18 04:35:41,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=123360.0, ans=15.0 2023-06-18 04:35:44,169 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-18 04:35:58,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=123420.0, ans=0.0 2023-06-18 04:36:00,773 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-18 04:36:52,702 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=9.712e-02 2023-06-18 04:37:04,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=123540.0, ans=0.0 2023-06-18 04:37:06,864 INFO [train.py:996] (0/4) Epoch 1, batch 20600, loss[loss=0.3396, simple_loss=0.3801, pruned_loss=0.1495, over 21777.00 frames. ], tot_loss[loss=0.3511, simple_loss=0.3924, pruned_loss=0.1549, over 4245586.70 frames. ], batch size: 247, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:37:12,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=123600.0, ans=0.125 2023-06-18 04:38:09,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 3.464e+02 4.439e+02 5.514e+02 9.400e+02, threshold=8.878e+02, percent-clipped=0.0 2023-06-18 04:38:47,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=123900.0, ans=0.2 2023-06-18 04:38:48,813 INFO [train.py:996] (0/4) Epoch 1, batch 20650, loss[loss=0.3561, simple_loss=0.3781, pruned_loss=0.167, over 21621.00 frames. ], tot_loss[loss=0.3489, simple_loss=0.3883, pruned_loss=0.1548, over 4240900.96 frames. 
], batch size: 414, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:38:52,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=123900.0, ans=0.125 2023-06-18 04:40:00,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=124080.0, ans=0.125 2023-06-18 04:40:31,340 INFO [train.py:996] (0/4) Epoch 1, batch 20700, loss[loss=0.3454, simple_loss=0.3736, pruned_loss=0.1586, over 21533.00 frames. ], tot_loss[loss=0.3387, simple_loss=0.3797, pruned_loss=0.1488, over 4249504.95 frames. ], batch size: 441, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:40:41,057 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=8.596e-02 2023-06-18 04:41:17,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=124320.0, ans=0.125 2023-06-18 04:41:38,283 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.293e+02 3.859e+02 5.120e+02 8.262e+02, threshold=7.718e+02, percent-clipped=0.0 2023-06-18 04:42:12,835 INFO [train.py:996] (0/4) Epoch 1, batch 20750, loss[loss=0.4293, simple_loss=0.4936, pruned_loss=0.1825, over 21673.00 frames. ], tot_loss[loss=0.3367, simple_loss=0.3809, pruned_loss=0.1462, over 4249873.16 frames. ], batch size: 414, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:43:08,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=124620.0, ans=0.125 2023-06-18 04:43:20,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.72 vs. limit=10.0 2023-06-18 04:43:46,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=124740.0, ans=0.5 2023-06-18 04:43:56,110 INFO [train.py:996] (0/4) Epoch 1, batch 20800, loss[loss=0.3353, simple_loss=0.3725, pruned_loss=0.1491, over 21797.00 frames. ], tot_loss[loss=0.3426, simple_loss=0.387, pruned_loss=0.1491, over 4252821.49 frames. ], batch size: 102, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:43:56,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=124800.0, ans=0.125 2023-06-18 04:44:00,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=124800.0, ans=0.0 2023-06-18 04:44:08,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=124800.0, ans=0.2 2023-06-18 04:45:08,184 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.368e+02 3.772e+02 4.526e+02 5.632e+02 1.034e+03, threshold=9.051e+02, percent-clipped=9.0 2023-06-18 04:45:08,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124980.0, ans=0.1 2023-06-18 04:45:22,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=125040.0, ans=0.125 2023-06-18 04:45:32,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.96 vs. 
limit=10.0 2023-06-18 04:45:36,948 INFO [train.py:996] (0/4) Epoch 1, batch 20850, loss[loss=0.3435, simple_loss=0.3809, pruned_loss=0.153, over 21394.00 frames. ], tot_loss[loss=0.3342, simple_loss=0.377, pruned_loss=0.1456, over 4251829.56 frames. ], batch size: 131, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:45:52,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=125160.0, ans=0.125 2023-06-18 04:46:50,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=125280.0, ans=0.2 2023-06-18 04:47:09,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=125340.0, ans=0.125 2023-06-18 04:47:18,197 INFO [train.py:996] (0/4) Epoch 1, batch 20900, loss[loss=0.4037, simple_loss=0.4304, pruned_loss=0.1885, over 21700.00 frames. ], tot_loss[loss=0.3403, simple_loss=0.3816, pruned_loss=0.1495, over 4262755.49 frames. ], batch size: 473, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:48:20,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=125580.0, ans=0.125 2023-06-18 04:48:25,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.309e+02 3.915e+02 5.105e+02 1.001e+03, threshold=7.830e+02, percent-clipped=2.0 2023-06-18 04:48:26,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-18 04:48:27,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=125580.0, ans=0.125 2023-06-18 04:48:29,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-18 04:48:30,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=125580.0, ans=0.125 2023-06-18 04:48:53,599 INFO [train.py:996] (0/4) Epoch 1, batch 20950, loss[loss=0.2665, simple_loss=0.3217, pruned_loss=0.1057, over 21600.00 frames. ], tot_loss[loss=0.3296, simple_loss=0.3746, pruned_loss=0.1423, over 4271606.05 frames. ], batch size: 263, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 04:50:22,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=125940.0, ans=0.125 2023-06-18 04:50:33,091 INFO [train.py:996] (0/4) Epoch 1, batch 21000, loss[loss=0.3794, simple_loss=0.4026, pruned_loss=0.1781, over 21804.00 frames. ], tot_loss[loss=0.3317, simple_loss=0.3754, pruned_loss=0.144, over 4267294.32 frames. ], batch size: 441, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 04:50:33,092 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 04:50:45,390 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.3375, 3.6120, 1.7076, 1.2631], device='cuda:0') 2023-06-18 04:50:48,185 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.5071, 3.7149, 1.7417, 2.2883], device='cuda:0') 2023-06-18 04:50:50,131 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.3151, simple_loss=0.4075, pruned_loss=0.1114, over 1796401.00 frames. 
2023-06-18 04:50:50,131 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 04:52:02,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 3.430e+02 4.586e+02 6.344e+02 1.913e+03, threshold=9.172e+02, percent-clipped=11.0 2023-06-18 04:52:27,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=126240.0, ans=0.125 2023-06-18 04:52:30,660 INFO [train.py:996] (0/4) Epoch 1, batch 21050, loss[loss=0.2243, simple_loss=0.2869, pruned_loss=0.0808, over 15946.00 frames. ], tot_loss[loss=0.3307, simple_loss=0.3732, pruned_loss=0.1441, over 4259300.61 frames. ], batch size: 61, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 04:52:36,893 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=15.0 2023-06-18 04:52:42,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.27 vs. limit=10.0 2023-06-18 04:52:54,645 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=15.0 2023-06-18 04:53:03,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=126360.0, ans=0.2 2023-06-18 04:54:07,586 INFO [train.py:996] (0/4) Epoch 1, batch 21100, loss[loss=0.3095, simple_loss=0.3525, pruned_loss=0.1333, over 21502.00 frames. ], tot_loss[loss=0.327, simple_loss=0.3684, pruned_loss=0.1428, over 4254471.61 frames. ], batch size: 230, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:54:18,450 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-18 04:54:37,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=126660.0, ans=0.125 2023-06-18 04:54:39,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=126660.0, ans=0.125 2023-06-18 04:55:14,651 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.438e+02 4.271e+02 5.279e+02 9.041e+02, threshold=8.542e+02, percent-clipped=0.0 2023-06-18 04:55:18,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=126780.0, ans=0.1 2023-06-18 04:55:19,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=126780.0, ans=0.125 2023-06-18 04:55:23,509 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.39 vs. limit=10.0 2023-06-18 04:55:42,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=126900.0, ans=0.1 2023-06-18 04:55:43,349 INFO [train.py:996] (0/4) Epoch 1, batch 21150, loss[loss=0.2917, simple_loss=0.3283, pruned_loss=0.1275, over 16432.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.3627, pruned_loss=0.142, over 4244974.25 frames. 
], batch size: 66, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:55:52,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=126900.0, ans=0.07 2023-06-18 04:56:26,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=127020.0, ans=0.2 2023-06-18 04:56:56,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=127080.0, ans=0.0 2023-06-18 04:56:56,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-18 04:57:01,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127080.0, ans=0.1 2023-06-18 04:57:20,016 INFO [train.py:996] (0/4) Epoch 1, batch 21200, loss[loss=0.2927, simple_loss=0.3272, pruned_loss=0.1291, over 21260.00 frames. ], tot_loss[loss=0.3207, simple_loss=0.3598, pruned_loss=0.1408, over 4238473.02 frames. ], batch size: 159, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:57:50,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=127260.0, ans=0.0 2023-06-18 04:57:55,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=127260.0, ans=0.125 2023-06-18 04:57:59,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.55 vs. limit=22.5 2023-06-18 04:58:33,963 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 3.663e+02 4.545e+02 5.734e+02 1.350e+03, threshold=9.091e+02, percent-clipped=8.0 2023-06-18 04:58:37,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=127380.0, ans=0.125 2023-06-18 04:59:03,004 INFO [train.py:996] (0/4) Epoch 1, batch 21250, loss[loss=0.3712, simple_loss=0.4083, pruned_loss=0.167, over 21553.00 frames. ], tot_loss[loss=0.3211, simple_loss=0.3591, pruned_loss=0.1415, over 4236649.26 frames. ], batch size: 389, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:59:16,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=127500.0, ans=0.125 2023-06-18 05:00:41,957 INFO [train.py:996] (0/4) Epoch 1, batch 21300, loss[loss=0.3852, simple_loss=0.4213, pruned_loss=0.1745, over 21894.00 frames. ], tot_loss[loss=0.3287, simple_loss=0.3673, pruned_loss=0.145, over 4249359.59 frames. ], batch size: 118, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:00:44,560 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.32 vs. 
limit=22.5 2023-06-18 05:00:58,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=127860.0, ans=0.2 2023-06-18 05:01:54,308 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.515e+02 4.385e+02 5.674e+02 1.308e+03, threshold=8.770e+02, percent-clipped=8.0 2023-06-18 05:02:03,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=127980.0, ans=0.125 2023-06-18 05:02:23,589 INFO [train.py:996] (0/4) Epoch 1, batch 21350, loss[loss=0.2307, simple_loss=0.2786, pruned_loss=0.09142, over 16488.00 frames. ], tot_loss[loss=0.3322, simple_loss=0.3718, pruned_loss=0.1463, over 4251336.48 frames. ], batch size: 62, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:02:49,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=128160.0, ans=0.125 2023-06-18 05:03:46,514 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-18 05:04:06,129 INFO [train.py:996] (0/4) Epoch 1, batch 21400, loss[loss=0.3452, simple_loss=0.3941, pruned_loss=0.1482, over 20599.00 frames. ], tot_loss[loss=0.3332, simple_loss=0.3756, pruned_loss=0.1454, over 4259626.59 frames. ], batch size: 607, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:04:22,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=128400.0, ans=0.2 2023-06-18 05:04:30,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=128460.0, ans=0.0 2023-06-18 05:04:31,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.34 vs. limit=22.5 2023-06-18 05:05:18,998 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 3.244e+02 3.990e+02 4.956e+02 1.756e+03, threshold=7.981e+02, percent-clipped=8.0 2023-06-18 05:05:47,936 INFO [train.py:996] (0/4) Epoch 1, batch 21450, loss[loss=0.4266, simple_loss=0.4409, pruned_loss=0.2061, over 21783.00 frames. ], tot_loss[loss=0.3385, simple_loss=0.3803, pruned_loss=0.1484, over 4271147.12 frames. ], batch size: 441, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:05:48,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.72 vs. 
limit=22.5 2023-06-18 05:06:12,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=128760.0, ans=0.025 2023-06-18 05:06:14,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=128760.0, ans=0.0 2023-06-18 05:06:23,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=128760.0, ans=15.0 2023-06-18 05:07:00,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=128880.0, ans=0.125 2023-06-18 05:07:05,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=128880.0, ans=0.125 2023-06-18 05:07:23,871 INFO [train.py:996] (0/4) Epoch 1, batch 21500, loss[loss=0.3156, simple_loss=0.3558, pruned_loss=0.1377, over 21997.00 frames. ], tot_loss[loss=0.3381, simple_loss=0.3775, pruned_loss=0.1493, over 4270157.51 frames. ], batch size: 103, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 05:07:43,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=129060.0, ans=0.0 2023-06-18 05:07:52,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=129060.0, ans=0.125 2023-06-18 05:08:15,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=129120.0, ans=0.125 2023-06-18 05:08:35,888 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.639e+02 3.287e+02 4.067e+02 5.300e+02 1.405e+03, threshold=8.134e+02, percent-clipped=7.0 2023-06-18 05:08:52,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=129240.0, ans=0.125 2023-06-18 05:08:57,003 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-18 05:08:59,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=129240.0, ans=0.0 2023-06-18 05:09:05,154 INFO [train.py:996] (0/4) Epoch 1, batch 21550, loss[loss=0.2924, simple_loss=0.3382, pruned_loss=0.1233, over 21599.00 frames. ], tot_loss[loss=0.3323, simple_loss=0.3712, pruned_loss=0.1467, over 4256913.56 frames. ], batch size: 391, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 05:09:45,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=129360.0, ans=0.025 2023-06-18 05:10:10,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=129420.0, ans=0.125 2023-06-18 05:10:53,619 INFO [train.py:996] (0/4) Epoch 1, batch 21600, loss[loss=0.3949, simple_loss=0.4653, pruned_loss=0.1623, over 19692.00 frames. ], tot_loss[loss=0.3277, simple_loss=0.367, pruned_loss=0.1442, over 4251560.75 frames. ], batch size: 703, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 05:11:13,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.06 vs. 
limit=15.0 2023-06-18 05:11:24,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=129660.0, ans=0.125 2023-06-18 05:11:42,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=129720.0, ans=0.125 2023-06-18 05:11:49,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=129720.0, ans=0.125 2023-06-18 05:11:53,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=129720.0, ans=0.125 2023-06-18 05:12:01,387 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.331e+02 4.167e+02 5.142e+02 1.133e+03, threshold=8.334e+02, percent-clipped=4.0 2023-06-18 05:12:05,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=129780.0, ans=0.125 2023-06-18 05:12:17,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=129840.0, ans=0.125 2023-06-18 05:12:34,537 INFO [train.py:996] (0/4) Epoch 1, batch 21650, loss[loss=0.4076, simple_loss=0.4636, pruned_loss=0.1758, over 21639.00 frames. ], tot_loss[loss=0.3266, simple_loss=0.3708, pruned_loss=0.1412, over 4257608.29 frames. ], batch size: 441, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 05:13:02,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=129960.0, ans=0.125 2023-06-18 05:13:18,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=129960.0, ans=0.04949747468305833 2023-06-18 05:13:29,995 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-18 05:14:03,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=130140.0, ans=0.1 2023-06-18 05:14:15,286 INFO [train.py:996] (0/4) Epoch 1, batch 21700, loss[loss=0.2917, simple_loss=0.3329, pruned_loss=0.1252, over 21793.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3706, pruned_loss=0.1383, over 4260438.65 frames. ], batch size: 118, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 05:14:28,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=130200.0, ans=12.0 2023-06-18 05:15:16,916 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.479e+02 4.448e+02 5.687e+02 1.020e+03, threshold=8.895e+02, percent-clipped=10.0 2023-06-18 05:15:31,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=130440.0, ans=0.125 2023-06-18 05:15:50,776 INFO [train.py:996] (0/4) Epoch 1, batch 21750, loss[loss=0.337, simple_loss=0.3671, pruned_loss=0.1534, over 21737.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3671, pruned_loss=0.1397, over 4243549.45 frames. 
], batch size: 124, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 05:15:59,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=130500.0, ans=0.125 2023-06-18 05:16:00,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=130500.0, ans=0.125 2023-06-18 05:17:23,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=130740.0, ans=0.125 2023-06-18 05:17:33,737 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=22.5 2023-06-18 05:17:34,239 INFO [train.py:996] (0/4) Epoch 1, batch 21800, loss[loss=0.3572, simple_loss=0.4007, pruned_loss=0.1569, over 21654.00 frames. ], tot_loss[loss=0.3228, simple_loss=0.3652, pruned_loss=0.1402, over 4240443.21 frames. ], batch size: 298, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 05:17:36,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=130800.0, ans=0.125 2023-06-18 05:18:26,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=130920.0, ans=0.04949747468305833 2023-06-18 05:18:26,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=130920.0, ans=0.0 2023-06-18 05:18:38,406 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-18 05:18:42,241 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.714e+02 4.449e+02 6.326e+02 1.060e+03, threshold=8.898e+02, percent-clipped=3.0 2023-06-18 05:19:16,659 INFO [train.py:996] (0/4) Epoch 1, batch 21850, loss[loss=0.4701, simple_loss=0.5083, pruned_loss=0.216, over 19749.00 frames. ], tot_loss[loss=0.3255, simple_loss=0.3698, pruned_loss=0.1406, over 4242531.23 frames. ], batch size: 702, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 05:20:08,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=131220.0, ans=0.125 2023-06-18 05:20:16,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=131220.0, ans=0.0 2023-06-18 05:20:58,035 INFO [train.py:996] (0/4) Epoch 1, batch 21900, loss[loss=0.3764, simple_loss=0.4278, pruned_loss=0.1625, over 21564.00 frames. ], tot_loss[loss=0.3285, simple_loss=0.3722, pruned_loss=0.1424, over 4256128.16 frames. ], batch size: 471, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 05:21:15,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=131400.0, ans=0.125 2023-06-18 05:21:56,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=131580.0, ans=0.125 2023-06-18 05:22:04,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.542e+02 3.404e+02 4.082e+02 5.077e+02 9.199e+02, threshold=8.164e+02, percent-clipped=1.0 2023-06-18 05:22:23,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.93 vs. 
limit=15.0 2023-06-18 05:22:38,139 INFO [train.py:996] (0/4) Epoch 1, batch 21950, loss[loss=0.2377, simple_loss=0.3097, pruned_loss=0.08282, over 21685.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.366, pruned_loss=0.1411, over 4252861.62 frames. ], batch size: 316, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 05:22:38,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=131700.0, ans=0.0 2023-06-18 05:23:05,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=131760.0, ans=0.125 2023-06-18 05:23:12,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=131760.0, ans=0.125 2023-06-18 05:23:46,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=131880.0, ans=0.0 2023-06-18 05:23:59,707 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=15.0 2023-06-18 05:24:18,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=132000.0, ans=0.05 2023-06-18 05:24:19,977 INFO [train.py:996] (0/4) Epoch 1, batch 22000, loss[loss=0.2651, simple_loss=0.3166, pruned_loss=0.1068, over 21159.00 frames. ], tot_loss[loss=0.3148, simple_loss=0.3571, pruned_loss=0.1362, over 4251626.70 frames. ], batch size: 159, lr: 2.56e-02, grad_scale: 64.0 2023-06-18 05:24:21,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.88 vs. limit=10.0 2023-06-18 05:24:46,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=132060.0, ans=0.125 2023-06-18 05:25:30,646 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 3.717e+02 4.714e+02 6.490e+02 1.072e+03, threshold=9.428e+02, percent-clipped=6.0 2023-06-18 05:25:44,780 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. limit=6.0 2023-06-18 05:26:08,206 INFO [train.py:996] (0/4) Epoch 1, batch 22050, loss[loss=0.4469, simple_loss=0.4804, pruned_loss=0.2067, over 21294.00 frames. ], tot_loss[loss=0.3204, simple_loss=0.3632, pruned_loss=0.1388, over 4250883.75 frames. ], batch size: 549, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 05:26:28,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=132360.0, ans=0.0 2023-06-18 05:26:32,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=132360.0, ans=0.2 2023-06-18 05:27:24,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=132540.0, ans=0.2 2023-06-18 05:27:47,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=132600.0, ans=0.125 2023-06-18 05:27:48,449 INFO [train.py:996] (0/4) Epoch 1, batch 22100, loss[loss=0.3644, simple_loss=0.3972, pruned_loss=0.1658, over 21281.00 frames. ], tot_loss[loss=0.3344, simple_loss=0.376, pruned_loss=0.1463, over 4261663.09 frames. 
], batch size: 143, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 05:28:50,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=132780.0, ans=0.125 2023-06-18 05:28:54,538 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.580e+02 4.025e+02 4.912e+02 6.450e+02 1.246e+03, threshold=9.825e+02, percent-clipped=3.0 2023-06-18 05:29:32,061 INFO [train.py:996] (0/4) Epoch 1, batch 22150, loss[loss=0.3719, simple_loss=0.4043, pruned_loss=0.1697, over 21742.00 frames. ], tot_loss[loss=0.3381, simple_loss=0.3795, pruned_loss=0.1484, over 4270133.22 frames. ], batch size: 389, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 05:29:52,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=132960.0, ans=0.125 2023-06-18 05:30:07,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=132960.0, ans=0.125 2023-06-18 05:30:12,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=133020.0, ans=0.2 2023-06-18 05:30:34,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=133080.0, ans=0.0 2023-06-18 05:31:13,361 INFO [train.py:996] (0/4) Epoch 1, batch 22200, loss[loss=0.4211, simple_loss=0.5039, pruned_loss=0.1692, over 19647.00 frames. ], tot_loss[loss=0.3432, simple_loss=0.3837, pruned_loss=0.1514, over 4276729.99 frames. ], batch size: 702, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 05:32:01,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=133320.0, ans=0.0 2023-06-18 05:32:05,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133320.0, ans=0.1 2023-06-18 05:32:10,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=133380.0, ans=0.125 2023-06-18 05:32:16,577 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.785e+02 4.029e+02 4.889e+02 6.211e+02 1.093e+03, threshold=9.779e+02, percent-clipped=2.0 2023-06-18 05:32:59,492 INFO [train.py:996] (0/4) Epoch 1, batch 22250, loss[loss=0.3391, simple_loss=0.3674, pruned_loss=0.1555, over 21205.00 frames. ], tot_loss[loss=0.3485, simple_loss=0.3909, pruned_loss=0.153, over 4272144.88 frames. ], batch size: 608, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 05:33:03,710 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=20.23 vs. limit=15.0 2023-06-18 05:33:49,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=133620.0, ans=0.125 2023-06-18 05:33:54,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=133680.0, ans=0.2 2023-06-18 05:34:09,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.03 vs. limit=6.0 2023-06-18 05:34:39,754 INFO [train.py:996] (0/4) Epoch 1, batch 22300, loss[loss=0.4554, simple_loss=0.456, pruned_loss=0.2274, over 21639.00 frames. ], tot_loss[loss=0.3539, simple_loss=0.3944, pruned_loss=0.1567, over 4276611.27 frames. 
], batch size: 473, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 05:34:56,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=133860.0, ans=0.125 2023-06-18 05:34:58,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=133860.0, ans=0.125 2023-06-18 05:35:37,695 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.601e+02 3.602e+02 4.280e+02 5.421e+02 8.254e+02, threshold=8.559e+02, percent-clipped=0.0 2023-06-18 05:36:02,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=134040.0, ans=0.2 2023-06-18 05:36:20,521 INFO [train.py:996] (0/4) Epoch 1, batch 22350, loss[loss=0.3046, simple_loss=0.3441, pruned_loss=0.1326, over 21670.00 frames. ], tot_loss[loss=0.3538, simple_loss=0.3928, pruned_loss=0.1574, over 4279681.69 frames. ], batch size: 263, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 05:36:31,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=134100.0, ans=0.125 2023-06-18 05:36:53,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=134160.0, ans=0.2 2023-06-18 05:37:01,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=134220.0, ans=0.04949747468305833 2023-06-18 05:37:03,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-18 05:38:03,435 INFO [train.py:996] (0/4) Epoch 1, batch 22400, loss[loss=0.314, simple_loss=0.3516, pruned_loss=0.1382, over 21465.00 frames. ], tot_loss[loss=0.3455, simple_loss=0.3873, pruned_loss=0.1519, over 4286098.14 frames. ], batch size: 212, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 05:38:11,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134400.0, ans=0.1 2023-06-18 05:38:35,590 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=22.5 2023-06-18 05:38:43,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134520.0, ans=0.1 2023-06-18 05:38:46,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=134520.0, ans=0.125 2023-06-18 05:39:07,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.472e+02 4.159e+02 5.652e+02 9.879e+02, threshold=8.318e+02, percent-clipped=2.0 2023-06-18 05:39:39,909 INFO [train.py:996] (0/4) Epoch 1, batch 22450, loss[loss=0.3256, simple_loss=0.3605, pruned_loss=0.1453, over 21895.00 frames. ], tot_loss[loss=0.3394, simple_loss=0.3794, pruned_loss=0.1497, over 4279253.80 frames. 
], batch size: 373, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 05:39:40,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=134700.0, ans=0.2 2023-06-18 05:40:35,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=134820.0, ans=0.2 2023-06-18 05:41:23,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=134940.0, ans=0.5 2023-06-18 05:41:26,160 INFO [train.py:996] (0/4) Epoch 1, batch 22500, loss[loss=0.3816, simple_loss=0.4382, pruned_loss=0.1625, over 21214.00 frames. ], tot_loss[loss=0.3341, simple_loss=0.3724, pruned_loss=0.1479, over 4273538.13 frames. ], batch size: 549, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 05:42:29,711 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:42:40,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.359e+02 3.574e+02 4.487e+02 5.410e+02 9.033e+02, threshold=8.975e+02, percent-clipped=2.0 2023-06-18 05:43:06,901 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:43:06,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135240.0, ans=0.1 2023-06-18 05:43:09,853 INFO [train.py:996] (0/4) Epoch 1, batch 22550, loss[loss=0.4238, simple_loss=0.4354, pruned_loss=0.2061, over 21785.00 frames. ], tot_loss[loss=0.3375, simple_loss=0.3779, pruned_loss=0.1486, over 4277886.61 frames. ], batch size: 441, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 05:43:20,968 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.17 vs. limit=6.0 2023-06-18 05:44:08,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=135420.0, ans=0.0 2023-06-18 05:44:22,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=135480.0, ans=0.125 2023-06-18 05:44:59,051 INFO [train.py:996] (0/4) Epoch 1, batch 22600, loss[loss=0.3033, simple_loss=0.335, pruned_loss=0.1358, over 21193.00 frames. ], tot_loss[loss=0.3397, simple_loss=0.3812, pruned_loss=0.1491, over 4284765.42 frames. ], batch size: 159, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 05:45:38,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0 2023-06-18 05:45:50,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=135720.0, ans=0.125 2023-06-18 05:46:07,730 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 4.199e+02 5.117e+02 6.564e+02 1.237e+03, threshold=1.023e+03, percent-clipped=4.0 2023-06-18 05:46:38,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=135900.0, ans=0.125 2023-06-18 05:46:39,976 INFO [train.py:996] (0/4) Epoch 1, batch 22650, loss[loss=0.2948, simple_loss=0.3324, pruned_loss=0.1286, over 21113.00 frames. ], tot_loss[loss=0.3358, simple_loss=0.3773, pruned_loss=0.1471, over 4274870.06 frames. 
], batch size: 176, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 05:46:40,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=135900.0, ans=0.0 2023-06-18 05:47:08,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=135960.0, ans=0.125 2023-06-18 05:48:20,099 INFO [train.py:996] (0/4) Epoch 1, batch 22700, loss[loss=0.3039, simple_loss=0.3617, pruned_loss=0.123, over 20687.00 frames. ], tot_loss[loss=0.331, simple_loss=0.3705, pruned_loss=0.1458, over 4278372.14 frames. ], batch size: 607, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 05:48:40,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=136260.0, ans=0.0 2023-06-18 05:49:16,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=136320.0, ans=0.2 2023-06-18 05:49:23,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=136380.0, ans=0.125 2023-06-18 05:49:24,149 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.785e+02 4.714e+02 6.670e+02 1.093e+03, threshold=9.427e+02, percent-clipped=5.0 2023-06-18 05:49:41,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=136440.0, ans=0.125 2023-06-18 05:49:57,049 INFO [train.py:996] (0/4) Epoch 1, batch 22750, loss[loss=0.3788, simple_loss=0.4115, pruned_loss=0.173, over 21248.00 frames. ], tot_loss[loss=0.334, simple_loss=0.3716, pruned_loss=0.1482, over 4282330.24 frames. ], batch size: 549, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 05:50:22,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=136560.0, ans=0.2 2023-06-18 05:50:59,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=136680.0, ans=0.125 2023-06-18 05:51:24,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=136740.0, ans=0.1 2023-06-18 05:51:39,048 INFO [train.py:996] (0/4) Epoch 1, batch 22800, loss[loss=0.3781, simple_loss=0.4138, pruned_loss=0.1712, over 21851.00 frames. ], tot_loss[loss=0.3417, simple_loss=0.3784, pruned_loss=0.1525, over 4285532.83 frames. 
], batch size: 118, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 05:51:58,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=136860.0, ans=0.125 2023-06-18 05:52:10,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=136860.0, ans=0.05 2023-06-18 05:52:28,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=136920.0, ans=0.125 2023-06-18 05:52:47,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.251e+02 4.256e+02 5.590e+02 8.268e+02 1.334e+03, threshold=1.118e+03, percent-clipped=16.0 2023-06-18 05:52:52,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136980.0, ans=0.1 2023-06-18 05:53:20,673 INFO [train.py:996] (0/4) Epoch 1, batch 22850, loss[loss=0.2997, simple_loss=0.3401, pruned_loss=0.1297, over 22041.00 frames. ], tot_loss[loss=0.3401, simple_loss=0.3767, pruned_loss=0.1518, over 4285878.94 frames. ], batch size: 103, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 05:53:47,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-18 05:53:53,138 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. limit=6.0 2023-06-18 05:54:43,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=137340.0, ans=0.125 2023-06-18 05:55:09,109 INFO [train.py:996] (0/4) Epoch 1, batch 22900, loss[loss=0.4174, simple_loss=0.4893, pruned_loss=0.1728, over 21460.00 frames. ], tot_loss[loss=0.3384, simple_loss=0.3765, pruned_loss=0.1502, over 4279119.47 frames. ], batch size: 471, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 05:55:21,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=137400.0, ans=0.2 2023-06-18 05:55:22,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=137400.0, ans=0.0 2023-06-18 05:55:25,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=137460.0, ans=0.05 2023-06-18 05:55:53,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.27 vs. 
limit=6.0 2023-06-18 05:55:54,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137520.0, ans=0.1 2023-06-18 05:55:54,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=137520.0, ans=0.0 2023-06-18 05:56:13,236 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.811e+02 3.596e+02 4.279e+02 5.382e+02 9.756e+02, threshold=8.557e+02, percent-clipped=0.0 2023-06-18 05:56:22,853 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:56:40,893 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-18 05:56:43,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=137640.0, ans=0.125 2023-06-18 05:56:52,915 INFO [train.py:996] (0/4) Epoch 1, batch 22950, loss[loss=0.3373, simple_loss=0.4487, pruned_loss=0.113, over 20752.00 frames. ], tot_loss[loss=0.3433, simple_loss=0.3899, pruned_loss=0.1483, over 4282594.21 frames. ], batch size: 607, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 05:57:40,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=137820.0, ans=0.2 2023-06-18 05:58:34,374 INFO [train.py:996] (0/4) Epoch 1, batch 23000, loss[loss=0.4175, simple_loss=0.4333, pruned_loss=0.2009, over 21625.00 frames. ], tot_loss[loss=0.3387, simple_loss=0.39, pruned_loss=0.1437, over 4272855.68 frames. ], batch size: 471, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 05:58:37,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=138000.0, ans=0.125 2023-06-18 05:59:42,676 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 3.447e+02 4.093e+02 5.344e+02 1.227e+03, threshold=8.186e+02, percent-clipped=4.0 2023-06-18 06:00:15,832 INFO [train.py:996] (0/4) Epoch 1, batch 23050, loss[loss=0.3762, simple_loss=0.414, pruned_loss=0.1692, over 21709.00 frames. ], tot_loss[loss=0.3447, simple_loss=0.3932, pruned_loss=0.1481, over 4274490.88 frames. ], batch size: 351, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 06:01:20,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-18 06:01:29,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=138480.0, ans=0.0 2023-06-18 06:01:36,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=138480.0, ans=0.0 2023-06-18 06:01:43,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=138540.0, ans=0.1 2023-06-18 06:01:46,822 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 06:02:02,492 INFO [train.py:996] (0/4) Epoch 1, batch 23100, loss[loss=0.3157, simple_loss=0.3471, pruned_loss=0.1422, over 21893.00 frames. ], tot_loss[loss=0.3428, simple_loss=0.3882, pruned_loss=0.1487, over 4280890.04 frames. 
], batch size: 373, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 06:02:35,392 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-18 06:03:00,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=138780.0, ans=0.125 2023-06-18 06:03:11,218 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.259e+02 3.616e+02 4.226e+02 5.778e+02 1.152e+03, threshold=8.452e+02, percent-clipped=7.0 2023-06-18 06:03:16,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=138780.0, ans=0.125 2023-06-18 06:03:37,803 INFO [train.py:996] (0/4) Epoch 1, batch 23150, loss[loss=0.2483, simple_loss=0.2881, pruned_loss=0.1042, over 20767.00 frames. ], tot_loss[loss=0.3363, simple_loss=0.3795, pruned_loss=0.1465, over 4284907.79 frames. ], batch size: 609, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 06:04:22,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=139020.0, ans=0.125 2023-06-18 06:04:51,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=139080.0, ans=0.125 2023-06-18 06:05:15,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.12 vs. limit=12.0 2023-06-18 06:05:23,902 INFO [train.py:996] (0/4) Epoch 1, batch 23200, loss[loss=0.3446, simple_loss=0.3803, pruned_loss=0.1545, over 21859.00 frames. ], tot_loss[loss=0.3357, simple_loss=0.3774, pruned_loss=0.1469, over 4283178.09 frames. ], batch size: 391, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 06:05:53,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=139260.0, ans=0.125 2023-06-18 06:06:12,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=139320.0, ans=0.2 2023-06-18 06:06:26,397 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.420e+02 3.602e+02 4.092e+02 5.263e+02 8.445e+02, threshold=8.184e+02, percent-clipped=0.0 2023-06-18 06:06:52,982 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.62 vs. limit=15.0 2023-06-18 06:06:55,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=139440.0, ans=0.02 2023-06-18 06:06:58,370 INFO [train.py:996] (0/4) Epoch 1, batch 23250, loss[loss=0.3582, simple_loss=0.3941, pruned_loss=0.1612, over 21902.00 frames. ], tot_loss[loss=0.3377, simple_loss=0.3784, pruned_loss=0.1485, over 4291011.18 frames. ], batch size: 333, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 06:07:21,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=139500.0, ans=0.04949747468305833 2023-06-18 06:07:47,936 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.25 vs. limit=10.0 2023-06-18 06:08:52,734 INFO [train.py:996] (0/4) Epoch 1, batch 23300, loss[loss=0.3732, simple_loss=0.4449, pruned_loss=0.1507, over 21397.00 frames. 
], tot_loss[loss=0.3469, simple_loss=0.389, pruned_loss=0.1524, over 4288659.92 frames. ], batch size: 211, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 06:09:58,005 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.429e+02 3.912e+02 5.511e+02 7.628e+02 1.360e+03, threshold=1.102e+03, percent-clipped=20.0 2023-06-18 06:10:02,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=139980.0, ans=0.0 2023-06-18 06:10:38,101 INFO [train.py:996] (0/4) Epoch 1, batch 23350, loss[loss=0.2402, simple_loss=0.3072, pruned_loss=0.08656, over 21608.00 frames. ], tot_loss[loss=0.3486, simple_loss=0.3945, pruned_loss=0.1514, over 4286210.60 frames. ], batch size: 230, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 06:10:43,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=140100.0, ans=0.0 2023-06-18 06:11:01,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=140160.0, ans=0.0 2023-06-18 06:11:48,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=140280.0, ans=0.125 2023-06-18 06:12:04,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=140340.0, ans=0.0 2023-06-18 06:12:11,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=140340.0, ans=0.125 2023-06-18 06:12:18,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=140400.0, ans=0.125 2023-06-18 06:12:19,211 INFO [train.py:996] (0/4) Epoch 1, batch 23400, loss[loss=0.3158, simple_loss=0.3648, pruned_loss=0.1334, over 21518.00 frames. ], tot_loss[loss=0.3397, simple_loss=0.3869, pruned_loss=0.1462, over 4287226.20 frames. ], batch size: 211, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 06:12:50,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=140460.0, ans=0.125 2023-06-18 06:12:50,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=140460.0, ans=0.125 2023-06-18 06:12:51,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=140460.0, ans=0.0 2023-06-18 06:12:59,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-18 06:13:06,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=140520.0, ans=0.0 2023-06-18 06:13:17,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.74 vs. 
limit=22.5 2023-06-18 06:13:27,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.226e+02 4.219e+02 5.285e+02 8.873e+02, threshold=8.438e+02, percent-clipped=0.0 2023-06-18 06:13:45,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=140640.0, ans=0.1 2023-06-18 06:14:00,449 INFO [train.py:996] (0/4) Epoch 1, batch 23450, loss[loss=0.4596, simple_loss=0.4605, pruned_loss=0.2293, over 21346.00 frames. ], tot_loss[loss=0.3459, simple_loss=0.3897, pruned_loss=0.1511, over 4296488.66 frames. ], batch size: 507, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 06:14:53,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=140820.0, ans=0.95 2023-06-18 06:15:05,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=140880.0, ans=0.125 2023-06-18 06:15:29,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=140940.0, ans=0.1 2023-06-18 06:15:39,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=140940.0, ans=0.125 2023-06-18 06:15:41,740 INFO [train.py:996] (0/4) Epoch 1, batch 23500, loss[loss=0.3413, simple_loss=0.3725, pruned_loss=0.155, over 21155.00 frames. ], tot_loss[loss=0.3455, simple_loss=0.3878, pruned_loss=0.1516, over 4292187.03 frames. ], batch size: 607, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 06:16:35,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=15.0 2023-06-18 06:16:37,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-18 06:16:44,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.87 vs. limit=6.0 2023-06-18 06:16:49,424 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.488e+02 3.727e+02 4.969e+02 6.081e+02 9.256e+02, threshold=9.939e+02, percent-clipped=2.0 2023-06-18 06:16:55,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2023-06-18 06:17:07,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=141240.0, ans=0.125 2023-06-18 06:17:22,068 INFO [train.py:996] (0/4) Epoch 1, batch 23550, loss[loss=0.3257, simple_loss=0.3539, pruned_loss=0.1487, over 21319.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.3825, pruned_loss=0.1505, over 4285228.37 frames. ], batch size: 131, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 06:17:31,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141300.0, ans=0.1 2023-06-18 06:18:39,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=141480.0, ans=0.125 2023-06-18 06:19:05,132 INFO [train.py:996] (0/4) Epoch 1, batch 23600, loss[loss=0.4098, simple_loss=0.4424, pruned_loss=0.1886, over 21237.00 frames. 
], tot_loss[loss=0.3445, simple_loss=0.3851, pruned_loss=0.152, over 4287543.71 frames. ], batch size: 143, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 06:20:21,069 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.501e+02 3.688e+02 4.463e+02 5.931e+02 8.627e+02, threshold=8.927e+02, percent-clipped=0.0 2023-06-18 06:20:36,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=141840.0, ans=0.0 2023-06-18 06:20:48,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=141840.0, ans=0.125 2023-06-18 06:20:59,607 INFO [train.py:996] (0/4) Epoch 1, batch 23650, loss[loss=0.4186, simple_loss=0.4517, pruned_loss=0.1928, over 21575.00 frames. ], tot_loss[loss=0.3419, simple_loss=0.3844, pruned_loss=0.1497, over 4280727.01 frames. ], batch size: 414, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 06:21:15,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=141960.0, ans=0.125 2023-06-18 06:21:53,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.50 vs. limit=15.0 2023-06-18 06:22:02,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=142080.0, ans=0.0 2023-06-18 06:22:42,935 INFO [train.py:996] (0/4) Epoch 1, batch 23700, loss[loss=0.2615, simple_loss=0.3246, pruned_loss=0.0992, over 21616.00 frames. ], tot_loss[loss=0.3431, simple_loss=0.3876, pruned_loss=0.1493, over 4281086.01 frames. ], batch size: 230, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 06:22:43,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=142200.0, ans=0.0 2023-06-18 06:23:53,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.745e+02 4.445e+02 5.198e+02 9.027e+02, threshold=8.891e+02, percent-clipped=1.0 2023-06-18 06:23:54,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=142380.0, ans=0.125 2023-06-18 06:24:32,876 INFO [train.py:996] (0/4) Epoch 1, batch 23750, loss[loss=0.4232, simple_loss=0.4501, pruned_loss=0.1981, over 21768.00 frames. ], tot_loss[loss=0.3438, simple_loss=0.3895, pruned_loss=0.1491, over 4276691.25 frames. ], batch size: 441, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 06:25:13,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=15.0 2023-06-18 06:25:41,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=142680.0, ans=0.125 2023-06-18 06:26:17,134 INFO [train.py:996] (0/4) Epoch 1, batch 23800, loss[loss=0.4214, simple_loss=0.4818, pruned_loss=0.1805, over 21218.00 frames. ], tot_loss[loss=0.3378, simple_loss=0.3858, pruned_loss=0.1449, over 4268705.22 frames. 
], batch size: 548, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 06:26:22,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=142800.0, ans=0.2 2023-06-18 06:27:27,677 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 3.344e+02 4.883e+02 6.088e+02 1.077e+03, threshold=9.766e+02, percent-clipped=8.0 2023-06-18 06:27:53,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=143040.0, ans=0.1 2023-06-18 06:28:06,755 INFO [train.py:996] (0/4) Epoch 1, batch 23850, loss[loss=0.4625, simple_loss=0.4683, pruned_loss=0.2284, over 21405.00 frames. ], tot_loss[loss=0.3505, simple_loss=0.3997, pruned_loss=0.1506, over 4261815.59 frames. ], batch size: 471, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:28:34,806 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.46 vs. limit=15.0 2023-06-18 06:29:13,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=143280.0, ans=0.0 2023-06-18 06:29:15,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=143280.0, ans=0.125 2023-06-18 06:29:18,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=143280.0, ans=0.125 2023-06-18 06:29:22,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=22.5 2023-06-18 06:29:23,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=143280.0, ans=0.0 2023-06-18 06:29:48,586 INFO [train.py:996] (0/4) Epoch 1, batch 23900, loss[loss=0.4226, simple_loss=0.4747, pruned_loss=0.1853, over 21446.00 frames. ], tot_loss[loss=0.3576, simple_loss=0.408, pruned_loss=0.1536, over 4262313.65 frames. ], batch size: 471, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:30:18,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=143460.0, ans=0.125 2023-06-18 06:30:34,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=143520.0, ans=0.2 2023-06-18 06:30:42,008 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0 2023-06-18 06:30:56,981 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.496e+02 3.761e+02 4.724e+02 6.134e+02 1.060e+03, threshold=9.448e+02, percent-clipped=2.0 2023-06-18 06:31:30,074 INFO [train.py:996] (0/4) Epoch 1, batch 23950, loss[loss=0.3321, simple_loss=0.3668, pruned_loss=0.1487, over 20644.00 frames. ], tot_loss[loss=0.3518, simple_loss=0.3985, pruned_loss=0.1525, over 4262070.46 frames. 
], batch size: 607, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:31:55,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=143760.0, ans=0.125 2023-06-18 06:32:13,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=143820.0, ans=0.05 2023-06-18 06:32:20,224 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 06:32:48,619 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-18 06:33:11,021 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-24000.pt 2023-06-18 06:33:13,949 INFO [train.py:996] (0/4) Epoch 1, batch 24000, loss[loss=0.4183, simple_loss=0.4431, pruned_loss=0.1968, over 21584.00 frames. ], tot_loss[loss=0.3562, simple_loss=0.3997, pruned_loss=0.1563, over 4264029.19 frames. ], batch size: 389, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:33:13,950 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 06:33:36,587 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.32, simple_loss=0.4122, pruned_loss=0.1139, over 1796401.00 frames. 2023-06-18 06:33:36,588 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 06:33:51,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144000.0, ans=0.1 2023-06-18 06:34:04,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=144060.0, ans=0.0 2023-06-18 06:34:37,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-18 06:34:44,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=144180.0, ans=0.125 2023-06-18 06:34:48,622 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 3.687e+02 4.611e+02 5.908e+02 1.149e+03, threshold=9.222e+02, percent-clipped=2.0 2023-06-18 06:35:11,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144240.0, ans=0.1 2023-06-18 06:35:11,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=144240.0, ans=0.05 2023-06-18 06:35:20,191 INFO [train.py:996] (0/4) Epoch 1, batch 24050, loss[loss=0.2911, simple_loss=0.3527, pruned_loss=0.1147, over 21151.00 frames. ], tot_loss[loss=0.3573, simple_loss=0.4013, pruned_loss=0.1566, over 4275100.78 frames. ], batch size: 143, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:36:02,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=144420.0, ans=0.125 2023-06-18 06:36:15,560 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.26 vs. 
limit=15.0 2023-06-18 06:36:37,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=144480.0, ans=0.125 2023-06-18 06:37:07,459 INFO [train.py:996] (0/4) Epoch 1, batch 24100, loss[loss=0.4064, simple_loss=0.4476, pruned_loss=0.1826, over 21557.00 frames. ], tot_loss[loss=0.3516, simple_loss=0.3989, pruned_loss=0.1522, over 4276462.90 frames. ], batch size: 414, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 06:37:42,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=144660.0, ans=0.125 2023-06-18 06:37:47,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=144720.0, ans=0.07 2023-06-18 06:38:02,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=144720.0, ans=0.125 2023-06-18 06:38:04,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=144720.0, ans=0.125 2023-06-18 06:38:04,659 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. limit=6.0 2023-06-18 06:38:12,999 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 3.321e+02 4.048e+02 5.410e+02 1.299e+03, threshold=8.096e+02, percent-clipped=1.0 2023-06-18 06:38:13,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=144780.0, ans=0.0 2023-06-18 06:38:25,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=144780.0, ans=0.1 2023-06-18 06:38:49,127 INFO [train.py:996] (0/4) Epoch 1, batch 24150, loss[loss=0.3923, simple_loss=0.4139, pruned_loss=0.1854, over 21802.00 frames. ], tot_loss[loss=0.3543, simple_loss=0.3984, pruned_loss=0.1551, over 4281423.66 frames. ], batch size: 441, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 06:38:52,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=20.23 vs. limit=15.0 2023-06-18 06:39:33,694 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.07 vs. limit=22.5 2023-06-18 06:39:39,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=145020.0, ans=0.07 2023-06-18 06:40:09,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=145140.0, ans=0.125 2023-06-18 06:40:31,840 INFO [train.py:996] (0/4) Epoch 1, batch 24200, loss[loss=0.3352, simple_loss=0.387, pruned_loss=0.1417, over 21629.00 frames. ], tot_loss[loss=0.3561, simple_loss=0.3993, pruned_loss=0.1565, over 4282049.56 frames. ], batch size: 230, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 06:40:36,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=24.83 vs. 
limit=22.5 2023-06-18 06:41:49,443 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.922e+02 3.664e+02 4.494e+02 5.781e+02 1.168e+03, threshold=8.988e+02, percent-clipped=4.0 2023-06-18 06:42:18,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=145440.0, ans=10.0 2023-06-18 06:42:21,391 INFO [train.py:996] (0/4) Epoch 1, batch 24250, loss[loss=0.2648, simple_loss=0.3384, pruned_loss=0.09559, over 21291.00 frames. ], tot_loss[loss=0.3431, simple_loss=0.3942, pruned_loss=0.146, over 4283037.90 frames. ], batch size: 176, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 06:42:24,198 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-18 06:42:50,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=145560.0, ans=0.125 2023-06-18 06:43:11,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=145620.0, ans=0.125 2023-06-18 06:43:33,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=145680.0, ans=0.0 2023-06-18 06:43:45,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=145740.0, ans=0.0 2023-06-18 06:44:01,978 INFO [train.py:996] (0/4) Epoch 1, batch 24300, loss[loss=0.2547, simple_loss=0.3229, pruned_loss=0.09327, over 21770.00 frames. ], tot_loss[loss=0.3281, simple_loss=0.3831, pruned_loss=0.1365, over 4272798.35 frames. ], batch size: 332, lr: 2.44e-02, grad_scale: 16.0 2023-06-18 06:44:07,659 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 06:45:13,978 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 3.046e+02 3.863e+02 5.440e+02 1.504e+03, threshold=7.726e+02, percent-clipped=4.0 2023-06-18 06:45:24,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-18 06:45:43,172 INFO [train.py:996] (0/4) Epoch 1, batch 24350, loss[loss=0.2896, simple_loss=0.3358, pruned_loss=0.1217, over 21476.00 frames. ], tot_loss[loss=0.3267, simple_loss=0.3786, pruned_loss=0.1374, over 4274345.30 frames. ], batch size: 177, lr: 2.44e-02, grad_scale: 16.0 2023-06-18 06:45:43,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=146100.0, ans=0.125 2023-06-18 06:45:54,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=146100.0, ans=0.125 2023-06-18 06:46:54,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=15.0 2023-06-18 06:47:08,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=146340.0, ans=0.125 2023-06-18 06:47:32,126 INFO [train.py:996] (0/4) Epoch 1, batch 24400, loss[loss=0.3167, simple_loss=0.3682, pruned_loss=0.1326, over 21319.00 frames. 
], tot_loss[loss=0.3373, simple_loss=0.3855, pruned_loss=0.1445, over 4275253.94 frames. ], batch size: 548, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 06:47:40,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146400.0, ans=0.1 2023-06-18 06:47:43,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.00 vs. limit=15.0 2023-06-18 06:48:11,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-06-18 06:48:44,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.695e+02 4.218e+02 5.437e+02 7.202e+02 1.402e+03, threshold=1.087e+03, percent-clipped=21.0 2023-06-18 06:48:51,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=146640.0, ans=0.125 2023-06-18 06:49:08,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=146640.0, ans=0.2 2023-06-18 06:49:10,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=146640.0, ans=0.04949747468305833 2023-06-18 06:49:14,363 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-18 06:49:14,781 INFO [train.py:996] (0/4) Epoch 1, batch 24450, loss[loss=0.3444, simple_loss=0.4174, pruned_loss=0.1357, over 21727.00 frames. ], tot_loss[loss=0.3396, simple_loss=0.3883, pruned_loss=0.1455, over 4268321.85 frames. ], batch size: 414, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 06:50:22,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=146880.0, ans=0.125 2023-06-18 06:50:25,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=146880.0, ans=0.125 2023-06-18 06:50:37,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=146880.0, ans=0.125 2023-06-18 06:50:56,344 INFO [train.py:996] (0/4) Epoch 1, batch 24500, loss[loss=0.3614, simple_loss=0.396, pruned_loss=0.1634, over 21789.00 frames. ], tot_loss[loss=0.3401, simple_loss=0.3881, pruned_loss=0.1461, over 4266025.73 frames. 
], batch size: 441, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:51:05,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=147000.0, ans=10.0 2023-06-18 06:51:07,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147000.0, ans=0.1 2023-06-18 06:51:15,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=147000.0, ans=0.0 2023-06-18 06:51:33,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=147060.0, ans=0.2 2023-06-18 06:52:14,946 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.363e+02 3.850e+02 5.028e+02 6.051e+02 9.604e+02, threshold=1.006e+03, percent-clipped=0.0 2023-06-18 06:52:28,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=12.0 2023-06-18 06:52:44,349 INFO [train.py:996] (0/4) Epoch 1, batch 24550, loss[loss=0.4392, simple_loss=0.464, pruned_loss=0.2072, over 21338.00 frames. ], tot_loss[loss=0.3471, simple_loss=0.393, pruned_loss=0.1506, over 4273279.90 frames. ], batch size: 507, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:53:02,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=147300.0, ans=0.2 2023-06-18 06:53:34,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-06-18 06:53:53,324 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=22.5 2023-06-18 06:54:08,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=147540.0, ans=15.0 2023-06-18 06:54:23,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147540.0, ans=0.1 2023-06-18 06:54:26,344 INFO [train.py:996] (0/4) Epoch 1, batch 24600, loss[loss=0.3594, simple_loss=0.3903, pruned_loss=0.1642, over 21815.00 frames. ], tot_loss[loss=0.3445, simple_loss=0.3876, pruned_loss=0.1507, over 4271895.14 frames. ], batch size: 352, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:54:51,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=147660.0, ans=0.125 2023-06-18 06:55:03,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147720.0, ans=0.1 2023-06-18 06:55:18,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=147720.0, ans=0.0 2023-06-18 06:55:34,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-18 06:55:38,406 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.603e+02 4.230e+02 5.450e+02 1.074e+03, threshold=8.460e+02, percent-clipped=1.0 2023-06-18 06:56:08,528 INFO [train.py:996] (0/4) Epoch 1, batch 24650, loss[loss=0.2905, simple_loss=0.3375, pruned_loss=0.1218, over 15718.00 frames. 
], tot_loss[loss=0.3374, simple_loss=0.3791, pruned_loss=0.1479, over 4272437.78 frames. ], batch size: 63, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:56:19,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=147900.0, ans=0.0 2023-06-18 06:56:21,389 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.25 vs. limit=22.5 2023-06-18 06:56:23,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.07 vs. limit=6.0 2023-06-18 06:56:57,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148020.0, ans=0.1 2023-06-18 06:57:24,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=148080.0, ans=0.07 2023-06-18 06:57:50,931 INFO [train.py:996] (0/4) Epoch 1, batch 24700, loss[loss=0.2849, simple_loss=0.334, pruned_loss=0.1179, over 21457.00 frames. ], tot_loss[loss=0.3319, simple_loss=0.3747, pruned_loss=0.1445, over 4277408.47 frames. ], batch size: 212, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:58:03,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148200.0, ans=0.1 2023-06-18 06:58:43,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=148320.0, ans=0.125 2023-06-18 06:58:54,027 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0 2023-06-18 06:59:03,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.227e+02 3.816e+02 4.904e+02 7.765e+02, threshold=7.633e+02, percent-clipped=0.0 2023-06-18 06:59:03,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=148380.0, ans=0.0 2023-06-18 06:59:04,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5 2023-06-18 06:59:05,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=148380.0, ans=0.125 2023-06-18 06:59:11,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=148440.0, ans=0.125 2023-06-18 06:59:32,586 INFO [train.py:996] (0/4) Epoch 1, batch 24750, loss[loss=0.3141, simple_loss=0.3446, pruned_loss=0.1419, over 21989.00 frames. ], tot_loss[loss=0.3245, simple_loss=0.3672, pruned_loss=0.1409, over 4273225.79 frames. ], batch size: 119, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 07:00:07,377 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.67 vs. 
limit=6.0 2023-06-18 07:00:18,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=148620.0, ans=0.125 2023-06-18 07:00:56,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148740.0, ans=0.1 2023-06-18 07:01:01,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=148740.0, ans=0.125 2023-06-18 07:01:09,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=148740.0, ans=0.2 2023-06-18 07:01:14,224 INFO [train.py:996] (0/4) Epoch 1, batch 24800, loss[loss=0.3712, simple_loss=0.3962, pruned_loss=0.1731, over 21819.00 frames. ], tot_loss[loss=0.3214, simple_loss=0.362, pruned_loss=0.1404, over 4277788.50 frames. ], batch size: 414, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 07:02:27,065 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.589e+02 4.591e+02 5.888e+02 8.855e+02, threshold=9.183e+02, percent-clipped=11.0 2023-06-18 07:02:43,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=149040.0, ans=0.0 2023-06-18 07:02:50,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=149100.0, ans=0.125 2023-06-18 07:02:56,572 INFO [train.py:996] (0/4) Epoch 1, batch 24850, loss[loss=0.2546, simple_loss=0.2956, pruned_loss=0.1068, over 21342.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3636, pruned_loss=0.1423, over 4288608.91 frames. ], batch size: 131, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 07:03:08,141 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.26 vs. limit=22.5 2023-06-18 07:03:15,502 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:03:39,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=149220.0, ans=0.0 2023-06-18 07:03:45,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=149220.0, ans=0.1 2023-06-18 07:04:39,634 INFO [train.py:996] (0/4) Epoch 1, batch 24900, loss[loss=0.3618, simple_loss=0.4079, pruned_loss=0.1579, over 21418.00 frames. ], tot_loss[loss=0.3281, simple_loss=0.3686, pruned_loss=0.1438, over 4287980.27 frames. ], batch size: 131, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 07:05:40,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.38 vs. limit=22.5 2023-06-18 07:05:53,726 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 3.811e+02 4.758e+02 6.118e+02 1.056e+03, threshold=9.515e+02, percent-clipped=2.0 2023-06-18 07:06:23,769 INFO [train.py:996] (0/4) Epoch 1, batch 24950, loss[loss=0.4801, simple_loss=0.4799, pruned_loss=0.2401, over 21435.00 frames. ], tot_loss[loss=0.3412, simple_loss=0.3798, pruned_loss=0.1513, over 4288683.19 frames. ], batch size: 510, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:06:28,473 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. 
limit=15.0 2023-06-18 07:06:46,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-18 07:06:50,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=149760.0, ans=0.0 2023-06-18 07:07:08,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=149820.0, ans=0.1 2023-06-18 07:07:23,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=149880.0, ans=0.0 2023-06-18 07:07:37,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=149880.0, ans=0.0 2023-06-18 07:07:58,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.13 vs. limit=6.0 2023-06-18 07:07:58,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=149940.0, ans=0.025 2023-06-18 07:08:02,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=149940.0, ans=0.0 2023-06-18 07:08:08,501 INFO [train.py:996] (0/4) Epoch 1, batch 25000, loss[loss=0.3339, simple_loss=0.3784, pruned_loss=0.1447, over 21642.00 frames. ], tot_loss[loss=0.3455, simple_loss=0.3864, pruned_loss=0.1523, over 4287911.18 frames. ], batch size: 298, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:08:08,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150000.0, ans=0.1 2023-06-18 07:08:22,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=150000.0, ans=0.125 2023-06-18 07:08:38,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=150060.0, ans=0.125 2023-06-18 07:08:39,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150060.0, ans=0.1 2023-06-18 07:08:41,095 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:09:19,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150180.0, ans=0.1 2023-06-18 07:09:27,600 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.657e+02 3.412e+02 4.030e+02 5.230e+02 1.013e+03, threshold=8.059e+02, percent-clipped=2.0 2023-06-18 07:09:39,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150240.0, ans=0.1 2023-06-18 07:09:57,200 INFO [train.py:996] (0/4) Epoch 1, batch 25050, loss[loss=0.286, simple_loss=0.3249, pruned_loss=0.1236, over 21733.00 frames. ], tot_loss[loss=0.3388, simple_loss=0.3787, pruned_loss=0.1494, over 4273928.78 frames. ], batch size: 124, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:09:58,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.27 vs. 
limit=15.0 2023-06-18 07:11:22,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=150540.0, ans=0.0 2023-06-18 07:11:40,511 INFO [train.py:996] (0/4) Epoch 1, batch 25100, loss[loss=0.3332, simple_loss=0.4045, pruned_loss=0.1309, over 21689.00 frames. ], tot_loss[loss=0.3327, simple_loss=0.3722, pruned_loss=0.1466, over 4276878.13 frames. ], batch size: 332, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:12:29,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=150720.0, ans=0.125 2023-06-18 07:12:52,244 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.624e+02 4.936e+02 6.636e+02 1.221e+03, threshold=9.872e+02, percent-clipped=16.0 2023-06-18 07:12:52,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=150780.0, ans=0.2 2023-06-18 07:13:15,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=150900.0, ans=0.2 2023-06-18 07:13:16,006 INFO [train.py:996] (0/4) Epoch 1, batch 25150, loss[loss=0.3693, simple_loss=0.4333, pruned_loss=0.1527, over 21459.00 frames. ], tot_loss[loss=0.3291, simple_loss=0.3733, pruned_loss=0.1425, over 4267502.59 frames. ], batch size: 471, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:13:42,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150960.0, ans=0.1 2023-06-18 07:14:47,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=151140.0, ans=0.07 2023-06-18 07:14:56,933 INFO [train.py:996] (0/4) Epoch 1, batch 25200, loss[loss=0.2764, simple_loss=0.3492, pruned_loss=0.1018, over 21556.00 frames. ], tot_loss[loss=0.3245, simple_loss=0.371, pruned_loss=0.139, over 4269283.35 frames. ], batch size: 230, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 07:15:15,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=151200.0, ans=0.125 2023-06-18 07:15:27,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=151260.0, ans=0.5 2023-06-18 07:15:46,587 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:16:13,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=151380.0, ans=0.125 2023-06-18 07:16:14,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.108e+02 3.238e+02 4.132e+02 5.215e+02 8.390e+02, threshold=8.263e+02, percent-clipped=0.0 2023-06-18 07:16:38,474 INFO [train.py:996] (0/4) Epoch 1, batch 25250, loss[loss=0.3014, simple_loss=0.3503, pruned_loss=0.1263, over 21680.00 frames. ], tot_loss[loss=0.3188, simple_loss=0.3672, pruned_loss=0.1353, over 4267624.19 frames. ], batch size: 298, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 07:17:28,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=151620.0, ans=0.0 2023-06-18 07:17:47,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. 
limit=6.0 2023-06-18 07:17:57,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=151680.0, ans=0.125 2023-06-18 07:18:05,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=151740.0, ans=0.05 2023-06-18 07:18:07,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-18 07:18:21,306 INFO [train.py:996] (0/4) Epoch 1, batch 25300, loss[loss=0.3111, simple_loss=0.3643, pruned_loss=0.129, over 21633.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3646, pruned_loss=0.1355, over 4262097.86 frames. ], batch size: 263, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 07:18:44,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=151860.0, ans=0.0 2023-06-18 07:19:04,171 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.28 vs. limit=15.0 2023-06-18 07:19:23,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151920.0, ans=0.1 2023-06-18 07:19:23,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=151920.0, ans=0.2 2023-06-18 07:19:35,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=151980.0, ans=0.125 2023-06-18 07:19:39,723 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.535e+02 3.569e+02 4.461e+02 5.778e+02 9.355e+02, threshold=8.922e+02, percent-clipped=5.0 2023-06-18 07:20:03,968 INFO [train.py:996] (0/4) Epoch 1, batch 25350, loss[loss=0.3306, simple_loss=0.3782, pruned_loss=0.1415, over 21523.00 frames. ], tot_loss[loss=0.3193, simple_loss=0.3674, pruned_loss=0.1356, over 4258236.75 frames. ], batch size: 389, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 07:20:29,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=152160.0, ans=0.025 2023-06-18 07:20:31,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=152160.0, ans=0.125 2023-06-18 07:20:36,778 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-06-18 07:20:38,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=152160.0, ans=0.125 2023-06-18 07:21:12,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=152280.0, ans=0.5 2023-06-18 07:21:12,842 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:21:14,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=152280.0, ans=0.125 2023-06-18 07:21:16,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.25 vs. 
limit=22.5 2023-06-18 07:21:17,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152280.0, ans=0.1 2023-06-18 07:21:25,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=152340.0, ans=0.2 2023-06-18 07:21:27,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=152340.0, ans=10.0 2023-06-18 07:21:30,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=152340.0, ans=0.0 2023-06-18 07:21:39,293 INFO [train.py:996] (0/4) Epoch 1, batch 25400, loss[loss=0.2838, simple_loss=0.3567, pruned_loss=0.1055, over 21594.00 frames. ], tot_loss[loss=0.3184, simple_loss=0.3656, pruned_loss=0.1356, over 4257127.01 frames. ], batch size: 441, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:21:46,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152400.0, ans=0.1 2023-06-18 07:22:01,194 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-18 07:22:19,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.01 vs. limit=12.0 2023-06-18 07:22:51,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=152580.0, ans=0.125 2023-06-18 07:22:56,316 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.405e+02 3.535e+02 4.232e+02 5.710e+02 1.225e+03, threshold=8.465e+02, percent-clipped=5.0 2023-06-18 07:23:20,778 INFO [train.py:996] (0/4) Epoch 1, batch 25450, loss[loss=0.3318, simple_loss=0.3645, pruned_loss=0.1496, over 21814.00 frames. ], tot_loss[loss=0.3208, simple_loss=0.3664, pruned_loss=0.1376, over 4252369.42 frames. ], batch size: 282, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:23:28,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.86 vs. limit=15.0 2023-06-18 07:23:40,423 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-18 07:23:54,420 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2023-06-18 07:24:03,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=152820.0, ans=0.0 2023-06-18 07:24:30,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=152880.0, ans=0.125 2023-06-18 07:25:01,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=152940.0, ans=0.125 2023-06-18 07:25:03,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=153000.0, ans=0.0 2023-06-18 07:25:04,165 INFO [train.py:996] (0/4) Epoch 1, batch 25500, loss[loss=0.3784, simple_loss=0.4257, pruned_loss=0.1655, over 21648.00 frames. 
], tot_loss[loss=0.3202, simple_loss=0.3688, pruned_loss=0.1358, over 4263291.80 frames. ], batch size: 389, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:25:32,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=153060.0, ans=0.0 2023-06-18 07:25:40,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153060.0, ans=0.1 2023-06-18 07:25:41,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-18 07:25:46,312 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.59 vs. limit=5.0 2023-06-18 07:25:47,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=153120.0, ans=0.125 2023-06-18 07:26:22,541 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 3.487e+02 4.551e+02 5.429e+02 1.003e+03, threshold=9.102e+02, percent-clipped=2.0 2023-06-18 07:26:23,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=153180.0, ans=0.0 2023-06-18 07:26:36,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=153240.0, ans=0.125 2023-06-18 07:26:43,726 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.31 vs. limit=6.0 2023-06-18 07:26:52,247 INFO [train.py:996] (0/4) Epoch 1, batch 25550, loss[loss=0.2978, simple_loss=0.3842, pruned_loss=0.1057, over 21633.00 frames. ], tot_loss[loss=0.3243, simple_loss=0.3759, pruned_loss=0.1363, over 4271019.82 frames. ], batch size: 389, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:28:28,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=153600.0, ans=0.125 2023-06-18 07:28:34,441 INFO [train.py:996] (0/4) Epoch 1, batch 25600, loss[loss=0.4031, simple_loss=0.4331, pruned_loss=0.1866, over 21824.00 frames. ], tot_loss[loss=0.328, simple_loss=0.3812, pruned_loss=0.1375, over 4272834.23 frames. ], batch size: 441, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:28:40,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=153600.0, ans=0.125 2023-06-18 07:29:09,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=153660.0, ans=0.015 2023-06-18 07:29:36,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 3.502e+02 4.172e+02 4.983e+02 8.051e+02, threshold=8.344e+02, percent-clipped=0.0 2023-06-18 07:29:40,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=153780.0, ans=0.0 2023-06-18 07:29:59,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=153840.0, ans=0.125 2023-06-18 07:30:10,900 INFO [train.py:996] (0/4) Epoch 1, batch 25650, loss[loss=0.3342, simple_loss=0.3616, pruned_loss=0.1534, over 21717.00 frames. ], tot_loss[loss=0.3328, simple_loss=0.3829, pruned_loss=0.1414, over 4265004.00 frames. 
], batch size: 300, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:30:13,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-18 07:30:23,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=153900.0, ans=0.0 2023-06-18 07:31:08,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=154080.0, ans=0.125 2023-06-18 07:31:46,155 INFO [train.py:996] (0/4) Epoch 1, batch 25700, loss[loss=0.4854, simple_loss=0.5371, pruned_loss=0.2169, over 19771.00 frames. ], tot_loss[loss=0.3323, simple_loss=0.3798, pruned_loss=0.1424, over 4269151.49 frames. ], batch size: 702, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:31:59,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=154200.0, ans=0.0 2023-06-18 07:32:18,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=154260.0, ans=0.125 2023-06-18 07:32:21,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.72 vs. limit=6.0 2023-06-18 07:32:24,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=154260.0, ans=0.1 2023-06-18 07:32:31,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=154320.0, ans=0.125 2023-06-18 07:32:37,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=22.5 2023-06-18 07:32:55,714 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=22.5 2023-06-18 07:33:00,065 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.737e+02 4.412e+02 5.511e+02 6.649e+02 1.111e+03, threshold=1.102e+03, percent-clipped=12.0 2023-06-18 07:33:29,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=154500.0, ans=0.125 2023-06-18 07:33:30,735 INFO [train.py:996] (0/4) Epoch 1, batch 25750, loss[loss=0.3868, simple_loss=0.424, pruned_loss=0.1748, over 21452.00 frames. ], tot_loss[loss=0.339, simple_loss=0.3847, pruned_loss=0.1467, over 4271032.95 frames. ], batch size: 131, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:34:23,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=154620.0, ans=0.125 2023-06-18 07:34:23,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=154620.0, ans=0.0 2023-06-18 07:34:32,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=154680.0, ans=0.125 2023-06-18 07:34:41,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.14 vs. 
limit=22.5 2023-06-18 07:34:47,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=154680.0, ans=0.0 2023-06-18 07:35:16,167 INFO [train.py:996] (0/4) Epoch 1, batch 25800, loss[loss=0.4096, simple_loss=0.4549, pruned_loss=0.1821, over 21379.00 frames. ], tot_loss[loss=0.3519, simple_loss=0.3977, pruned_loss=0.153, over 4271076.62 frames. ], batch size: 131, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:35:23,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=154800.0, ans=0.5 2023-06-18 07:35:38,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=154860.0, ans=0.0 2023-06-18 07:35:54,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=154920.0, ans=0.0 2023-06-18 07:36:16,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=154980.0, ans=0.0 2023-06-18 07:36:28,469 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.698e+02 3.746e+02 4.301e+02 5.401e+02 1.441e+03, threshold=8.601e+02, percent-clipped=2.0 2023-06-18 07:36:42,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=155040.0, ans=0.07 2023-06-18 07:36:56,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=155100.0, ans=0.0 2023-06-18 07:36:57,513 INFO [train.py:996] (0/4) Epoch 1, batch 25850, loss[loss=0.306, simple_loss=0.3469, pruned_loss=0.1326, over 21834.00 frames. ], tot_loss[loss=0.3531, simple_loss=0.4007, pruned_loss=0.1528, over 4270323.67 frames. ], batch size: 247, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:37:14,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=155100.0, ans=0.0 2023-06-18 07:37:26,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=155160.0, ans=0.125 2023-06-18 07:37:41,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=155220.0, ans=0.125 2023-06-18 07:38:46,030 INFO [train.py:996] (0/4) Epoch 1, batch 25900, loss[loss=0.4021, simple_loss=0.4242, pruned_loss=0.19, over 20030.00 frames. ], tot_loss[loss=0.3536, simple_loss=0.4014, pruned_loss=0.1529, over 4274522.91 frames. ], batch size: 702, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 07:39:50,772 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.71 vs. limit=22.5 2023-06-18 07:39:58,794 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.678e+02 3.705e+02 4.419e+02 5.739e+02 1.257e+03, threshold=8.839e+02, percent-clipped=5.0 2023-06-18 07:40:03,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2023-06-18 07:40:28,229 INFO [train.py:996] (0/4) Epoch 1, batch 25950, loss[loss=0.3663, simple_loss=0.4085, pruned_loss=0.1621, over 21616.00 frames. ], tot_loss[loss=0.3583, simple_loss=0.4063, pruned_loss=0.1551, over 4271570.52 frames. 
], batch size: 263, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 07:40:38,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=155700.0, ans=0.2 2023-06-18 07:40:39,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=155700.0, ans=0.125 2023-06-18 07:41:20,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=155820.0, ans=0.125 2023-06-18 07:41:31,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=155880.0, ans=0.0 2023-06-18 07:41:33,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=155880.0, ans=0.0 2023-06-18 07:41:53,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=155940.0, ans=0.125 2023-06-18 07:42:10,855 INFO [train.py:996] (0/4) Epoch 1, batch 26000, loss[loss=0.4599, simple_loss=0.4861, pruned_loss=0.2169, over 21409.00 frames. ], tot_loss[loss=0.3574, simple_loss=0.4067, pruned_loss=0.154, over 4273262.27 frames. ], batch size: 509, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 07:42:26,780 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.70 vs. limit=15.0 2023-06-18 07:43:27,136 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.511e+02 4.125e+02 5.678e+02 8.372e+02, threshold=8.249e+02, percent-clipped=0.0 2023-06-18 07:43:38,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=156240.0, ans=0.125 2023-06-18 07:43:47,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=156240.0, ans=0.125 2023-06-18 07:43:48,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=156240.0, ans=0.125 2023-06-18 07:43:51,453 INFO [train.py:996] (0/4) Epoch 1, batch 26050, loss[loss=0.2949, simple_loss=0.4227, pruned_loss=0.08358, over 19908.00 frames. ], tot_loss[loss=0.3585, simple_loss=0.4065, pruned_loss=0.1552, over 4278421.15 frames. ], batch size: 702, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 07:43:52,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=156300.0, ans=0.2 2023-06-18 07:43:56,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=156300.0, ans=0.125 2023-06-18 07:44:36,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156360.0, ans=0.1 2023-06-18 07:44:53,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=156480.0, ans=0.125 2023-06-18 07:45:26,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=156540.0, ans=0.1 2023-06-18 07:45:31,176 INFO [train.py:996] (0/4) Epoch 1, batch 26100, loss[loss=0.2952, simple_loss=0.3389, pruned_loss=0.1258, over 21890.00 frames. ], tot_loss[loss=0.355, simple_loss=0.4016, pruned_loss=0.1542, over 4280829.62 frames. 
], batch size: 298, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 07:46:25,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=156720.0, ans=0.125 2023-06-18 07:46:48,618 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.762e+02 4.665e+02 5.349e+02 1.153e+03, threshold=9.330e+02, percent-clipped=6.0 2023-06-18 07:47:03,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=156840.0, ans=0.125 2023-06-18 07:47:12,644 INFO [train.py:996] (0/4) Epoch 1, batch 26150, loss[loss=0.4008, simple_loss=0.4308, pruned_loss=0.1854, over 21314.00 frames. ], tot_loss[loss=0.3534, simple_loss=0.3982, pruned_loss=0.1543, over 4290141.00 frames. ], batch size: 159, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 07:48:25,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=157080.0, ans=0.04949747468305833 2023-06-18 07:48:30,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157080.0, ans=0.1 2023-06-18 07:48:56,366 INFO [train.py:996] (0/4) Epoch 1, batch 26200, loss[loss=0.2972, simple_loss=0.3785, pruned_loss=0.1079, over 21449.00 frames. ], tot_loss[loss=0.3496, simple_loss=0.3969, pruned_loss=0.1511, over 4287498.99 frames. ], batch size: 211, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 07:49:45,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=157320.0, ans=0.125 2023-06-18 07:50:09,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.383e+02 4.279e+02 5.483e+02 1.348e+03, threshold=8.558e+02, percent-clipped=4.0 2023-06-18 07:50:33,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=157440.0, ans=0.2 2023-06-18 07:50:35,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=157440.0, ans=0.125 2023-06-18 07:50:50,281 INFO [train.py:996] (0/4) Epoch 1, batch 26250, loss[loss=0.3477, simple_loss=0.4035, pruned_loss=0.146, over 21163.00 frames. ], tot_loss[loss=0.3486, simple_loss=0.3999, pruned_loss=0.1487, over 4284109.52 frames. ], batch size: 608, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 07:51:03,368 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:51:11,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=157560.0, ans=0.125 2023-06-18 07:51:28,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.14 vs. 
limit=10.0 2023-06-18 07:51:40,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157620.0, ans=0.1 2023-06-18 07:51:45,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=157680.0, ans=0.125 2023-06-18 07:51:56,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=157680.0, ans=0.0 2023-06-18 07:52:07,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=157740.0, ans=0.0 2023-06-18 07:52:11,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=157740.0, ans=0.125 2023-06-18 07:52:21,311 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-18 07:52:31,382 INFO [train.py:996] (0/4) Epoch 1, batch 26300, loss[loss=0.3298, simple_loss=0.3703, pruned_loss=0.1446, over 21775.00 frames. ], tot_loss[loss=0.3482, simple_loss=0.3964, pruned_loss=0.15, over 4290547.70 frames. ], batch size: 112, lr: 2.36e-02, grad_scale: 64.0 2023-06-18 07:53:06,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=157860.0, ans=0.0 2023-06-18 07:53:07,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=157920.0, ans=0.0 2023-06-18 07:53:38,870 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.479e+02 3.639e+02 4.284e+02 5.347e+02 9.355e+02, threshold=8.568e+02, percent-clipped=1.0 2023-06-18 07:54:13,147 INFO [train.py:996] (0/4) Epoch 1, batch 26350, loss[loss=0.364, simple_loss=0.3963, pruned_loss=0.1659, over 20030.00 frames. ], tot_loss[loss=0.3468, simple_loss=0.3934, pruned_loss=0.1501, over 4281354.20 frames. ], batch size: 702, lr: 2.35e-02, grad_scale: 64.0 2023-06-18 07:54:18,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=158100.0, ans=0.125 2023-06-18 07:54:38,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=158160.0, ans=0.025 2023-06-18 07:54:40,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=158160.0, ans=0.025 2023-06-18 07:54:52,574 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-18 07:55:55,505 INFO [train.py:996] (0/4) Epoch 1, batch 26400, loss[loss=0.3351, simple_loss=0.3687, pruned_loss=0.1507, over 21775.00 frames. ], tot_loss[loss=0.3435, simple_loss=0.3877, pruned_loss=0.1497, over 4282169.14 frames. ], batch size: 98, lr: 2.35e-02, grad_scale: 64.0 2023-06-18 07:55:57,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=158400.0, ans=0.125 2023-06-18 07:56:25,001 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.93 vs. 
limit=22.5 2023-06-18 07:56:26,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=158460.0, ans=0.125 2023-06-18 07:56:37,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=158520.0, ans=0.0 2023-06-18 07:56:41,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=158520.0, ans=0.125 2023-06-18 07:57:03,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=158580.0, ans=0.1 2023-06-18 07:57:16,077 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.384e+02 3.627e+02 4.358e+02 5.298e+02 1.261e+03, threshold=8.716e+02, percent-clipped=4.0 2023-06-18 07:57:20,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=158580.0, ans=0.1 2023-06-18 07:57:44,280 INFO [train.py:996] (0/4) Epoch 1, batch 26450, loss[loss=0.3977, simple_loss=0.4637, pruned_loss=0.1658, over 21865.00 frames. ], tot_loss[loss=0.3441, simple_loss=0.3884, pruned_loss=0.15, over 4277394.64 frames. ], batch size: 372, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 07:58:08,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=158760.0, ans=0.0 2023-06-18 07:58:09,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=158760.0, ans=0.125 2023-06-18 07:59:02,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-18 07:59:11,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=158940.0, ans=0.035 2023-06-18 07:59:28,078 INFO [train.py:996] (0/4) Epoch 1, batch 26500, loss[loss=0.3136, simple_loss=0.3813, pruned_loss=0.123, over 21735.00 frames. ], tot_loss[loss=0.3422, simple_loss=0.3889, pruned_loss=0.1478, over 4276543.07 frames. ], batch size: 351, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 08:00:03,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=159060.0, ans=0.0 2023-06-18 08:00:23,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=159120.0, ans=0.0 2023-06-18 08:00:45,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-18 08:00:49,646 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 3.803e+02 4.749e+02 6.034e+02 1.314e+03, threshold=9.498e+02, percent-clipped=6.0 2023-06-18 08:00:50,794 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. 
limit=10.0 2023-06-18 08:00:55,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=159240.0, ans=0.0 2023-06-18 08:01:00,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=159240.0, ans=0.1 2023-06-18 08:01:10,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=159240.0, ans=0.0 2023-06-18 08:01:13,502 INFO [train.py:996] (0/4) Epoch 1, batch 26550, loss[loss=0.2647, simple_loss=0.3497, pruned_loss=0.08984, over 21742.00 frames. ], tot_loss[loss=0.3348, simple_loss=0.3843, pruned_loss=0.1427, over 4268468.79 frames. ], batch size: 332, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 08:02:43,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=159540.0, ans=0.125 2023-06-18 08:02:45,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=159540.0, ans=0.0 2023-06-18 08:02:50,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=159540.0, ans=0.125 2023-06-18 08:02:59,967 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.15 vs. limit=15.0 2023-06-18 08:03:00,188 INFO [train.py:996] (0/4) Epoch 1, batch 26600, loss[loss=0.3133, simple_loss=0.3583, pruned_loss=0.1341, over 21192.00 frames. ], tot_loss[loss=0.3307, simple_loss=0.3831, pruned_loss=0.1391, over 4263724.13 frames. ], batch size: 176, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:03:26,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=159660.0, ans=0.0 2023-06-18 08:04:08,046 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.521e+02 4.224e+02 5.242e+02 1.118e+03, threshold=8.449e+02, percent-clipped=1.0 2023-06-18 08:04:21,324 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:04:36,035 INFO [train.py:996] (0/4) Epoch 1, batch 26650, loss[loss=0.3255, simple_loss=0.3545, pruned_loss=0.1483, over 21345.00 frames. ], tot_loss[loss=0.328, simple_loss=0.3774, pruned_loss=0.1392, over 4257386.56 frames. ], batch size: 194, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:04:46,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=159900.0, ans=0.04949747468305833 2023-06-18 08:05:32,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-18 08:06:15,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=160200.0, ans=0.125 2023-06-18 08:06:16,338 INFO [train.py:996] (0/4) Epoch 1, batch 26700, loss[loss=0.3823, simple_loss=0.4098, pruned_loss=0.1773, over 21803.00 frames. ], tot_loss[loss=0.3191, simple_loss=0.3688, pruned_loss=0.1347, over 4262835.54 frames. 
], batch size: 441, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:07:02,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=160320.0, ans=0.125 2023-06-18 08:07:23,785 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.935e+02 3.510e+02 4.681e+02 9.206e+02, threshold=7.020e+02, percent-clipped=3.0 2023-06-18 08:07:36,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=160440.0, ans=0.1 2023-06-18 08:07:47,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=160440.0, ans=0.1 2023-06-18 08:07:57,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=160440.0, ans=0.125 2023-06-18 08:08:03,393 INFO [train.py:996] (0/4) Epoch 1, batch 26750, loss[loss=0.326, simple_loss=0.3807, pruned_loss=0.1357, over 21720.00 frames. ], tot_loss[loss=0.3161, simple_loss=0.3677, pruned_loss=0.1323, over 4269248.50 frames. ], batch size: 332, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:08:24,646 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.12 vs. limit=15.0 2023-06-18 08:09:52,196 INFO [train.py:996] (0/4) Epoch 1, batch 26800, loss[loss=0.2861, simple_loss=0.342, pruned_loss=0.1151, over 21923.00 frames. ], tot_loss[loss=0.3275, simple_loss=0.3776, pruned_loss=0.1387, over 4262985.86 frames. ], batch size: 98, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:09:52,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=160800.0, ans=0.035 2023-06-18 08:09:59,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=160800.0, ans=0.125 2023-06-18 08:10:08,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=160860.0, ans=0.125 2023-06-18 08:10:10,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=160860.0, ans=0.2 2023-06-18 08:10:14,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=160860.0, ans=0.025 2023-06-18 08:10:33,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=160920.0, ans=0.125 2023-06-18 08:11:01,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.593e+02 3.549e+02 4.364e+02 5.200e+02 1.402e+03, threshold=8.728e+02, percent-clipped=9.0 2023-06-18 08:11:11,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=161040.0, ans=0.2 2023-06-18 08:11:23,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=161040.0, ans=0.125 2023-06-18 08:11:27,772 INFO [train.py:996] (0/4) Epoch 1, batch 26850, loss[loss=0.3213, simple_loss=0.3487, pruned_loss=0.147, over 21700.00 frames. ], tot_loss[loss=0.3358, simple_loss=0.382, pruned_loss=0.1448, over 4263438.48 frames. 
], batch size: 333, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:12:04,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=161220.0, ans=0.125 2023-06-18 08:12:26,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=161280.0, ans=0.2 2023-06-18 08:12:58,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-18 08:13:02,132 INFO [train.py:996] (0/4) Epoch 1, batch 26900, loss[loss=0.2908, simple_loss=0.3298, pruned_loss=0.1259, over 21656.00 frames. ], tot_loss[loss=0.3279, simple_loss=0.3718, pruned_loss=0.142, over 4256977.32 frames. ], batch size: 333, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:13:11,534 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:13:32,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=161520.0, ans=0.95 2023-06-18 08:14:06,398 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.335e+02 3.385e+02 4.142e+02 4.911e+02 9.199e+02, threshold=8.284e+02, percent-clipped=1.0 2023-06-18 08:14:37,084 INFO [train.py:996] (0/4) Epoch 1, batch 26950, loss[loss=0.3311, simple_loss=0.3991, pruned_loss=0.1315, over 21647.00 frames. ], tot_loss[loss=0.3271, simple_loss=0.3708, pruned_loss=0.1417, over 4254753.88 frames. ], batch size: 263, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:14:50,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=161700.0, ans=0.07 2023-06-18 08:15:11,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=161820.0, ans=0.125 2023-06-18 08:15:39,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=161880.0, ans=0.0 2023-06-18 08:16:09,742 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-18 08:16:13,290 INFO [train.py:996] (0/4) Epoch 1, batch 27000, loss[loss=0.2851, simple_loss=0.3583, pruned_loss=0.106, over 21654.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3702, pruned_loss=0.1382, over 4250479.76 frames. ], batch size: 298, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:16:13,291 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 08:16:29,106 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.2828, simple_loss=0.3784, pruned_loss=0.09358, over 1796401.00 frames. 
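The [optim.py:471] records above print Clipping_scale=2.0, five grad-norm summary values, a clipping threshold, and a percent-clipped figure. Throughout this stretch of the log the threshold equals 2.0 times the middle summary value, i.e. the clipping scale times the median of the recent gradient norms (for example 2.0 * 4.358e+02 = 8.716e+02 in one of the reports above). A minimal sketch of that bookkeeping, assuming the five values are min/25%/median/75%/max over a recent window and using hypothetical names rather than the optimizer's actual code:

import numpy as np

def grad_norm_clipping_report(recent_norms, clipping_scale=2.0):
    """Hypothetical sketch of the 'grad-norm quartiles ... threshold ...
    percent-clipped' bookkeeping; not the optimizer's real implementation."""
    norms = np.asarray(recent_norms, dtype=float)
    # Five summary points, assumed here to be min, 25%, median, 75%, max.
    quartiles = np.percentile(norms, [0, 25, 50, 75, 100])
    # Relationship observed in this log: threshold = clipping_scale * median.
    threshold = clipping_scale * quartiles[2]
    # Share of recent batches whose gradient norm exceeded that threshold.
    percent_clipped = 100.0 * float(np.mean(norms > threshold))
    return quartiles, threshold, percent_clipped

# Example with made-up norms:
# quartiles, threshold, pct = grad_norm_clipping_report(
#     np.random.lognormal(mean=6.0, sigma=0.4, size=200))

Only the scale-times-median relationship is taken from the values printed in this log; the window size, the exact percentiles, and how percent-clipped is accumulated between reports are assumptions of this sketch.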
2023-06-18 08:16:29,107 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 08:16:42,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=162000.0, ans=0.125 2023-06-18 08:16:45,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=162060.0, ans=0.1 2023-06-18 08:16:46,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=162060.0, ans=0.025 2023-06-18 08:16:48,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=162060.0, ans=0.125 2023-06-18 08:17:06,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=162060.0, ans=0.2 2023-06-18 08:17:25,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=162180.0, ans=6.0 2023-06-18 08:17:39,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0 2023-06-18 08:17:39,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.208e+02 3.737e+02 4.814e+02 7.556e+02, threshold=7.473e+02, percent-clipped=0.0 2023-06-18 08:17:51,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=162240.0, ans=0.0 2023-06-18 08:18:01,113 INFO [train.py:996] (0/4) Epoch 1, batch 27050, loss[loss=0.2999, simple_loss=0.3554, pruned_loss=0.1222, over 21813.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3701, pruned_loss=0.1328, over 4253422.16 frames. ], batch size: 247, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:18:59,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=162420.0, ans=0.0 2023-06-18 08:19:37,661 INFO [train.py:996] (0/4) Epoch 1, batch 27100, loss[loss=0.2813, simple_loss=0.3703, pruned_loss=0.09616, over 21826.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3741, pruned_loss=0.137, over 4263015.75 frames. ], batch size: 282, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:19:55,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=162600.0, ans=0.0 2023-06-18 08:20:31,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=162720.0, ans=0.0 2023-06-18 08:20:52,392 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.734e+02 4.835e+02 6.632e+02 1.398e+03, threshold=9.671e+02, percent-clipped=18.0 2023-06-18 08:21:14,200 INFO [train.py:996] (0/4) Epoch 1, batch 27150, loss[loss=0.3758, simple_loss=0.4394, pruned_loss=0.1561, over 21821.00 frames. ], tot_loss[loss=0.3363, simple_loss=0.3879, pruned_loss=0.1424, over 4266550.22 frames. 
], batch size: 371, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:22:10,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=163020.0, ans=0.05 2023-06-18 08:22:26,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=163080.0, ans=0.035 2023-06-18 08:22:33,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=163140.0, ans=0.125 2023-06-18 08:22:55,444 INFO [train.py:996] (0/4) Epoch 1, batch 27200, loss[loss=0.3514, simple_loss=0.3934, pruned_loss=0.1547, over 21359.00 frames. ], tot_loss[loss=0.3407, simple_loss=0.3941, pruned_loss=0.1437, over 4271956.95 frames. ], batch size: 176, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:23:13,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=163200.0, ans=0.0 2023-06-18 08:23:18,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=163200.0, ans=0.0 2023-06-18 08:23:44,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=163320.0, ans=0.0 2023-06-18 08:24:03,409 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:24:05,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 3.595e+02 4.705e+02 6.129e+02 1.080e+03, threshold=9.409e+02, percent-clipped=7.0 2023-06-18 08:24:39,410 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:24:41,937 INFO [train.py:996] (0/4) Epoch 1, batch 27250, loss[loss=0.3595, simple_loss=0.3928, pruned_loss=0.1631, over 20605.00 frames. ], tot_loss[loss=0.3494, simple_loss=0.3992, pruned_loss=0.1498, over 4271932.47 frames. ], batch size: 607, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:24:50,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=163500.0, ans=0.125 2023-06-18 08:24:52,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=163500.0, ans=0.125 2023-06-18 08:24:55,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=163500.0, ans=0.125 2023-06-18 08:25:31,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-18 08:26:19,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-18 08:26:20,977 INFO [train.py:996] (0/4) Epoch 1, batch 27300, loss[loss=0.3982, simple_loss=0.4446, pruned_loss=0.1759, over 21731.00 frames. ], tot_loss[loss=0.3526, simple_loss=0.4018, pruned_loss=0.1517, over 4275847.30 frames. 
], batch size: 441, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:26:45,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=163860.0, ans=0.1 2023-06-18 08:27:01,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=163920.0, ans=0.125 2023-06-18 08:27:36,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.630e+02 3.615e+02 4.138e+02 5.244e+02 1.044e+03, threshold=8.277e+02, percent-clipped=1.0 2023-06-18 08:28:02,483 INFO [train.py:996] (0/4) Epoch 1, batch 27350, loss[loss=0.419, simple_loss=0.5169, pruned_loss=0.1606, over 19827.00 frames. ], tot_loss[loss=0.3581, simple_loss=0.4067, pruned_loss=0.1547, over 4266512.88 frames. ], batch size: 703, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:28:09,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-18 08:28:18,525 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-18 08:28:48,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=164220.0, ans=0.07 2023-06-18 08:28:54,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=164220.0, ans=0.125 2023-06-18 08:28:55,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=164220.0, ans=0.125 2023-06-18 08:28:56,746 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.51 vs. limit=8.0 2023-06-18 08:29:15,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=164280.0, ans=0.0 2023-06-18 08:29:33,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=164340.0, ans=0.125 2023-06-18 08:29:37,993 INFO [train.py:996] (0/4) Epoch 1, batch 27400, loss[loss=0.2976, simple_loss=0.3395, pruned_loss=0.1278, over 21260.00 frames. ], tot_loss[loss=0.3526, simple_loss=0.4001, pruned_loss=0.1525, over 4271764.19 frames. ], batch size: 176, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:29:52,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=164460.0, ans=0.125 2023-06-18 08:29:58,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=164460.0, ans=0.0 2023-06-18 08:30:36,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=164520.0, ans=0.125 2023-06-18 08:30:47,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.591e+02 4.552e+02 5.428e+02 9.216e+02, threshold=9.104e+02, percent-clipped=2.0 2023-06-18 08:31:13,688 INFO [train.py:996] (0/4) Epoch 1, batch 27450, loss[loss=0.3267, simple_loss=0.3878, pruned_loss=0.1328, over 21301.00 frames. ], tot_loss[loss=0.3457, simple_loss=0.3926, pruned_loss=0.1494, over 4267021.55 frames. 
], batch size: 548, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:32:11,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=164820.0, ans=0.04949747468305833 2023-06-18 08:32:14,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=164880.0, ans=0.125 2023-06-18 08:32:22,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0 2023-06-18 08:32:49,997 INFO [train.py:996] (0/4) Epoch 1, batch 27500, loss[loss=0.3075, simple_loss=0.3514, pruned_loss=0.1318, over 21865.00 frames. ], tot_loss[loss=0.3472, simple_loss=0.3924, pruned_loss=0.151, over 4270230.14 frames. ], batch size: 282, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:33:04,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=165060.0, ans=0.125 2023-06-18 08:33:29,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=165120.0, ans=0.125 2023-06-18 08:33:37,744 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-18 08:33:49,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=165180.0, ans=0.0 2023-06-18 08:33:52,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=165180.0, ans=0.2 2023-06-18 08:34:03,946 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 3.309e+02 3.875e+02 5.024e+02 1.518e+03, threshold=7.749e+02, percent-clipped=3.0 2023-06-18 08:34:04,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=165180.0, ans=0.05 2023-06-18 08:34:14,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=165240.0, ans=0.0 2023-06-18 08:34:23,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=165300.0, ans=0.125 2023-06-18 08:34:24,891 INFO [train.py:996] (0/4) Epoch 1, batch 27550, loss[loss=0.3046, simple_loss=0.3501, pruned_loss=0.1296, over 21819.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.3883, pruned_loss=0.1477, over 4274661.76 frames. ], batch size: 98, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:35:31,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=165480.0, ans=0.0 2023-06-18 08:35:46,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=165540.0, ans=0.2 2023-06-18 08:35:59,086 INFO [train.py:996] (0/4) Epoch 1, batch 27600, loss[loss=0.2907, simple_loss=0.3343, pruned_loss=0.1236, over 21748.00 frames. ], tot_loss[loss=0.335, simple_loss=0.3799, pruned_loss=0.1451, over 4280318.97 frames. 
], batch size: 351, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:36:00,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=165600.0, ans=0.125 2023-06-18 08:36:37,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=165720.0, ans=0.125 2023-06-18 08:36:52,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-18 08:36:56,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=165780.0, ans=0.125 2023-06-18 08:37:06,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.205e+02 4.118e+02 5.523e+02 1.130e+03, threshold=8.236e+02, percent-clipped=6.0 2023-06-18 08:37:31,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=165900.0, ans=0.2 2023-06-18 08:37:32,570 INFO [train.py:996] (0/4) Epoch 1, batch 27650, loss[loss=0.3087, simple_loss=0.3564, pruned_loss=0.1305, over 21365.00 frames. ], tot_loss[loss=0.3306, simple_loss=0.3733, pruned_loss=0.1439, over 4274275.55 frames. ], batch size: 144, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:37:49,029 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=2.537e-03 2023-06-18 08:37:51,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=165960.0, ans=0.1 2023-06-18 08:38:34,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=166080.0, ans=0.125 2023-06-18 08:38:36,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=166080.0, ans=0.0 2023-06-18 08:39:06,515 INFO [train.py:996] (0/4) Epoch 1, batch 27700, loss[loss=0.311, simple_loss=0.3804, pruned_loss=0.1208, over 20952.00 frames. ], tot_loss[loss=0.3254, simple_loss=0.3714, pruned_loss=0.1397, over 4270252.15 frames. ], batch size: 608, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:39:30,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=166260.0, ans=0.125 2023-06-18 08:39:41,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=166320.0, ans=0.2 2023-06-18 08:40:20,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.538e+02 3.610e+02 4.452e+02 5.999e+02 1.124e+03, threshold=8.903e+02, percent-clipped=7.0 2023-06-18 08:40:41,468 INFO [train.py:996] (0/4) Epoch 1, batch 27750, loss[loss=0.2499, simple_loss=0.3209, pruned_loss=0.08941, over 21403.00 frames. ], tot_loss[loss=0.325, simple_loss=0.3739, pruned_loss=0.1381, over 4272569.97 frames. 
], batch size: 211, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:41:00,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=166560.0, ans=0.125 2023-06-18 08:41:01,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=166560.0, ans=0.125 2023-06-18 08:41:37,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=166620.0, ans=0.125 2023-06-18 08:41:37,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=166620.0, ans=0.125 2023-06-18 08:42:01,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=166740.0, ans=0.04949747468305833 2023-06-18 08:42:11,892 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:42:16,151 INFO [train.py:996] (0/4) Epoch 1, batch 27800, loss[loss=0.3041, simple_loss=0.3451, pruned_loss=0.1316, over 21539.00 frames. ], tot_loss[loss=0.3223, simple_loss=0.3703, pruned_loss=0.1372, over 4279990.63 frames. ], batch size: 212, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:42:33,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=166860.0, ans=0.125 2023-06-18 08:42:40,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166860.0, ans=0.1 2023-06-18 08:43:20,177 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 3.396e+02 4.185e+02 5.590e+02 8.815e+02, threshold=8.371e+02, percent-clipped=0.0 2023-06-18 08:43:37,017 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-18 08:43:46,602 INFO [train.py:996] (0/4) Epoch 1, batch 27850, loss[loss=0.3346, simple_loss=0.3662, pruned_loss=0.1515, over 21568.00 frames. ], tot_loss[loss=0.3224, simple_loss=0.3688, pruned_loss=0.138, over 4289058.47 frames. ], batch size: 548, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:43:47,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=167100.0, ans=0.2 2023-06-18 08:43:57,373 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0 2023-06-18 08:45:01,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=167340.0, ans=0.1 2023-06-18 08:45:15,243 INFO [train.py:996] (0/4) Epoch 1, batch 27900, loss[loss=0.3536, simple_loss=0.4104, pruned_loss=0.1484, over 19921.00 frames. ], tot_loss[loss=0.33, simple_loss=0.3791, pruned_loss=0.1404, over 4281241.77 frames. ], batch size: 702, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:45:16,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.90 vs. 
limit=10.0 2023-06-18 08:45:41,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=167460.0, ans=0.125 2023-06-18 08:45:44,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=167460.0, ans=0.2 2023-06-18 08:46:18,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-18 08:46:26,450 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 3.916e+02 4.626e+02 5.836e+02 1.013e+03, threshold=9.252e+02, percent-clipped=5.0 2023-06-18 08:46:58,173 INFO [train.py:996] (0/4) Epoch 1, batch 27950, loss[loss=0.3545, simple_loss=0.4091, pruned_loss=0.15, over 21717.00 frames. ], tot_loss[loss=0.3263, simple_loss=0.3804, pruned_loss=0.1361, over 4277557.51 frames. ], batch size: 351, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:47:08,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=167700.0, ans=0.1 2023-06-18 08:47:10,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=167700.0, ans=0.2 2023-06-18 08:47:39,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=167820.0, ans=0.125 2023-06-18 08:48:09,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=167940.0, ans=0.2 2023-06-18 08:48:25,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=167940.0, ans=0.2 2023-06-18 08:48:33,263 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-28000.pt 2023-06-18 08:48:35,921 INFO [train.py:996] (0/4) Epoch 1, batch 28000, loss[loss=0.3106, simple_loss=0.3508, pruned_loss=0.1352, over 21822.00 frames. ], tot_loss[loss=0.324, simple_loss=0.3784, pruned_loss=0.1348, over 4277977.70 frames. ], batch size: 247, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:49:35,512 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 3.441e+02 4.640e+02 5.582e+02 1.043e+03, threshold=9.281e+02, percent-clipped=2.0 2023-06-18 08:49:36,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2023-06-18 08:49:40,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=168240.0, ans=0.1 2023-06-18 08:50:07,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=168240.0, ans=0.125 2023-06-18 08:50:11,362 INFO [train.py:996] (0/4) Epoch 1, batch 28050, loss[loss=0.3369, simple_loss=0.3775, pruned_loss=0.1481, over 21187.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3763, pruned_loss=0.1371, over 4281936.11 frames. 
], batch size: 607, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:50:33,362 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:50:50,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=168420.0, ans=0.2 2023-06-18 08:50:53,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=168420.0, ans=0.1 2023-06-18 08:50:57,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=168420.0, ans=0.125 2023-06-18 08:51:08,925 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-18 08:51:19,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.89 vs. limit=22.5 2023-06-18 08:51:41,518 INFO [train.py:996] (0/4) Epoch 1, batch 28100, loss[loss=0.298, simple_loss=0.3351, pruned_loss=0.1305, over 21483.00 frames. ], tot_loss[loss=0.325, simple_loss=0.3755, pruned_loss=0.1373, over 4275957.95 frames. ], batch size: 195, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:52:03,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=168660.0, ans=0.05 2023-06-18 08:52:51,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.651e+02 4.541e+02 5.753e+02 9.912e+02, threshold=9.083e+02, percent-clipped=1.0 2023-06-18 08:52:53,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-18 08:53:01,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=168840.0, ans=0.05 2023-06-18 08:53:03,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.55 vs. limit=10.0 2023-06-18 08:53:08,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=168840.0, ans=0.125 2023-06-18 08:53:16,356 INFO [train.py:996] (0/4) Epoch 1, batch 28150, loss[loss=0.2867, simple_loss=0.3202, pruned_loss=0.1266, over 21537.00 frames. ], tot_loss[loss=0.3219, simple_loss=0.3688, pruned_loss=0.1375, over 4279930.59 frames. ], batch size: 263, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:53:24,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=168900.0, ans=0.0 2023-06-18 08:53:45,769 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. 
limit=10.0 2023-06-18 08:54:02,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=169020.0, ans=0.125 2023-06-18 08:54:33,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=169140.0, ans=0.125 2023-06-18 08:54:36,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=169140.0, ans=0.125 2023-06-18 08:54:50,994 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=22.5 2023-06-18 08:54:53,288 INFO [train.py:996] (0/4) Epoch 1, batch 28200, loss[loss=0.3461, simple_loss=0.3799, pruned_loss=0.1561, over 20697.00 frames. ], tot_loss[loss=0.3237, simple_loss=0.3676, pruned_loss=0.1399, over 4269843.65 frames. ], batch size: 607, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:54:56,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=169200.0, ans=0.125 2023-06-18 08:55:21,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=169260.0, ans=0.125 2023-06-18 08:56:04,069 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.823e+02 5.073e+02 6.497e+02 1.031e+03, threshold=1.015e+03, percent-clipped=3.0 2023-06-18 08:56:28,477 INFO [train.py:996] (0/4) Epoch 1, batch 28250, loss[loss=0.3389, simple_loss=0.3795, pruned_loss=0.1491, over 21606.00 frames. ], tot_loss[loss=0.3302, simple_loss=0.3721, pruned_loss=0.1441, over 4263114.33 frames. ], batch size: 263, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:56:56,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=169560.0, ans=0.2 2023-06-18 08:57:04,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=169620.0, ans=0.125 2023-06-18 08:57:14,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=169620.0, ans=0.125 2023-06-18 08:57:15,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=169620.0, ans=0.0 2023-06-18 08:57:25,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=169680.0, ans=0.0 2023-06-18 08:57:56,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=169740.0, ans=0.125 2023-06-18 08:58:00,558 INFO [train.py:996] (0/4) Epoch 1, batch 28300, loss[loss=0.2611, simple_loss=0.3589, pruned_loss=0.08167, over 20771.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3692, pruned_loss=0.1401, over 4264486.08 frames. 
], batch size: 608, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:58:02,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=169800.0, ans=0.2 2023-06-18 08:58:37,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=169920.0, ans=0.035 2023-06-18 08:59:00,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=169980.0, ans=0.2 2023-06-18 08:59:10,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=169980.0, ans=0.125 2023-06-18 08:59:11,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 3.420e+02 4.323e+02 5.538e+02 1.121e+03, threshold=8.647e+02, percent-clipped=1.0 2023-06-18 08:59:22,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=170040.0, ans=0.0 2023-06-18 08:59:31,051 INFO [train.py:996] (0/4) Epoch 1, batch 28350, loss[loss=0.2893, simple_loss=0.3233, pruned_loss=0.1276, over 21850.00 frames. ], tot_loss[loss=0.3155, simple_loss=0.366, pruned_loss=0.1325, over 4254092.41 frames. ], batch size: 107, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:59:32,359 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-18 09:00:54,372 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-18 09:00:58,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=170340.0, ans=0.125 2023-06-18 09:01:02,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=170340.0, ans=0.125 2023-06-18 09:01:06,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=170340.0, ans=0.04949747468305833 2023-06-18 09:01:09,032 INFO [train.py:996] (0/4) Epoch 1, batch 28400, loss[loss=0.3328, simple_loss=0.356, pruned_loss=0.1549, over 21238.00 frames. ], tot_loss[loss=0.3127, simple_loss=0.3616, pruned_loss=0.1319, over 4248903.93 frames. ], batch size: 471, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:01:12,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=170400.0, ans=0.2 2023-06-18 09:01:18,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=170400.0, ans=0.2 2023-06-18 09:02:20,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=170580.0, ans=0.0 2023-06-18 09:02:24,607 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 3.627e+02 4.521e+02 5.478e+02 1.024e+03, threshold=9.042e+02, percent-clipped=4.0 2023-06-18 09:02:25,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.32 vs. 
limit=15.0 2023-06-18 09:02:41,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=170640.0, ans=0.015 2023-06-18 09:02:44,593 INFO [train.py:996] (0/4) Epoch 1, batch 28450, loss[loss=0.3188, simple_loss=0.3561, pruned_loss=0.1407, over 21420.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3695, pruned_loss=0.1383, over 4253931.54 frames. ], batch size: 211, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:02:45,788 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.02 vs. limit=22.5 2023-06-18 09:02:48,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=170700.0, ans=0.0 2023-06-18 09:02:51,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=170700.0, ans=0.125 2023-06-18 09:02:53,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=170700.0, ans=0.0 2023-06-18 09:03:30,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=170820.0, ans=0.2 2023-06-18 09:03:33,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=170820.0, ans=0.125 2023-06-18 09:04:01,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=170880.0, ans=0.125 2023-06-18 09:04:20,715 INFO [train.py:996] (0/4) Epoch 1, batch 28500, loss[loss=0.3619, simple_loss=0.4065, pruned_loss=0.1586, over 21327.00 frames. ], tot_loss[loss=0.3271, simple_loss=0.3721, pruned_loss=0.1411, over 4267290.62 frames. ], batch size: 159, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:04:46,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=171060.0, ans=0.125 2023-06-18 09:04:56,067 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-18 09:05:09,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=171120.0, ans=0.0 2023-06-18 09:05:15,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=171120.0, ans=0.125 2023-06-18 09:05:37,218 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 3.668e+02 4.799e+02 6.213e+02 1.260e+03, threshold=9.598e+02, percent-clipped=4.0 2023-06-18 09:05:42,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=171240.0, ans=0.125 2023-06-18 09:06:07,535 INFO [train.py:996] (0/4) Epoch 1, batch 28550, loss[loss=0.4051, simple_loss=0.4747, pruned_loss=0.1678, over 20728.00 frames. ], tot_loss[loss=0.3354, simple_loss=0.3812, pruned_loss=0.1448, over 4270099.05 frames. ], batch size: 607, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:07:47,169 INFO [train.py:996] (0/4) Epoch 1, batch 28600, loss[loss=0.4175, simple_loss=0.4377, pruned_loss=0.1986, over 21577.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.3884, pruned_loss=0.1476, over 4273807.99 frames. 
], batch size: 414, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:07:52,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=171600.0, ans=0.0 2023-06-18 09:08:23,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=171720.0, ans=0.125 2023-06-18 09:08:47,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.320e+02 4.208e+02 5.237e+02 8.981e+02, threshold=8.415e+02, percent-clipped=0.0 2023-06-18 09:09:22,054 INFO [train.py:996] (0/4) Epoch 1, batch 28650, loss[loss=0.2618, simple_loss=0.3082, pruned_loss=0.1077, over 21532.00 frames. ], tot_loss[loss=0.3378, simple_loss=0.3824, pruned_loss=0.1466, over 4265540.03 frames. ], batch size: 263, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:09:48,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=171960.0, ans=0.035 2023-06-18 09:10:24,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=172080.0, ans=0.05 2023-06-18 09:10:51,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=172140.0, ans=0.125 2023-06-18 09:10:58,010 INFO [train.py:996] (0/4) Epoch 1, batch 28700, loss[loss=0.3551, simple_loss=0.3992, pruned_loss=0.1555, over 21826.00 frames. ], tot_loss[loss=0.3395, simple_loss=0.3821, pruned_loss=0.1484, over 4269023.23 frames. ], batch size: 282, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:11:48,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=172380.0, ans=0.125 2023-06-18 09:11:51,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.69 vs. limit=10.0 2023-06-18 09:12:09,234 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 3.359e+02 4.224e+02 5.589e+02 9.530e+02, threshold=8.447e+02, percent-clipped=4.0 2023-06-18 09:12:22,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2023-06-18 09:12:38,149 INFO [train.py:996] (0/4) Epoch 1, batch 28750, loss[loss=0.3285, simple_loss=0.403, pruned_loss=0.127, over 20987.00 frames. ], tot_loss[loss=0.3395, simple_loss=0.3819, pruned_loss=0.1485, over 4276317.21 frames. ], batch size: 607, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:13:00,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.46 vs. limit=15.0 2023-06-18 09:13:03,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=172560.0, ans=0.0 2023-06-18 09:13:17,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.50 vs. limit=10.0 2023-06-18 09:14:11,458 INFO [train.py:996] (0/4) Epoch 1, batch 28800, loss[loss=0.3458, simple_loss=0.3935, pruned_loss=0.1491, over 21754.00 frames. ], tot_loss[loss=0.3425, simple_loss=0.3862, pruned_loss=0.1494, over 4280749.15 frames. 
], batch size: 298, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:14:20,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-18 09:14:23,432 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=15.0 2023-06-18 09:15:25,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.326e+02 4.027e+02 5.437e+02 1.151e+03, threshold=8.055e+02, percent-clipped=4.0 2023-06-18 09:15:33,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=173040.0, ans=0.125 2023-06-18 09:15:37,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=173040.0, ans=0.1 2023-06-18 09:15:49,279 INFO [train.py:996] (0/4) Epoch 1, batch 28850, loss[loss=0.3758, simple_loss=0.4394, pruned_loss=0.1561, over 19859.00 frames. ], tot_loss[loss=0.3451, simple_loss=0.3877, pruned_loss=0.1513, over 4283847.35 frames. ], batch size: 702, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:16:13,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=173160.0, ans=15.0 2023-06-18 09:16:58,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=173280.0, ans=0.0 2023-06-18 09:17:09,830 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-18 09:17:25,768 INFO [train.py:996] (0/4) Epoch 1, batch 28900, loss[loss=0.3689, simple_loss=0.4116, pruned_loss=0.1631, over 21872.00 frames. ], tot_loss[loss=0.3486, simple_loss=0.3907, pruned_loss=0.1533, over 4283147.42 frames. ], batch size: 371, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:17:26,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=173400.0, ans=0.125 2023-06-18 09:18:28,453 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.56 vs. limit=15.0 2023-06-18 09:18:38,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.639e+02 3.786e+02 4.531e+02 6.034e+02 1.219e+03, threshold=9.062e+02, percent-clipped=7.0 2023-06-18 09:18:58,999 INFO [train.py:996] (0/4) Epoch 1, batch 28950, loss[loss=0.4253, simple_loss=0.5134, pruned_loss=0.1686, over 19749.00 frames. ], tot_loss[loss=0.346, simple_loss=0.3905, pruned_loss=0.1508, over 4277372.22 frames. ], batch size: 702, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:19:07,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=173700.0, ans=0.0 2023-06-18 09:19:12,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=173700.0, ans=0.2 2023-06-18 09:19:47,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=173820.0, ans=0.0 2023-06-18 09:20:30,543 INFO [train.py:996] (0/4) Epoch 1, batch 29000, loss[loss=0.3272, simple_loss=0.3833, pruned_loss=0.1355, over 21327.00 frames. 
], tot_loss[loss=0.3469, simple_loss=0.3943, pruned_loss=0.1498, over 4276151.94 frames. ], batch size: 159, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:21:12,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=174060.0, ans=0.0 2023-06-18 09:21:44,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=174180.0, ans=0.04949747468305833 2023-06-18 09:21:45,740 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 3.309e+02 4.233e+02 5.463e+02 9.741e+02, threshold=8.465e+02, percent-clipped=3.0 2023-06-18 09:22:05,673 INFO [train.py:996] (0/4) Epoch 1, batch 29050, loss[loss=0.2866, simple_loss=0.3328, pruned_loss=0.1203, over 21821.00 frames. ], tot_loss[loss=0.3492, simple_loss=0.3941, pruned_loss=0.1521, over 4281373.43 frames. ], batch size: 247, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:22:51,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=174360.0, ans=0.125 2023-06-18 09:23:24,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=174540.0, ans=0.125 2023-06-18 09:23:40,549 INFO [train.py:996] (0/4) Epoch 1, batch 29100, loss[loss=0.2891, simple_loss=0.3324, pruned_loss=0.1229, over 21810.00 frames. ], tot_loss[loss=0.3402, simple_loss=0.3839, pruned_loss=0.1482, over 4274093.05 frames. ], batch size: 98, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:23:42,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=174600.0, ans=0.0 2023-06-18 09:23:54,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=15.0 2023-06-18 09:24:31,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=174720.0, ans=0.125 2023-06-18 09:24:48,541 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.440e+02 4.060e+02 5.417e+02 8.880e+02, threshold=8.120e+02, percent-clipped=2.0 2023-06-18 09:25:17,861 INFO [train.py:996] (0/4) Epoch 1, batch 29150, loss[loss=0.3356, simple_loss=0.3957, pruned_loss=0.1377, over 21529.00 frames. ], tot_loss[loss=0.3368, simple_loss=0.3828, pruned_loss=0.1453, over 4261433.61 frames. ], batch size: 389, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:25:44,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=174960.0, ans=0.125 2023-06-18 09:25:57,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=175020.0, ans=0.125 2023-06-18 09:26:04,285 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.20 vs. limit=6.0 2023-06-18 09:26:32,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=175140.0, ans=0.0 2023-06-18 09:26:48,393 INFO [train.py:996] (0/4) Epoch 1, batch 29200, loss[loss=0.269, simple_loss=0.3168, pruned_loss=0.1106, over 21379.00 frames. ], tot_loss[loss=0.3309, simple_loss=0.3759, pruned_loss=0.1429, over 4258123.86 frames. 
], batch size: 131, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:27:20,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=175260.0, ans=0.125 2023-06-18 09:27:37,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-18 09:27:50,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.297e+02 4.275e+02 5.517e+02 1.101e+03, threshold=8.550e+02, percent-clipped=8.0 2023-06-18 09:27:55,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=175440.0, ans=0.125 2023-06-18 09:28:25,611 INFO [train.py:996] (0/4) Epoch 1, batch 29250, loss[loss=0.3567, simple_loss=0.4179, pruned_loss=0.1478, over 21762.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3726, pruned_loss=0.139, over 4263723.78 frames. ], batch size: 352, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:28:37,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=175500.0, ans=0.125 2023-06-18 09:28:46,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=175560.0, ans=0.04949747468305833 2023-06-18 09:28:59,367 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.21 vs. limit=22.5 2023-06-18 09:29:01,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=175620.0, ans=0.09899494936611666 2023-06-18 09:29:55,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-18 09:30:05,161 INFO [train.py:996] (0/4) Epoch 1, batch 29300, loss[loss=0.3256, simple_loss=0.3764, pruned_loss=0.1374, over 21324.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.373, pruned_loss=0.137, over 4265439.43 frames. ], batch size: 176, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:30:10,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=175800.0, ans=0.2 2023-06-18 09:30:20,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=175860.0, ans=0.07 2023-06-18 09:30:29,346 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.29 vs. limit=15.0 2023-06-18 09:30:35,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.35 vs. limit=15.0 2023-06-18 09:30:53,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=175980.0, ans=0.125 2023-06-18 09:31:00,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.55 vs. 
limit=22.5 2023-06-18 09:31:07,347 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.203e+02 3.858e+02 5.248e+02 6.398e+02 1.119e+03, threshold=1.050e+03, percent-clipped=2.0 2023-06-18 09:31:36,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=176100.0, ans=0.0 2023-06-18 09:31:36,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-18 09:31:37,317 INFO [train.py:996] (0/4) Epoch 1, batch 29350, loss[loss=0.2821, simple_loss=0.3544, pruned_loss=0.1049, over 21618.00 frames. ], tot_loss[loss=0.3212, simple_loss=0.3695, pruned_loss=0.1364, over 4262976.29 frames. ], batch size: 263, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:31:39,435 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:33:05,658 INFO [train.py:996] (0/4) Epoch 1, batch 29400, loss[loss=0.3487, simple_loss=0.3966, pruned_loss=0.1504, over 21516.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3662, pruned_loss=0.1327, over 4251717.95 frames. ], batch size: 508, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:34:22,286 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 3.521e+02 4.187e+02 5.229e+02 1.148e+03, threshold=8.373e+02, percent-clipped=2.0 2023-06-18 09:34:32,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-18 09:34:42,251 INFO [train.py:996] (0/4) Epoch 1, batch 29450, loss[loss=0.3561, simple_loss=0.3937, pruned_loss=0.1593, over 21800.00 frames. ], tot_loss[loss=0.3153, simple_loss=0.3657, pruned_loss=0.1324, over 4252947.89 frames. ], batch size: 247, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:35:00,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=176760.0, ans=0.125 2023-06-18 09:35:13,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=176820.0, ans=0.0 2023-06-18 09:36:18,608 INFO [train.py:996] (0/4) Epoch 1, batch 29500, loss[loss=0.3044, simple_loss=0.3483, pruned_loss=0.1302, over 21487.00 frames. ], tot_loss[loss=0.3248, simple_loss=0.3732, pruned_loss=0.1382, over 4252470.48 frames. ], batch size: 211, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:36:24,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=177000.0, ans=0.0 2023-06-18 09:37:00,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=177120.0, ans=0.0 2023-06-18 09:37:10,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=177120.0, ans=0.125 2023-06-18 09:37:29,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.303e+02 3.922e+02 4.990e+02 9.245e+02, threshold=7.844e+02, percent-clipped=1.0 2023-06-18 09:37:54,390 INFO [train.py:996] (0/4) Epoch 1, batch 29550, loss[loss=0.3587, simple_loss=0.3894, pruned_loss=0.164, over 21895.00 frames. ], tot_loss[loss=0.3264, simple_loss=0.3719, pruned_loss=0.1404, over 4264131.89 frames. 
], batch size: 414, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:38:05,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=177300.0, ans=0.125 2023-06-18 09:38:09,696 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.34 vs. limit=10.0 2023-06-18 09:38:51,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=177420.0, ans=0.125 2023-06-18 09:39:30,930 INFO [train.py:996] (0/4) Epoch 1, batch 29600, loss[loss=0.3333, simple_loss=0.3945, pruned_loss=0.136, over 21739.00 frames. ], tot_loss[loss=0.3329, simple_loss=0.3784, pruned_loss=0.1437, over 4273586.48 frames. ], batch size: 247, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:39:31,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=177600.0, ans=0.125 2023-06-18 09:39:34,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=177600.0, ans=0.125 2023-06-18 09:39:46,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=177660.0, ans=0.1 2023-06-18 09:40:06,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=177720.0, ans=0.125 2023-06-18 09:40:23,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=177720.0, ans=0.0 2023-06-18 09:40:41,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 3.582e+02 4.234e+02 5.701e+02 1.045e+03, threshold=8.469e+02, percent-clipped=5.0 2023-06-18 09:41:01,410 INFO [train.py:996] (0/4) Epoch 1, batch 29650, loss[loss=0.3304, simple_loss=0.3772, pruned_loss=0.1418, over 21801.00 frames. ], tot_loss[loss=0.3275, simple_loss=0.377, pruned_loss=0.139, over 4275696.23 frames. ], batch size: 107, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:41:25,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=177960.0, ans=0.0 2023-06-18 09:41:53,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=178020.0, ans=0.125 2023-06-18 09:41:58,730 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.66 vs. limit=10.0 2023-06-18 09:42:37,430 INFO [train.py:996] (0/4) Epoch 1, batch 29700, loss[loss=0.3629, simple_loss=0.4565, pruned_loss=0.1347, over 21265.00 frames. ], tot_loss[loss=0.3301, simple_loss=0.3797, pruned_loss=0.1402, over 4278078.62 frames. ], batch size: 548, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:43:03,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=178260.0, ans=0.0 2023-06-18 09:43:25,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. 
limit=15.0 2023-06-18 09:43:34,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=178320.0, ans=0.125 2023-06-18 09:43:34,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=178320.0, ans=0.5 2023-06-18 09:43:51,720 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.547e+02 3.946e+02 4.847e+02 6.771e+02 1.201e+03, threshold=9.693e+02, percent-clipped=9.0 2023-06-18 09:43:59,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=178440.0, ans=0.0 2023-06-18 09:44:02,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=178440.0, ans=10.0 2023-06-18 09:44:05,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=178440.0, ans=0.1 2023-06-18 09:44:11,039 INFO [train.py:996] (0/4) Epoch 1, batch 29750, loss[loss=0.3387, simple_loss=0.414, pruned_loss=0.1317, over 21850.00 frames. ], tot_loss[loss=0.3303, simple_loss=0.3834, pruned_loss=0.1387, over 4277390.50 frames. ], batch size: 316, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:44:38,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=178560.0, ans=0.125 2023-06-18 09:45:31,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=178740.0, ans=15.0 2023-06-18 09:45:41,177 INFO [train.py:996] (0/4) Epoch 1, batch 29800, loss[loss=0.3423, simple_loss=0.3845, pruned_loss=0.1501, over 21797.00 frames. ], tot_loss[loss=0.3321, simple_loss=0.3851, pruned_loss=0.1396, over 4276742.03 frames. ], batch size: 282, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:45:41,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=178800.0, ans=0.125 2023-06-18 09:45:46,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=178800.0, ans=0.125 2023-06-18 09:46:57,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 3.287e+02 3.786e+02 4.602e+02 9.209e+02, threshold=7.572e+02, percent-clipped=0.0 2023-06-18 09:47:17,294 INFO [train.py:996] (0/4) Epoch 1, batch 29850, loss[loss=0.3152, simple_loss=0.3671, pruned_loss=0.1317, over 21702.00 frames. ], tot_loss[loss=0.3278, simple_loss=0.3815, pruned_loss=0.1371, over 4276405.49 frames. ], batch size: 389, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:47:22,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=179100.0, ans=0.07 2023-06-18 09:48:04,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=179220.0, ans=0.0 2023-06-18 09:48:21,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=15.0 2023-06-18 09:48:52,474 INFO [train.py:996] (0/4) Epoch 1, batch 29900, loss[loss=0.3325, simple_loss=0.3643, pruned_loss=0.1504, over 21825.00 frames. ], tot_loss[loss=0.3282, simple_loss=0.3788, pruned_loss=0.1388, over 4279044.78 frames. 
], batch size: 298, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:49:31,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=179460.0, ans=0.0 2023-06-18 09:49:32,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=179460.0, ans=0.2 2023-06-18 09:49:34,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=179460.0, ans=0.125 2023-06-18 09:50:09,534 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 3.414e+02 4.153e+02 5.110e+02 1.047e+03, threshold=8.306e+02, percent-clipped=3.0 2023-06-18 09:50:34,874 INFO [train.py:996] (0/4) Epoch 1, batch 29950, loss[loss=0.3824, simple_loss=0.4133, pruned_loss=0.1757, over 21482.00 frames. ], tot_loss[loss=0.3376, simple_loss=0.3841, pruned_loss=0.1455, over 4277778.91 frames. ], batch size: 194, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:50:42,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=179700.0, ans=0.0 2023-06-18 09:51:41,577 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-18 09:52:09,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.57 vs. limit=10.0 2023-06-18 09:52:17,252 INFO [train.py:996] (0/4) Epoch 1, batch 30000, loss[loss=0.2853, simple_loss=0.3555, pruned_loss=0.1075, over 21612.00 frames. ], tot_loss[loss=0.3399, simple_loss=0.3873, pruned_loss=0.1462, over 4275472.91 frames. ], batch size: 230, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:52:17,253 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 09:52:35,094 INFO [train.py:1028] (0/4) Epoch 1, validation: loss=0.2819, simple_loss=0.3813, pruned_loss=0.09129, over 1796401.00 frames. 2023-06-18 09:52:35,095 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 09:52:42,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. limit=10.0 2023-06-18 09:52:59,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=180060.0, ans=0.125 2023-06-18 09:53:19,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=180120.0, ans=0.0 2023-06-18 09:53:28,157 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.55 vs. limit=22.5 2023-06-18 09:53:54,184 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.532e+02 3.360e+02 4.099e+02 5.189e+02 8.987e+02, threshold=8.197e+02, percent-clipped=1.0 2023-06-18 09:54:19,761 INFO [train.py:996] (0/4) Epoch 1, batch 30050, loss[loss=0.3234, simple_loss=0.4141, pruned_loss=0.1164, over 21754.00 frames. ], tot_loss[loss=0.3352, simple_loss=0.3891, pruned_loss=0.1406, over 4274794.85 frames. 
], batch size: 351, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 09:55:39,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=180540.0, ans=0.0 2023-06-18 09:55:56,183 INFO [train.py:996] (0/4) Epoch 1, batch 30100, loss[loss=0.2749, simple_loss=0.3254, pruned_loss=0.1122, over 21720.00 frames. ], tot_loss[loss=0.3339, simple_loss=0.3874, pruned_loss=0.1402, over 4261745.15 frames. ], batch size: 282, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 09:56:05,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=180600.0, ans=0.0 2023-06-18 09:56:50,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=180720.0, ans=0.1 2023-06-18 09:56:51,918 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:56:55,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=180780.0, ans=0.1 2023-06-18 09:57:11,279 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 3.434e+02 4.118e+02 5.111e+02 9.252e+02, threshold=8.235e+02, percent-clipped=1.0 2023-06-18 09:57:17,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=180840.0, ans=0.125 2023-06-18 09:57:31,619 INFO [train.py:996] (0/4) Epoch 1, batch 30150, loss[loss=0.3376, simple_loss=0.3844, pruned_loss=0.1454, over 21282.00 frames. ], tot_loss[loss=0.3348, simple_loss=0.384, pruned_loss=0.1428, over 4268944.90 frames. ], batch size: 159, lr: 2.21e-02, grad_scale: 64.0 2023-06-18 09:58:40,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=181080.0, ans=0.125 2023-06-18 09:58:54,456 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2023-06-18 09:59:00,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=181140.0, ans=0.2 2023-06-18 09:59:05,031 INFO [train.py:996] (0/4) Epoch 1, batch 30200, loss[loss=0.4289, simple_loss=0.4835, pruned_loss=0.1872, over 21422.00 frames. ], tot_loss[loss=0.3361, simple_loss=0.3872, pruned_loss=0.1425, over 4267897.44 frames. ], batch size: 507, lr: 2.21e-02, grad_scale: 64.0 2023-06-18 09:59:13,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=181200.0, ans=0.0 2023-06-18 09:59:44,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=181260.0, ans=0.1 2023-06-18 10:00:10,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=181380.0, ans=0.125 2023-06-18 10:00:19,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=181380.0, ans=0.0 2023-06-18 10:00:23,213 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.642e+02 3.548e+02 4.753e+02 6.643e+02 1.324e+03, threshold=9.506e+02, percent-clipped=12.0 2023-06-18 10:00:51,982 INFO [train.py:996] (0/4) Epoch 1, batch 30250, loss[loss=0.327, simple_loss=0.4049, pruned_loss=0.1245, over 21423.00 frames. 
], tot_loss[loss=0.3462, simple_loss=0.398, pruned_loss=0.1472, over 4268791.24 frames. ], batch size: 131, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 10:01:28,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=181560.0, ans=0.04949747468305833 2023-06-18 10:02:00,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-18 10:02:22,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=181740.0, ans=0.125 2023-06-18 10:02:32,635 INFO [train.py:996] (0/4) Epoch 1, batch 30300, loss[loss=0.3306, simple_loss=0.3622, pruned_loss=0.1495, over 21608.00 frames. ], tot_loss[loss=0.3431, simple_loss=0.3933, pruned_loss=0.1465, over 4268807.39 frames. ], batch size: 298, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 10:02:40,108 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2023-06-18 10:03:09,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=181920.0, ans=0.125 2023-06-18 10:03:38,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=181980.0, ans=0.125 2023-06-18 10:03:47,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 3.818e+02 4.498e+02 5.757e+02 1.296e+03, threshold=8.996e+02, percent-clipped=5.0 2023-06-18 10:04:04,303 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:04:11,456 INFO [train.py:996] (0/4) Epoch 1, batch 30350, loss[loss=0.3037, simple_loss=0.3381, pruned_loss=0.1346, over 21275.00 frames. ], tot_loss[loss=0.3453, simple_loss=0.3951, pruned_loss=0.1478, over 4271378.19 frames. ], batch size: 176, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 10:04:36,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=182160.0, ans=0.125 2023-06-18 10:04:38,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-18 10:04:46,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=182220.0, ans=0.09899494936611666 2023-06-18 10:05:17,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=182340.0, ans=0.0 2023-06-18 10:05:21,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=182340.0, ans=0.2 2023-06-18 10:05:25,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=182400.0, ans=0.09899494936611666 2023-06-18 10:05:26,627 INFO [train.py:996] (0/4) Epoch 1, batch 30400, loss[loss=0.3052, simple_loss=0.3333, pruned_loss=0.1386, over 20174.00 frames. ], tot_loss[loss=0.3346, simple_loss=0.3844, pruned_loss=0.1424, over 4264263.79 frames. 
], batch size: 702, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 10:05:44,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=182460.0, ans=0.0 2023-06-18 10:05:55,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=182460.0, ans=0.125 2023-06-18 10:06:31,074 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 4.537e+02 5.674e+02 8.183e+02 2.727e+03, threshold=1.135e+03, percent-clipped=13.0 2023-06-18 10:06:47,840 INFO [train.py:996] (0/4) Epoch 1, batch 30450, loss[loss=0.3923, simple_loss=0.4721, pruned_loss=0.1562, over 19950.00 frames. ], tot_loss[loss=0.3377, simple_loss=0.3872, pruned_loss=0.1441, over 4204834.47 frames. ], batch size: 702, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 10:07:06,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=182760.0, ans=0.5 2023-06-18 10:07:12,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=182760.0, ans=0.2 2023-06-18 10:07:34,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=182880.0, ans=0.125 2023-06-18 10:07:35,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-18 10:07:39,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=182880.0, ans=0.0 2023-06-18 10:07:55,076 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/epoch-1.pt 2023-06-18 10:09:26,531 INFO [train.py:996] (0/4) Epoch 2, batch 0, loss[loss=0.4689, simple_loss=0.4437, pruned_loss=0.247, over 21351.00 frames. ], tot_loss[loss=0.4689, simple_loss=0.4437, pruned_loss=0.247, over 21351.00 frames. ], batch size: 473, lr: 2.01e-02, grad_scale: 32.0 2023-06-18 10:09:26,533 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 10:09:33,383 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.0044, 4.0909, 3.7673, 3.8230], device='cuda:0') 2023-06-18 10:09:35,003 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.9767, 1.7876, 2.5583, 2.5616], device='cuda:0') 2023-06-18 10:09:43,672 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.3124, simple_loss=0.4068, pruned_loss=0.109, over 1796401.00 frames. 2023-06-18 10:09:43,673 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 10:09:47,699 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.52 vs. limit=15.0 2023-06-18 10:09:49,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.52 vs. 
limit=15.0 2023-06-18 10:10:04,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=183030.0, ans=0.1 2023-06-18 10:10:19,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=183090.0, ans=0.125 2023-06-18 10:10:35,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=183150.0, ans=10.0 2023-06-18 10:10:45,496 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=8.0 2023-06-18 10:10:52,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183150.0, ans=0.1 2023-06-18 10:10:52,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=183150.0, ans=0.1 2023-06-18 10:10:56,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=183210.0, ans=0.0 2023-06-18 10:11:04,123 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.518e+02 4.740e+02 6.519e+02 1.031e+03 2.172e+03, threshold=1.304e+03, percent-clipped=18.0 2023-06-18 10:11:10,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.49 vs. limit=6.0 2023-06-18 10:11:10,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=183210.0, ans=0.1 2023-06-18 10:11:13,456 INFO [train.py:996] (0/4) Epoch 2, batch 50, loss[loss=0.3377, simple_loss=0.3893, pruned_loss=0.143, over 21695.00 frames. ], tot_loss[loss=0.3471, simple_loss=0.3938, pruned_loss=0.1502, over 962889.02 frames. ], batch size: 332, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:11:40,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=183330.0, ans=0.0 2023-06-18 10:11:40,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=183330.0, ans=0.2 2023-06-18 10:12:01,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=183390.0, ans=0.0 2023-06-18 10:12:05,927 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:12:12,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-18 10:12:32,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=183450.0, ans=0.2 2023-06-18 10:12:44,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0 2023-06-18 10:12:49,812 INFO [train.py:996] (0/4) Epoch 2, batch 100, loss[loss=0.3434, simple_loss=0.4168, pruned_loss=0.135, over 21840.00 frames. ], tot_loss[loss=0.3492, simple_loss=0.4046, pruned_loss=0.1469, over 1688939.06 frames. 
], batch size: 316, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:12:52,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-06-18 10:13:01,711 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.38 vs. limit=6.0 2023-06-18 10:14:07,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183750.0, ans=0.1 2023-06-18 10:14:16,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.204e+02 3.494e+02 4.383e+02 5.480e+02 8.773e+02, threshold=8.766e+02, percent-clipped=0.0 2023-06-18 10:14:22,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=183810.0, ans=0.125 2023-06-18 10:14:29,956 INFO [train.py:996] (0/4) Epoch 2, batch 150, loss[loss=0.3232, simple_loss=0.3992, pruned_loss=0.1236, over 21745.00 frames. ], tot_loss[loss=0.3554, simple_loss=0.4114, pruned_loss=0.1497, over 2263587.57 frames. ], batch size: 351, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:15:38,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=184050.0, ans=0.0 2023-06-18 10:15:40,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=184050.0, ans=0.125 2023-06-18 10:15:59,065 INFO [train.py:996] (0/4) Epoch 2, batch 200, loss[loss=0.3153, simple_loss=0.3676, pruned_loss=0.1315, over 21162.00 frames. ], tot_loss[loss=0.3465, simple_loss=0.4039, pruned_loss=0.1446, over 2711956.84 frames. ], batch size: 143, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:16:06,623 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.76 vs. limit=22.5 2023-06-18 10:16:22,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=184230.0, ans=0.0 2023-06-18 10:16:45,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.36 vs. limit=10.0 2023-06-18 10:16:46,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=184290.0, ans=0.0 2023-06-18 10:17:20,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=184410.0, ans=0.0 2023-06-18 10:17:21,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=184410.0, ans=0.125 2023-06-18 10:17:24,371 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.289e+02 3.721e+02 4.278e+02 5.715e+02 9.625e+02, threshold=8.556e+02, percent-clipped=2.0 2023-06-18 10:17:33,214 INFO [train.py:996] (0/4) Epoch 2, batch 250, loss[loss=0.331, simple_loss=0.3723, pruned_loss=0.1448, over 21258.00 frames. ], tot_loss[loss=0.3419, simple_loss=0.398, pruned_loss=0.1429, over 3054691.07 frames. 
], batch size: 159, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:17:39,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=184470.0, ans=0.125 2023-06-18 10:17:41,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=184470.0, ans=0.0 2023-06-18 10:18:09,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=184530.0, ans=0.125 2023-06-18 10:19:15,578 INFO [train.py:996] (0/4) Epoch 2, batch 300, loss[loss=0.3141, simple_loss=0.356, pruned_loss=0.1361, over 21869.00 frames. ], tot_loss[loss=0.3371, simple_loss=0.392, pruned_loss=0.1411, over 3332174.56 frames. ], batch size: 373, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:20:03,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=184890.0, ans=0.125 2023-06-18 10:20:29,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=184950.0, ans=0.125 2023-06-18 10:20:39,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.502e+02 4.825e+02 6.381e+02 1.072e+03, threshold=9.650e+02, percent-clipped=6.0 2023-06-18 10:20:41,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=185010.0, ans=0.125 2023-06-18 10:20:48,552 INFO [train.py:996] (0/4) Epoch 2, batch 350, loss[loss=0.3013, simple_loss=0.3568, pruned_loss=0.1229, over 21207.00 frames. ], tot_loss[loss=0.3314, simple_loss=0.384, pruned_loss=0.1395, over 3532574.88 frames. ], batch size: 159, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:21:05,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185070.0, ans=0.1 2023-06-18 10:21:28,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185190.0, ans=0.1 2023-06-18 10:21:34,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=185190.0, ans=0.2 2023-06-18 10:21:39,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=185190.0, ans=0.125 2023-06-18 10:22:18,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=185310.0, ans=0.125 2023-06-18 10:22:26,781 INFO [train.py:996] (0/4) Epoch 2, batch 400, loss[loss=0.2588, simple_loss=0.3044, pruned_loss=0.1066, over 21625.00 frames. ], tot_loss[loss=0.3256, simple_loss=0.3767, pruned_loss=0.1372, over 3694133.34 frames. ], batch size: 247, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:22:41,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=185370.0, ans=15.0 2023-06-18 10:22:41,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.38 vs. 
limit=6.0 2023-06-18 10:22:45,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=185430.0, ans=0.125 2023-06-18 10:22:54,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=185430.0, ans=0.0 2023-06-18 10:23:00,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=185430.0, ans=0.125 2023-06-18 10:23:41,953 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:23:49,220 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.316e+02 4.407e+02 6.179e+02 1.311e+03, threshold=8.814e+02, percent-clipped=2.0 2023-06-18 10:23:58,124 INFO [train.py:996] (0/4) Epoch 2, batch 450, loss[loss=0.271, simple_loss=0.3227, pruned_loss=0.1096, over 21267.00 frames. ], tot_loss[loss=0.3179, simple_loss=0.3693, pruned_loss=0.1332, over 3827268.99 frames. ], batch size: 176, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:24:27,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-06-18 10:24:42,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.94 vs. limit=15.0 2023-06-18 10:25:28,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-18 10:25:33,381 INFO [train.py:996] (0/4) Epoch 2, batch 500, loss[loss=0.2662, simple_loss=0.3168, pruned_loss=0.1078, over 21539.00 frames. ], tot_loss[loss=0.3166, simple_loss=0.3719, pruned_loss=0.1307, over 3929509.71 frames. ], batch size: 263, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:26:06,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=186030.0, ans=0.0 2023-06-18 10:26:49,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=186210.0, ans=0.125 2023-06-18 10:26:51,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=186210.0, ans=0.2 2023-06-18 10:26:55,383 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.837e+02 4.927e+02 6.704e+02 1.422e+03, threshold=9.853e+02, percent-clipped=11.0 2023-06-18 10:27:04,406 INFO [train.py:996] (0/4) Epoch 2, batch 550, loss[loss=0.2569, simple_loss=0.3226, pruned_loss=0.09555, over 21602.00 frames. ], tot_loss[loss=0.3164, simple_loss=0.3731, pruned_loss=0.1299, over 4007972.38 frames. ], batch size: 263, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:27:04,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=186270.0, ans=0.0 2023-06-18 10:27:50,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=186390.0, ans=0.0 2023-06-18 10:27:55,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=186390.0, ans=0.0 2023-06-18 10:28:46,194 INFO [train.py:996] (0/4) Epoch 2, batch 600, loss[loss=0.3633, simple_loss=0.4114, pruned_loss=0.1576, over 21906.00 frames. 
], tot_loss[loss=0.3164, simple_loss=0.3733, pruned_loss=0.1297, over 4064789.56 frames. ], batch size: 316, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:28:46,577 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:29:20,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=12.0 2023-06-18 10:29:25,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=186690.0, ans=0.125 2023-06-18 10:30:07,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.420e+02 4.275e+02 5.622e+02 1.549e+03, threshold=8.550e+02, percent-clipped=4.0 2023-06-18 10:30:21,295 INFO [train.py:996] (0/4) Epoch 2, batch 650, loss[loss=0.3384, simple_loss=0.3703, pruned_loss=0.1533, over 14842.00 frames. ], tot_loss[loss=0.3179, simple_loss=0.3745, pruned_loss=0.1306, over 4104542.14 frames. ], batch size: 60, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:30:29,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=186870.0, ans=0.1 2023-06-18 10:30:30,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=186870.0, ans=0.125 2023-06-18 10:30:38,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=186930.0, ans=0.0 2023-06-18 10:30:57,253 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-18 10:31:11,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.95 vs. limit=15.0 2023-06-18 10:31:21,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=187050.0, ans=0.125 2023-06-18 10:31:46,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=187110.0, ans=0.125 2023-06-18 10:31:56,865 INFO [train.py:996] (0/4) Epoch 2, batch 700, loss[loss=0.3395, simple_loss=0.3868, pruned_loss=0.1461, over 15798.00 frames. ], tot_loss[loss=0.3192, simple_loss=0.3755, pruned_loss=0.1314, over 4140571.42 frames. ], batch size: 60, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:32:25,580 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.45 vs. limit=6.0 2023-06-18 10:33:14,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.98 vs. limit=10.0 2023-06-18 10:33:18,462 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 3.924e+02 4.620e+02 5.995e+02 1.020e+03, threshold=9.239e+02, percent-clipped=3.0 2023-06-18 10:33:25,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=187410.0, ans=0.125 2023-06-18 10:33:32,509 INFO [train.py:996] (0/4) Epoch 2, batch 750, loss[loss=0.317, simple_loss=0.3586, pruned_loss=0.1377, over 21926.00 frames. 
], tot_loss[loss=0.321, simple_loss=0.3748, pruned_loss=0.1336, over 4182621.15 frames. ], batch size: 316, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:33:50,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=187530.0, ans=0.125 2023-06-18 10:34:11,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=187590.0, ans=0.0 2023-06-18 10:34:32,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-18 10:35:06,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=187770.0, ans=0.125 2023-06-18 10:35:07,275 INFO [train.py:996] (0/4) Epoch 2, batch 800, loss[loss=0.3101, simple_loss=0.3521, pruned_loss=0.1341, over 21450.00 frames. ], tot_loss[loss=0.3214, simple_loss=0.3739, pruned_loss=0.1344, over 4207460.01 frames. ], batch size: 211, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:35:09,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=187770.0, ans=0.125 2023-06-18 10:35:10,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=187770.0, ans=0.0 2023-06-18 10:35:45,203 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.21 vs. limit=6.0 2023-06-18 10:36:12,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=187950.0, ans=0.02 2023-06-18 10:36:18,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=187950.0, ans=0.125 2023-06-18 10:36:21,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=188010.0, ans=0.5 2023-06-18 10:36:24,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=188010.0, ans=0.125 2023-06-18 10:36:28,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.369e+02 3.568e+02 4.374e+02 5.699e+02 1.207e+03, threshold=8.749e+02, percent-clipped=3.0 2023-06-18 10:36:39,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=188010.0, ans=0.0 2023-06-18 10:36:41,765 INFO [train.py:996] (0/4) Epoch 2, batch 850, loss[loss=0.3321, simple_loss=0.3734, pruned_loss=0.1454, over 21929.00 frames. ], tot_loss[loss=0.3199, simple_loss=0.3714, pruned_loss=0.1341, over 4225132.38 frames. ], batch size: 107, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:36:45,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=188070.0, ans=0.125 2023-06-18 10:36:58,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=188130.0, ans=0.125 2023-06-18 10:37:50,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=188250.0, ans=0.125 2023-06-18 10:38:17,638 INFO [train.py:996] (0/4) Epoch 2, batch 900, loss[loss=0.3664, simple_loss=0.3915, pruned_loss=0.1706, over 21592.00 frames. 
], tot_loss[loss=0.3192, simple_loss=0.37, pruned_loss=0.1342, over 4240289.87 frames. ], batch size: 471, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:39:14,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=188550.0, ans=0.125 2023-06-18 10:39:45,695 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.152e+02 3.796e+02 5.190e+02 9.493e+02, threshold=7.592e+02, percent-clipped=1.0 2023-06-18 10:39:55,108 INFO [train.py:996] (0/4) Epoch 2, batch 950, loss[loss=0.269, simple_loss=0.3436, pruned_loss=0.0972, over 21864.00 frames. ], tot_loss[loss=0.314, simple_loss=0.365, pruned_loss=0.1315, over 4239167.21 frames. ], batch size: 351, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:40:50,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=188790.0, ans=0.0 2023-06-18 10:41:04,686 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.93 vs. limit=15.0 2023-06-18 10:41:18,172 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=22.5 2023-06-18 10:41:30,581 INFO [train.py:996] (0/4) Epoch 2, batch 1000, loss[loss=0.2807, simple_loss=0.3209, pruned_loss=0.1202, over 21585.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.3645, pruned_loss=0.131, over 4251648.35 frames. ], batch size: 247, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:42:26,884 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:43:00,419 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.236e+02 4.022e+02 4.696e+02 7.726e+02, threshold=8.043e+02, percent-clipped=1.0 2023-06-18 10:43:06,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-18 10:43:09,841 INFO [train.py:996] (0/4) Epoch 2, batch 1050, loss[loss=0.3352, simple_loss=0.3799, pruned_loss=0.1452, over 21307.00 frames. ], tot_loss[loss=0.3156, simple_loss=0.3671, pruned_loss=0.1321, over 4262925.13 frames. ], batch size: 159, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:43:22,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=189270.0, ans=0.5 2023-06-18 10:43:25,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=189270.0, ans=0.09899494936611666 2023-06-18 10:43:41,830 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=22.5 2023-06-18 10:44:05,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=189390.0, ans=0.0 2023-06-18 10:44:36,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=189510.0, ans=0.0 2023-06-18 10:44:45,551 INFO [train.py:996] (0/4) Epoch 2, batch 1100, loss[loss=0.335, simple_loss=0.3929, pruned_loss=0.1386, over 21614.00 frames. ], tot_loss[loss=0.3156, simple_loss=0.3676, pruned_loss=0.1318, over 4276520.47 frames. 
], batch size: 441, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:44:57,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.61 vs. limit=10.0 2023-06-18 10:45:45,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=189690.0, ans=0.125 2023-06-18 10:46:13,014 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.368e+02 4.116e+02 5.178e+02 8.115e+02 1.294e+03, threshold=1.036e+03, percent-clipped=24.0 2023-06-18 10:46:24,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=189810.0, ans=0.2 2023-06-18 10:46:27,150 INFO [train.py:996] (0/4) Epoch 2, batch 1150, loss[loss=0.3592, simple_loss=0.4155, pruned_loss=0.1514, over 21808.00 frames. ], tot_loss[loss=0.3155, simple_loss=0.3674, pruned_loss=0.1318, over 4274310.48 frames. ], batch size: 371, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:48:04,111 INFO [train.py:996] (0/4) Epoch 2, batch 1200, loss[loss=0.3253, simple_loss=0.394, pruned_loss=0.1283, over 21632.00 frames. ], tot_loss[loss=0.3172, simple_loss=0.3691, pruned_loss=0.1327, over 4280023.56 frames. ], batch size: 389, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:48:35,051 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.89 vs. limit=15.0 2023-06-18 10:49:05,114 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:49:32,218 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.616e+02 4.499e+02 6.126e+02 1.054e+03, threshold=8.999e+02, percent-clipped=1.0 2023-06-18 10:49:41,285 INFO [train.py:996] (0/4) Epoch 2, batch 1250, loss[loss=0.3415, simple_loss=0.3995, pruned_loss=0.1417, over 21703.00 frames. ], tot_loss[loss=0.3226, simple_loss=0.3743, pruned_loss=0.1354, over 4291839.51 frames. ], batch size: 389, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:50:04,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=190470.0, ans=0.125 2023-06-18 10:50:26,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=190590.0, ans=0.0 2023-06-18 10:50:30,018 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-18 10:50:33,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-18 10:51:23,601 INFO [train.py:996] (0/4) Epoch 2, batch 1300, loss[loss=0.2789, simple_loss=0.3373, pruned_loss=0.1102, over 21439.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3762, pruned_loss=0.1357, over 4293336.23 frames. 
], batch size: 211, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:51:31,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=190770.0, ans=0.0 2023-06-18 10:52:19,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=190950.0, ans=10.0 2023-06-18 10:52:44,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=191010.0, ans=0.125 2023-06-18 10:52:45,328 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.347e+02 3.558e+02 4.499e+02 5.832e+02 1.027e+03, threshold=8.998e+02, percent-clipped=2.0 2023-06-18 10:52:47,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=191010.0, ans=0.125 2023-06-18 10:53:00,057 INFO [train.py:996] (0/4) Epoch 2, batch 1350, loss[loss=0.3236, simple_loss=0.3624, pruned_loss=0.1424, over 21851.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3776, pruned_loss=0.1365, over 4295449.56 frames. ], batch size: 282, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:53:08,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=191070.0, ans=0.125 2023-06-18 10:53:34,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=191130.0, ans=0.0 2023-06-18 10:53:38,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=12.0 2023-06-18 10:53:53,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=191190.0, ans=0.0 2023-06-18 10:54:32,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=191310.0, ans=0.5 2023-06-18 10:54:40,960 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=21.33 vs. limit=15.0 2023-06-18 10:54:41,384 INFO [train.py:996] (0/4) Epoch 2, batch 1400, loss[loss=0.3047, simple_loss=0.3528, pruned_loss=0.1283, over 15233.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3746, pruned_loss=0.1366, over 4293911.88 frames. ], batch size: 60, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:54:51,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=191370.0, ans=0.2 2023-06-18 10:55:05,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=191430.0, ans=0.125 2023-06-18 10:56:04,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.616e+02 4.167e+02 4.895e+02 9.301e+02, threshold=8.333e+02, percent-clipped=3.0 2023-06-18 10:56:18,039 INFO [train.py:996] (0/4) Epoch 2, batch 1450, loss[loss=0.3001, simple_loss=0.351, pruned_loss=0.1246, over 21838.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3744, pruned_loss=0.1375, over 4292533.19 frames. ], batch size: 102, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:56:49,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.18 vs. 
limit=15.0 2023-06-18 10:57:33,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=191910.0, ans=0.0 2023-06-18 10:57:40,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=191910.0, ans=0.0 2023-06-18 10:57:53,540 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-18 10:57:54,116 INFO [train.py:996] (0/4) Epoch 2, batch 1500, loss[loss=0.3826, simple_loss=0.4178, pruned_loss=0.1737, over 21214.00 frames. ], tot_loss[loss=0.3269, simple_loss=0.3762, pruned_loss=0.1388, over 4294622.49 frames. ], batch size: 143, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:57:56,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=191970.0, ans=0.025 2023-06-18 10:58:00,627 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-32000.pt 2023-06-18 10:58:08,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=15.0 2023-06-18 10:58:42,338 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-18 10:59:22,600 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.381e+02 3.361e+02 4.007e+02 4.888e+02 8.078e+02, threshold=8.013e+02, percent-clipped=0.0 2023-06-18 10:59:23,126 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:59:25,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=192210.0, ans=0.125 2023-06-18 10:59:32,405 INFO [train.py:996] (0/4) Epoch 2, batch 1550, loss[loss=0.2728, simple_loss=0.3111, pruned_loss=0.1173, over 21629.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3734, pruned_loss=0.1374, over 4294116.19 frames. ], batch size: 247, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:00:16,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-18 11:01:02,076 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-18 11:01:04,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=192510.0, ans=0.125 2023-06-18 11:01:16,373 INFO [train.py:996] (0/4) Epoch 2, batch 1600, loss[loss=0.3367, simple_loss=0.3854, pruned_loss=0.144, over 21721.00 frames. ], tot_loss[loss=0.3225, simple_loss=0.3716, pruned_loss=0.1367, over 4278376.05 frames. 
], batch size: 351, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:01:58,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=192690.0, ans=0.0 2023-06-18 11:02:13,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=192750.0, ans=0.2 2023-06-18 11:02:32,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=192810.0, ans=0.0 2023-06-18 11:02:37,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=192810.0, ans=0.0 2023-06-18 11:02:44,874 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.226e+02 3.757e+02 4.591e+02 6.473e+02 1.240e+03, threshold=9.183e+02, percent-clipped=13.0 2023-06-18 11:02:54,039 INFO [train.py:996] (0/4) Epoch 2, batch 1650, loss[loss=0.2649, simple_loss=0.3424, pruned_loss=0.09374, over 21565.00 frames. ], tot_loss[loss=0.321, simple_loss=0.3715, pruned_loss=0.1352, over 4281644.69 frames. ], batch size: 230, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:03:21,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=192930.0, ans=0.125 2023-06-18 11:04:31,335 INFO [train.py:996] (0/4) Epoch 2, batch 1700, loss[loss=0.326, simple_loss=0.3809, pruned_loss=0.1356, over 21309.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3747, pruned_loss=0.1365, over 4281702.60 frames. ], batch size: 159, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:04:52,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=193230.0, ans=0.0 2023-06-18 11:05:56,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.866e+02 3.867e+02 4.593e+02 5.670e+02 8.844e+02, threshold=9.185e+02, percent-clipped=0.0 2023-06-18 11:06:05,801 INFO [train.py:996] (0/4) Epoch 2, batch 1750, loss[loss=0.224, simple_loss=0.3108, pruned_loss=0.06858, over 21716.00 frames. ], tot_loss[loss=0.3221, simple_loss=0.3757, pruned_loss=0.1342, over 4286521.53 frames. ], batch size: 298, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:06:18,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=193470.0, ans=0.125 2023-06-18 11:07:38,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=193770.0, ans=0.1 2023-06-18 11:07:38,833 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.53 vs. limit=22.5 2023-06-18 11:07:39,025 INFO [train.py:996] (0/4) Epoch 2, batch 1800, loss[loss=0.2714, simple_loss=0.3243, pruned_loss=0.1092, over 21182.00 frames. ], tot_loss[loss=0.3166, simple_loss=0.3723, pruned_loss=0.1304, over 4289207.20 frames. 
], batch size: 159, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:07:47,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=193770.0, ans=0.125 2023-06-18 11:08:04,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=193830.0, ans=0.0 2023-06-18 11:08:35,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=193890.0, ans=0.125 2023-06-18 11:08:47,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=193950.0, ans=0.125 2023-06-18 11:09:00,697 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=22.5 2023-06-18 11:09:01,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=194010.0, ans=0.125 2023-06-18 11:09:07,960 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.185e+02 3.782e+02 4.585e+02 7.556e+02, threshold=7.564e+02, percent-clipped=0.0 2023-06-18 11:09:13,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=194010.0, ans=0.1 2023-06-18 11:09:17,327 INFO [train.py:996] (0/4) Epoch 2, batch 1850, loss[loss=0.3239, simple_loss=0.375, pruned_loss=0.1364, over 21658.00 frames. ], tot_loss[loss=0.3106, simple_loss=0.37, pruned_loss=0.1256, over 4291878.64 frames. ], batch size: 263, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:10:14,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=194190.0, ans=0.04949747468305833 2023-06-18 11:10:38,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=194310.0, ans=0.0 2023-06-18 11:10:48,421 INFO [train.py:996] (0/4) Epoch 2, batch 1900, loss[loss=0.341, simple_loss=0.4166, pruned_loss=0.1327, over 21471.00 frames. ], tot_loss[loss=0.3134, simple_loss=0.372, pruned_loss=0.1274, over 4294567.15 frames. ], batch size: 507, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:11:06,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=194370.0, ans=0.2 2023-06-18 11:11:19,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=194430.0, ans=15.0 2023-06-18 11:11:57,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=194550.0, ans=0.125 2023-06-18 11:12:02,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=194550.0, ans=0.2 2023-06-18 11:12:11,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=194610.0, ans=0.1 2023-06-18 11:12:15,718 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 3.697e+02 4.739e+02 6.641e+02 1.232e+03, threshold=9.479e+02, percent-clipped=18.0 2023-06-18 11:12:24,854 INFO [train.py:996] (0/4) Epoch 2, batch 1950, loss[loss=0.2779, simple_loss=0.3288, pruned_loss=0.1135, over 15430.00 frames. 
], tot_loss[loss=0.315, simple_loss=0.3709, pruned_loss=0.1296, over 4276442.60 frames. ], batch size: 60, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:12:40,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=22.5 2023-06-18 11:12:47,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=194670.0, ans=0.125 2023-06-18 11:12:49,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.62 vs. limit=22.5 2023-06-18 11:12:51,078 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.78 vs. limit=15.0 2023-06-18 11:12:52,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=194730.0, ans=0.0 2023-06-18 11:13:03,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=194730.0, ans=0.05 2023-06-18 11:13:12,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=194730.0, ans=0.125 2023-06-18 11:13:12,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=194730.0, ans=0.125 2023-06-18 11:14:13,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=194970.0, ans=0.125 2023-06-18 11:14:14,862 INFO [train.py:996] (0/4) Epoch 2, batch 2000, loss[loss=0.2906, simple_loss=0.334, pruned_loss=0.1236, over 21610.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3681, pruned_loss=0.1295, over 4273137.70 frames. ], batch size: 247, lr: 1.95e-02, grad_scale: 64.0 2023-06-18 11:14:15,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194970.0, ans=0.1 2023-06-18 11:14:31,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.17 vs. 
limit=6.0 2023-06-18 11:14:51,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=195030.0, ans=0.0 2023-06-18 11:14:52,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=195030.0, ans=0.125 2023-06-18 11:15:01,349 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:15:08,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=195150.0, ans=0.125 2023-06-18 11:15:19,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=195150.0, ans=10.0 2023-06-18 11:15:31,711 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 3.372e+02 4.347e+02 5.379e+02 1.010e+03, threshold=8.694e+02, percent-clipped=3.0 2023-06-18 11:15:38,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=195210.0, ans=0.2 2023-06-18 11:15:45,819 INFO [train.py:996] (0/4) Epoch 2, batch 2050, loss[loss=0.3206, simple_loss=0.3858, pruned_loss=0.1278, over 21776.00 frames. ], tot_loss[loss=0.3102, simple_loss=0.3662, pruned_loss=0.1271, over 4274213.43 frames. ], batch size: 298, lr: 1.95e-02, grad_scale: 64.0 2023-06-18 11:16:23,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=195330.0, ans=0.125 2023-06-18 11:16:25,261 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:17:02,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=195510.0, ans=0.2 2023-06-18 11:17:14,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.32 vs. limit=6.0 2023-06-18 11:17:21,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=195570.0, ans=0.125 2023-06-18 11:17:22,770 INFO [train.py:996] (0/4) Epoch 2, batch 2100, loss[loss=0.3561, simple_loss=0.4101, pruned_loss=0.151, over 21579.00 frames. ], tot_loss[loss=0.315, simple_loss=0.3698, pruned_loss=0.1301, over 4283143.00 frames. 
], batch size: 230, lr: 1.94e-02, grad_scale: 64.0 2023-06-18 11:18:05,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=195690.0, ans=0.5 2023-06-18 11:18:11,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=195690.0, ans=0.0 2023-06-18 11:18:35,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=195810.0, ans=0.0 2023-06-18 11:18:35,289 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:18:36,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=195810.0, ans=0.2 2023-06-18 11:18:46,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=195810.0, ans=0.0 2023-06-18 11:18:52,347 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 3.854e+02 4.642e+02 6.317e+02 1.235e+03, threshold=9.284e+02, percent-clipped=5.0 2023-06-18 11:18:59,862 INFO [train.py:996] (0/4) Epoch 2, batch 2150, loss[loss=0.3265, simple_loss=0.3591, pruned_loss=0.1469, over 21768.00 frames. ], tot_loss[loss=0.3196, simple_loss=0.3736, pruned_loss=0.1328, over 4271481.64 frames. ], batch size: 351, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:19:09,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=195870.0, ans=0.125 2023-06-18 11:20:01,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=15.0 2023-06-18 11:20:14,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=196110.0, ans=0.035 2023-06-18 11:20:32,209 INFO [train.py:996] (0/4) Epoch 2, batch 2200, loss[loss=0.3556, simple_loss=0.3778, pruned_loss=0.1667, over 21434.00 frames. ], tot_loss[loss=0.3208, simple_loss=0.3745, pruned_loss=0.1336, over 4279238.05 frames. ], batch size: 510, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:20:38,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=196170.0, ans=0.2 2023-06-18 11:20:42,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=196170.0, ans=0.0 2023-06-18 11:21:21,147 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.85 vs. limit=12.0 2023-06-18 11:21:40,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=196410.0, ans=0.0 2023-06-18 11:21:51,288 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 3.406e+02 4.174e+02 5.326e+02 1.037e+03, threshold=8.349e+02, percent-clipped=3.0 2023-06-18 11:21:59,032 INFO [train.py:996] (0/4) Epoch 2, batch 2250, loss[loss=0.2896, simple_loss=0.3355, pruned_loss=0.1218, over 21680.00 frames. ], tot_loss[loss=0.316, simple_loss=0.3707, pruned_loss=0.1307, over 4279800.71 frames. 
], batch size: 263, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:22:05,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=196470.0, ans=0.1 2023-06-18 11:22:24,794 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=12.0 2023-06-18 11:22:33,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=196530.0, ans=0.1 2023-06-18 11:22:38,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=196590.0, ans=0.125 2023-06-18 11:23:28,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-18 11:23:28,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=196710.0, ans=0.1 2023-06-18 11:23:34,870 INFO [train.py:996] (0/4) Epoch 2, batch 2300, loss[loss=0.2576, simple_loss=0.3068, pruned_loss=0.1042, over 21429.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.3656, pruned_loss=0.1297, over 4282701.49 frames. ], batch size: 211, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:23:49,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=196770.0, ans=0.1 2023-06-18 11:24:00,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-06-18 11:24:18,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=196890.0, ans=0.125 2023-06-18 11:24:58,610 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 3.491e+02 4.298e+02 5.253e+02 1.181e+03, threshold=8.597e+02, percent-clipped=4.0 2023-06-18 11:25:00,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=197010.0, ans=0.125 2023-06-18 11:25:06,160 INFO [train.py:996] (0/4) Epoch 2, batch 2350, loss[loss=0.3466, simple_loss=0.3754, pruned_loss=0.159, over 21562.00 frames. ], tot_loss[loss=0.3105, simple_loss=0.3617, pruned_loss=0.1297, over 4267855.14 frames. ], batch size: 548, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:25:38,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=197130.0, ans=0.0 2023-06-18 11:26:16,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=197250.0, ans=0.0 2023-06-18 11:26:40,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=197310.0, ans=6.0 2023-06-18 11:26:43,929 INFO [train.py:996] (0/4) Epoch 2, batch 2400, loss[loss=0.3341, simple_loss=0.3797, pruned_loss=0.1443, over 21375.00 frames. ], tot_loss[loss=0.3179, simple_loss=0.3671, pruned_loss=0.1344, over 4274509.85 frames. 
], batch size: 143, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:26:56,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=197370.0, ans=0.1 2023-06-18 11:27:00,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=197370.0, ans=0.125 2023-06-18 11:27:04,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=197370.0, ans=0.0 2023-06-18 11:27:33,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=197490.0, ans=0.0 2023-06-18 11:28:08,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=197610.0, ans=0.0 2023-06-18 11:28:08,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=197610.0, ans=0.125 2023-06-18 11:28:09,076 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0 2023-06-18 11:28:18,677 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.754e+02 3.750e+02 4.331e+02 6.076e+02 1.202e+03, threshold=8.663e+02, percent-clipped=8.0 2023-06-18 11:28:31,362 INFO [train.py:996] (0/4) Epoch 2, batch 2450, loss[loss=0.317, simple_loss=0.3678, pruned_loss=0.1331, over 21536.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3743, pruned_loss=0.1376, over 4278486.90 frames. ], batch size: 414, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:28:33,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=197670.0, ans=0.0 2023-06-18 11:28:56,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=197730.0, ans=0.125 2023-06-18 11:29:09,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-18 11:29:45,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197910.0, ans=0.1 2023-06-18 11:30:04,102 INFO [train.py:996] (0/4) Epoch 2, batch 2500, loss[loss=0.283, simple_loss=0.3707, pruned_loss=0.09763, over 21410.00 frames. ], tot_loss[loss=0.3224, simple_loss=0.371, pruned_loss=0.1369, over 4282547.95 frames. ], batch size: 211, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:30:11,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=197970.0, ans=0.125 2023-06-18 11:30:15,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-18 11:30:21,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=197970.0, ans=0.125 2023-06-18 11:30:25,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.33 vs. 
limit=10.0 2023-06-18 11:30:58,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=198090.0, ans=0.0 2023-06-18 11:31:14,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198150.0, ans=0.1 2023-06-18 11:31:23,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=198210.0, ans=0.0 2023-06-18 11:31:34,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 3.317e+02 4.381e+02 5.204e+02 7.754e+02, threshold=8.763e+02, percent-clipped=1.0 2023-06-18 11:31:36,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=198210.0, ans=0.1 2023-06-18 11:31:46,956 INFO [train.py:996] (0/4) Epoch 2, batch 2550, loss[loss=0.2722, simple_loss=0.3663, pruned_loss=0.08902, over 21559.00 frames. ], tot_loss[loss=0.3188, simple_loss=0.3684, pruned_loss=0.1346, over 4268823.40 frames. ], batch size: 230, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:31:57,577 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-06-18 11:32:15,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=198330.0, ans=0.125 2023-06-18 11:32:44,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=198450.0, ans=0.125 2023-06-18 11:32:44,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=198450.0, ans=0.125 2023-06-18 11:33:18,182 INFO [train.py:996] (0/4) Epoch 2, batch 2600, loss[loss=0.3648, simple_loss=0.3958, pruned_loss=0.1669, over 21337.00 frames. ], tot_loss[loss=0.3218, simple_loss=0.3707, pruned_loss=0.1364, over 4264235.84 frames. ], batch size: 471, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:33:37,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198630.0, ans=0.1 2023-06-18 11:33:46,742 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:34:02,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=198690.0, ans=0.125 2023-06-18 11:34:03,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=198690.0, ans=0.05 2023-06-18 11:34:30,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-18 11:34:47,547 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 3.436e+02 4.244e+02 5.240e+02 1.197e+03, threshold=8.488e+02, percent-clipped=2.0 2023-06-18 11:34:55,298 INFO [train.py:996] (0/4) Epoch 2, batch 2650, loss[loss=0.2942, simple_loss=0.3369, pruned_loss=0.1257, over 21379.00 frames. ], tot_loss[loss=0.3248, simple_loss=0.3729, pruned_loss=0.1384, over 4271568.04 frames. 
], batch size: 176, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:35:57,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=199050.0, ans=0.125 2023-06-18 11:36:15,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=199050.0, ans=0.125 2023-06-18 11:36:17,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=199110.0, ans=0.125 2023-06-18 11:36:21,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=12.0 2023-06-18 11:36:39,095 INFO [train.py:996] (0/4) Epoch 2, batch 2700, loss[loss=0.2734, simple_loss=0.3204, pruned_loss=0.1132, over 21623.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3727, pruned_loss=0.137, over 4270504.83 frames. ], batch size: 230, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:36:44,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=199170.0, ans=0.125 2023-06-18 11:36:57,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-06-18 11:36:59,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=199230.0, ans=0.0 2023-06-18 11:37:12,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=199290.0, ans=0.125 2023-06-18 11:38:00,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=22.5 2023-06-18 11:38:02,899 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.843e+02 4.256e+02 5.020e+02 6.245e+02 1.096e+03, threshold=1.004e+03, percent-clipped=9.0 2023-06-18 11:38:03,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=199410.0, ans=0.2 2023-06-18 11:38:14,797 INFO [train.py:996] (0/4) Epoch 2, batch 2750, loss[loss=0.3597, simple_loss=0.455, pruned_loss=0.1321, over 19710.00 frames. ], tot_loss[loss=0.32, simple_loss=0.3689, pruned_loss=0.1356, over 4271869.20 frames. ], batch size: 702, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:38:24,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-06-18 11:38:39,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=199530.0, ans=0.1 2023-06-18 11:39:20,598 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-18 11:39:55,072 INFO [train.py:996] (0/4) Epoch 2, batch 2800, loss[loss=0.3654, simple_loss=0.4214, pruned_loss=0.1547, over 21797.00 frames. ], tot_loss[loss=0.3243, simple_loss=0.3746, pruned_loss=0.137, over 4275388.88 frames. 
], batch size: 332, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:40:10,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=199830.0, ans=0.125 2023-06-18 11:40:38,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=22.5 2023-06-18 11:41:03,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=199950.0, ans=0.125 2023-06-18 11:41:04,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=199950.0, ans=0.04949747468305833 2023-06-18 11:41:25,819 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.594e+02 3.620e+02 4.325e+02 5.387e+02 9.118e+02, threshold=8.651e+02, percent-clipped=0.0 2023-06-18 11:41:33,775 INFO [train.py:996] (0/4) Epoch 2, batch 2850, loss[loss=0.2712, simple_loss=0.3313, pruned_loss=0.1056, over 21710.00 frames. ], tot_loss[loss=0.3237, simple_loss=0.3735, pruned_loss=0.1369, over 4274310.03 frames. ], batch size: 282, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:41:52,988 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-18 11:42:15,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=200190.0, ans=0.125 2023-06-18 11:42:23,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=12.0 2023-06-18 11:42:55,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-18 11:43:10,738 INFO [train.py:996] (0/4) Epoch 2, batch 2900, loss[loss=0.39, simple_loss=0.4158, pruned_loss=0.1821, over 21545.00 frames. ], tot_loss[loss=0.3199, simple_loss=0.368, pruned_loss=0.1358, over 4274663.83 frames. ], batch size: 471, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:43:49,027 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-06-18 11:44:21,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=200550.0, ans=0.5 2023-06-18 11:44:39,475 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 4.023e+02 4.917e+02 6.862e+02 1.107e+03, threshold=9.834e+02, percent-clipped=8.0 2023-06-18 11:44:41,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=200610.0, ans=0.125 2023-06-18 11:44:43,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=200610.0, ans=0.025 2023-06-18 11:44:47,199 INFO [train.py:996] (0/4) Epoch 2, batch 2950, loss[loss=0.2854, simple_loss=0.3656, pruned_loss=0.1026, over 21819.00 frames. ], tot_loss[loss=0.32, simple_loss=0.3689, pruned_loss=0.1355, over 4277104.30 frames. 
], batch size: 332, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:44:52,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=200670.0, ans=0.125 2023-06-18 11:45:01,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=200730.0, ans=0.2 2023-06-18 11:45:07,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-18 11:45:26,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=200790.0, ans=0.0 2023-06-18 11:45:55,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=200850.0, ans=0.125 2023-06-18 11:46:20,654 INFO [train.py:996] (0/4) Epoch 2, batch 3000, loss[loss=0.282, simple_loss=0.3631, pruned_loss=0.1005, over 21682.00 frames. ], tot_loss[loss=0.3216, simple_loss=0.3721, pruned_loss=0.1356, over 4279557.44 frames. ], batch size: 230, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:46:20,656 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 11:46:36,279 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2851, simple_loss=0.377, pruned_loss=0.09657, over 1796401.00 frames. 2023-06-18 11:46:36,280 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 11:47:28,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=201090.0, ans=0.125 2023-06-18 11:47:45,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=201150.0, ans=0.0 2023-06-18 11:48:07,023 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.149e+02 4.031e+02 5.261e+02 8.201e+02, threshold=8.061e+02, percent-clipped=0.0 2023-06-18 11:48:15,256 INFO [train.py:996] (0/4) Epoch 2, batch 3050, loss[loss=0.3189, simple_loss=0.3767, pruned_loss=0.1305, over 21737.00 frames. ], tot_loss[loss=0.3204, simple_loss=0.3735, pruned_loss=0.1337, over 4281324.18 frames. ], batch size: 414, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:48:31,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=201270.0, ans=0.5 2023-06-18 11:48:39,384 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.34 vs. limit=10.0 2023-06-18 11:48:43,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-18 11:49:15,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=201390.0, ans=0.0 2023-06-18 11:49:46,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=201510.0, ans=0.125 2023-06-18 11:50:02,729 INFO [train.py:996] (0/4) Epoch 2, batch 3100, loss[loss=0.2569, simple_loss=0.3127, pruned_loss=0.1006, over 21285.00 frames. ], tot_loss[loss=0.3183, simple_loss=0.3731, pruned_loss=0.1318, over 4285326.56 frames. 
], batch size: 159, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:50:26,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=201630.0, ans=0.0 2023-06-18 11:50:28,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-18 11:50:41,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=201690.0, ans=0.0 2023-06-18 11:51:31,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.312e+02 4.167e+02 4.991e+02 9.720e+02, threshold=8.334e+02, percent-clipped=2.0 2023-06-18 11:51:39,075 INFO [train.py:996] (0/4) Epoch 2, batch 3150, loss[loss=0.3235, simple_loss=0.364, pruned_loss=0.1415, over 20946.00 frames. ], tot_loss[loss=0.3201, simple_loss=0.3753, pruned_loss=0.1324, over 4289505.44 frames. ], batch size: 608, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:51:52,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=201870.0, ans=0.0 2023-06-18 11:51:53,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-18 11:52:07,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=201930.0, ans=0.125 2023-06-18 11:52:14,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=201930.0, ans=0.1 2023-06-18 11:52:29,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=201990.0, ans=12.0 2023-06-18 11:52:29,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=201990.0, ans=0.95 2023-06-18 11:52:37,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.28 vs. limit=6.0 2023-06-18 11:53:02,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=202110.0, ans=0.1 2023-06-18 11:53:13,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=202110.0, ans=0.05 2023-06-18 11:53:22,384 INFO [train.py:996] (0/4) Epoch 2, batch 3200, loss[loss=0.3504, simple_loss=0.3916, pruned_loss=0.1546, over 21838.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3785, pruned_loss=0.1349, over 4285416.87 frames. ], batch size: 118, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:53:54,778 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.12 vs. 
limit=6.0 2023-06-18 11:54:12,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=202350.0, ans=0.0 2023-06-18 11:54:26,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=202350.0, ans=0.05 2023-06-18 11:54:47,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=202410.0, ans=0.0 2023-06-18 11:54:51,732 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 3.651e+02 4.650e+02 5.913e+02 1.032e+03, threshold=9.300e+02, percent-clipped=10.0 2023-06-18 11:55:04,316 INFO [train.py:996] (0/4) Epoch 2, batch 3250, loss[loss=0.3086, simple_loss=0.3464, pruned_loss=0.1354, over 21685.00 frames. ], tot_loss[loss=0.3271, simple_loss=0.3811, pruned_loss=0.1365, over 4288512.48 frames. ], batch size: 282, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:55:10,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202470.0, ans=0.1 2023-06-18 11:56:10,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=202650.0, ans=0.125 2023-06-18 11:56:43,245 INFO [train.py:996] (0/4) Epoch 2, batch 3300, loss[loss=0.2528, simple_loss=0.3094, pruned_loss=0.09813, over 21526.00 frames. ], tot_loss[loss=0.3258, simple_loss=0.3779, pruned_loss=0.1368, over 4273351.41 frames. ], batch size: 263, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:57:09,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=202830.0, ans=0.07 2023-06-18 11:57:30,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=202890.0, ans=0.125 2023-06-18 11:58:06,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=203010.0, ans=0.125 2023-06-18 11:58:07,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=203010.0, ans=0.125 2023-06-18 11:58:08,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 3.845e+02 4.643e+02 5.721e+02 1.092e+03, threshold=9.285e+02, percent-clipped=5.0 2023-06-18 11:58:10,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=203010.0, ans=0.5 2023-06-18 11:58:16,405 INFO [train.py:996] (0/4) Epoch 2, batch 3350, loss[loss=0.3905, simple_loss=0.4453, pruned_loss=0.1678, over 21463.00 frames. ], tot_loss[loss=0.3286, simple_loss=0.3827, pruned_loss=0.1372, over 4270649.63 frames. ], batch size: 471, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:59:09,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203190.0, ans=0.1 2023-06-18 11:59:40,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=203310.0, ans=0.0 2023-06-18 11:59:53,099 INFO [train.py:996] (0/4) Epoch 2, batch 3400, loss[loss=0.2946, simple_loss=0.352, pruned_loss=0.1186, over 21870.00 frames. ], tot_loss[loss=0.3303, simple_loss=0.3829, pruned_loss=0.1389, over 4273767.68 frames. 
], batch size: 118, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 12:00:01,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=203370.0, ans=0.125 2023-06-18 12:00:17,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=203430.0, ans=0.0 2023-06-18 12:01:18,181 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.307e+02 4.139e+02 5.241e+02 1.031e+03, threshold=8.278e+02, percent-clipped=1.0 2023-06-18 12:01:23,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203610.0, ans=0.1 2023-06-18 12:01:25,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=203670.0, ans=0.2 2023-06-18 12:01:26,575 INFO [train.py:996] (0/4) Epoch 2, batch 3450, loss[loss=0.3909, simple_loss=0.4031, pruned_loss=0.1893, over 21377.00 frames. ], tot_loss[loss=0.3279, simple_loss=0.3787, pruned_loss=0.1385, over 4273527.59 frames. ], batch size: 507, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 12:01:34,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=203670.0, ans=0.0 2023-06-18 12:01:38,809 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=12.0 2023-06-18 12:02:06,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=203730.0, ans=0.125 2023-06-18 12:03:06,553 INFO [train.py:996] (0/4) Epoch 2, batch 3500, loss[loss=0.2759, simple_loss=0.3844, pruned_loss=0.08366, over 20747.00 frames. ], tot_loss[loss=0.3335, simple_loss=0.3843, pruned_loss=0.1413, over 4265806.69 frames. ], batch size: 608, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 12:03:06,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=203970.0, ans=0.1 2023-06-18 12:03:36,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=204030.0, ans=0.2 2023-06-18 12:03:38,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=204030.0, ans=0.04949747468305833 2023-06-18 12:03:41,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=204030.0, ans=0.0 2023-06-18 12:04:13,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=204150.0, ans=0.125 2023-06-18 12:04:27,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=204210.0, ans=0.0 2023-06-18 12:04:35,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.505e+02 3.628e+02 4.522e+02 5.964e+02 1.068e+03, threshold=9.044e+02, percent-clipped=7.0 2023-06-18 12:04:37,798 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-18 12:04:47,806 INFO [train.py:996] (0/4) Epoch 2, batch 3550, loss[loss=0.331, simple_loss=0.37, pruned_loss=0.146, over 21773.00 frames. ], tot_loss[loss=0.3373, simple_loss=0.3876, pruned_loss=0.1435, over 4266301.67 frames. 
], batch size: 351, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:06:26,631 INFO [train.py:996] (0/4) Epoch 2, batch 3600, loss[loss=0.2842, simple_loss=0.3374, pruned_loss=0.1155, over 21731.00 frames. ], tot_loss[loss=0.331, simple_loss=0.3795, pruned_loss=0.1413, over 4262593.10 frames. ], batch size: 247, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:06:55,783 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=15.0 2023-06-18 12:07:14,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=204690.0, ans=0.0 2023-06-18 12:07:57,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.601e+02 3.619e+02 4.242e+02 5.545e+02 1.042e+03, threshold=8.484e+02, percent-clipped=1.0 2023-06-18 12:08:03,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=12.0 2023-06-18 12:08:05,241 INFO [train.py:996] (0/4) Epoch 2, batch 3650, loss[loss=0.2507, simple_loss=0.3216, pruned_loss=0.08989, over 21749.00 frames. ], tot_loss[loss=0.3329, simple_loss=0.3812, pruned_loss=0.1423, over 4261478.45 frames. ], batch size: 247, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:08:24,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=204930.0, ans=0.0 2023-06-18 12:08:53,052 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-18 12:09:20,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=205110.0, ans=0.0 2023-06-18 12:09:41,440 INFO [train.py:996] (0/4) Epoch 2, batch 3700, loss[loss=0.3409, simple_loss=0.373, pruned_loss=0.1544, over 21732.00 frames. ], tot_loss[loss=0.3289, simple_loss=0.3774, pruned_loss=0.1402, over 4271414.38 frames. ], batch size: 230, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:10:32,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=205290.0, ans=0.0 2023-06-18 12:10:32,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=205290.0, ans=0.0 2023-06-18 12:10:37,886 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.70 vs. limit=15.0 2023-06-18 12:11:09,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.222e+02 3.219e+02 3.762e+02 4.567e+02 1.013e+03, threshold=7.524e+02, percent-clipped=2.0 2023-06-18 12:11:22,731 INFO [train.py:996] (0/4) Epoch 2, batch 3750, loss[loss=0.2684, simple_loss=0.3343, pruned_loss=0.1012, over 21859.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3726, pruned_loss=0.1375, over 4281668.91 frames. 
], batch size: 333, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:11:24,640 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:11:25,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=205470.0, ans=0.125 2023-06-18 12:11:52,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=205530.0, ans=0.125 2023-06-18 12:12:59,802 INFO [train.py:996] (0/4) Epoch 2, batch 3800, loss[loss=0.2785, simple_loss=0.3258, pruned_loss=0.1155, over 21659.00 frames. ], tot_loss[loss=0.3214, simple_loss=0.3713, pruned_loss=0.1357, over 4282735.61 frames. ], batch size: 112, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:13:45,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-18 12:13:53,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=205890.0, ans=0.07 2023-06-18 12:14:24,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.21 vs. limit=10.0 2023-06-18 12:14:28,920 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.435e+02 4.330e+02 5.504e+02 8.212e+02, threshold=8.659e+02, percent-clipped=4.0 2023-06-18 12:14:37,047 INFO [train.py:996] (0/4) Epoch 2, batch 3850, loss[loss=0.3052, simple_loss=0.3344, pruned_loss=0.138, over 21335.00 frames. ], tot_loss[loss=0.3225, simple_loss=0.3703, pruned_loss=0.1374, over 4268407.77 frames. ], batch size: 211, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:14:59,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=206070.0, ans=0.0 2023-06-18 12:15:05,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=206130.0, ans=0.1 2023-06-18 12:15:36,607 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.56 vs. limit=10.0 2023-06-18 12:15:40,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=206250.0, ans=0.0 2023-06-18 12:16:01,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=206310.0, ans=0.5 2023-06-18 12:16:09,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=206310.0, ans=0.0 2023-06-18 12:16:13,530 INFO [train.py:996] (0/4) Epoch 2, batch 3900, loss[loss=0.2834, simple_loss=0.3312, pruned_loss=0.1178, over 21534.00 frames. ], tot_loss[loss=0.3209, simple_loss=0.3668, pruned_loss=0.1375, over 4259674.33 frames. 
], batch size: 195, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:16:45,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=206430.0, ans=0.1 2023-06-18 12:17:00,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=206490.0, ans=0.2 2023-06-18 12:17:06,772 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-18 12:17:16,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=206550.0, ans=0.125 2023-06-18 12:17:34,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=206610.0, ans=0.0 2023-06-18 12:17:42,970 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.558e+02 3.735e+02 4.662e+02 6.230e+02 1.205e+03, threshold=9.323e+02, percent-clipped=9.0 2023-06-18 12:17:50,618 INFO [train.py:996] (0/4) Epoch 2, batch 3950, loss[loss=0.3355, simple_loss=0.408, pruned_loss=0.1315, over 21631.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.372, pruned_loss=0.1374, over 4265250.41 frames. ], batch size: 389, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:18:22,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=206730.0, ans=0.125 2023-06-18 12:18:34,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=206790.0, ans=0.1 2023-06-18 12:18:39,852 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.89 vs. limit=6.0 2023-06-18 12:18:56,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=206850.0, ans=0.0 2023-06-18 12:19:14,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=206910.0, ans=0.0 2023-06-18 12:19:27,481 INFO [train.py:996] (0/4) Epoch 2, batch 4000, loss[loss=0.3329, simple_loss=0.3635, pruned_loss=0.1512, over 21862.00 frames. ], tot_loss[loss=0.3167, simple_loss=0.3651, pruned_loss=0.1342, over 4266067.75 frames. ], batch size: 98, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:20:25,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=207090.0, ans=0.0 2023-06-18 12:20:50,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.512e+02 4.093e+02 5.285e+02 8.562e+02, threshold=8.187e+02, percent-clipped=0.0 2023-06-18 12:21:02,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0 2023-06-18 12:21:03,166 INFO [train.py:996] (0/4) Epoch 2, batch 4050, loss[loss=0.3006, simple_loss=0.3644, pruned_loss=0.1184, over 21797.00 frames. ], tot_loss[loss=0.3132, simple_loss=0.3638, pruned_loss=0.1313, over 4275344.77 frames. 
], batch size: 332, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:21:29,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=207330.0, ans=0.0 2023-06-18 12:22:44,812 INFO [train.py:996] (0/4) Epoch 2, batch 4100, loss[loss=0.2947, simple_loss=0.359, pruned_loss=0.1152, over 21563.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.365, pruned_loss=0.1319, over 4281010.81 frames. ], batch size: 195, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:22:46,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=207570.0, ans=0.125 2023-06-18 12:22:54,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=207570.0, ans=0.125 2023-06-18 12:23:26,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=207690.0, ans=0.2 2023-06-18 12:23:38,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=207690.0, ans=0.1 2023-06-18 12:23:57,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-18 12:24:08,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.979e+02 3.498e+02 4.033e+02 7.129e+02, threshold=6.997e+02, percent-clipped=0.0 2023-06-18 12:24:20,659 INFO [train.py:996] (0/4) Epoch 2, batch 4150, loss[loss=0.252, simple_loss=0.3384, pruned_loss=0.08284, over 21343.00 frames. ], tot_loss[loss=0.3092, simple_loss=0.365, pruned_loss=0.1267, over 4279423.29 frames. ], batch size: 194, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:24:23,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=22.5 2023-06-18 12:24:23,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. limit=10.0 2023-06-18 12:24:27,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-06-18 12:25:59,008 INFO [train.py:996] (0/4) Epoch 2, batch 4200, loss[loss=0.2765, simple_loss=0.3327, pruned_loss=0.1102, over 21359.00 frames. ], tot_loss[loss=0.3075, simple_loss=0.363, pruned_loss=0.126, over 4276532.11 frames. 
], batch size: 143, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:26:43,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=208290.0, ans=0.1 2023-06-18 12:26:50,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=208290.0, ans=0.125 2023-06-18 12:27:21,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=208410.0, ans=0.125 2023-06-18 12:27:22,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=208410.0, ans=0.09899494936611666 2023-06-18 12:27:24,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=208410.0, ans=0.125 2023-06-18 12:27:31,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.591e+02 4.541e+02 5.629e+02 1.049e+03, threshold=9.081e+02, percent-clipped=10.0 2023-06-18 12:27:38,048 INFO [train.py:996] (0/4) Epoch 2, batch 4250, loss[loss=0.3902, simple_loss=0.4373, pruned_loss=0.1716, over 21587.00 frames. ], tot_loss[loss=0.3138, simple_loss=0.3704, pruned_loss=0.1286, over 4278406.68 frames. ], batch size: 414, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:27:40,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=208470.0, ans=0.125 2023-06-18 12:28:22,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=208590.0, ans=0.0 2023-06-18 12:28:52,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=208650.0, ans=0.125 2023-06-18 12:28:54,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=208650.0, ans=0.125 2023-06-18 12:28:59,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=208650.0, ans=0.125 2023-06-18 12:29:02,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=208710.0, ans=0.125 2023-06-18 12:29:25,674 INFO [train.py:996] (0/4) Epoch 2, batch 4300, loss[loss=0.2605, simple_loss=0.3364, pruned_loss=0.09234, over 21606.00 frames. ], tot_loss[loss=0.321, simple_loss=0.3774, pruned_loss=0.1323, over 4273995.28 frames. ], batch size: 230, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:29:37,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.12 vs. 
limit=22.5 2023-06-18 12:29:40,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=208770.0, ans=0.125 2023-06-18 12:29:43,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=208770.0, ans=0.125 2023-06-18 12:30:23,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=208890.0, ans=0.125 2023-06-18 12:30:27,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=208950.0, ans=0.2 2023-06-18 12:31:04,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.525e+02 3.305e+02 3.923e+02 4.922e+02 1.064e+03, threshold=7.846e+02, percent-clipped=2.0 2023-06-18 12:31:10,121 INFO [train.py:996] (0/4) Epoch 2, batch 4350, loss[loss=0.2449, simple_loss=0.2945, pruned_loss=0.09764, over 21187.00 frames. ], tot_loss[loss=0.3177, simple_loss=0.374, pruned_loss=0.1307, over 4265610.28 frames. ], batch size: 159, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:31:29,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=209130.0, ans=0.125 2023-06-18 12:31:49,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209190.0, ans=0.1 2023-06-18 12:31:53,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=209190.0, ans=0.0 2023-06-18 12:32:10,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209250.0, ans=0.1 2023-06-18 12:32:24,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=209250.0, ans=0.07 2023-06-18 12:32:47,837 INFO [train.py:996] (0/4) Epoch 2, batch 4400, loss[loss=0.274, simple_loss=0.3206, pruned_loss=0.1137, over 21510.00 frames. ], tot_loss[loss=0.3146, simple_loss=0.3693, pruned_loss=0.1299, over 4262153.23 frames. ], batch size: 230, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:32:54,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=209370.0, ans=0.2 2023-06-18 12:33:24,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=209430.0, ans=0.125 2023-06-18 12:33:25,333 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=22.5 2023-06-18 12:34:10,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=209610.0, ans=0.2 2023-06-18 12:34:12,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=209610.0, ans=0.125 2023-06-18 12:34:12,974 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. 
limit=6.0 2023-06-18 12:34:13,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209610.0, ans=0.1 2023-06-18 12:34:18,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=209610.0, ans=0.2 2023-06-18 12:34:19,645 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.653e+02 4.609e+02 5.842e+02 1.096e+03, threshold=9.217e+02, percent-clipped=5.0 2023-06-18 12:34:30,777 INFO [train.py:996] (0/4) Epoch 2, batch 4450, loss[loss=0.2824, simple_loss=0.3396, pruned_loss=0.1126, over 21926.00 frames. ], tot_loss[loss=0.3219, simple_loss=0.3794, pruned_loss=0.1322, over 4265602.74 frames. ], batch size: 107, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:34:34,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=209670.0, ans=0.0 2023-06-18 12:34:37,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=209670.0, ans=0.125 2023-06-18 12:34:54,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=209730.0, ans=0.125 2023-06-18 12:34:56,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=209730.0, ans=0.0 2023-06-18 12:35:03,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=209730.0, ans=0.125 2023-06-18 12:35:49,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=209910.0, ans=0.2 2023-06-18 12:36:06,142 INFO [train.py:996] (0/4) Epoch 2, batch 4500, loss[loss=0.2983, simple_loss=0.3591, pruned_loss=0.1188, over 21816.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3799, pruned_loss=0.1333, over 4274535.43 frames. ], batch size: 282, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:36:22,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=209970.0, ans=0.0 2023-06-18 12:36:43,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=210090.0, ans=0.0 2023-06-18 12:37:04,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=210090.0, ans=0.0 2023-06-18 12:37:04,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=210090.0, ans=0.1 2023-06-18 12:37:27,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=210210.0, ans=0.125 2023-06-18 12:37:37,915 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.570e+02 3.555e+02 4.065e+02 4.925e+02 8.814e+02, threshold=8.131e+02, percent-clipped=0.0 2023-06-18 12:37:44,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=210210.0, ans=0.1 2023-06-18 12:37:48,834 INFO [train.py:996] (0/4) Epoch 2, batch 4550, loss[loss=0.3674, simple_loss=0.4174, pruned_loss=0.1587, over 21565.00 frames. ], tot_loss[loss=0.3237, simple_loss=0.3816, pruned_loss=0.1329, over 4274628.86 frames. 
], batch size: 414, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:39:25,121 INFO [train.py:996] (0/4) Epoch 2, batch 4600, loss[loss=0.3266, simple_loss=0.3734, pruned_loss=0.1399, over 21745.00 frames. ], tot_loss[loss=0.3313, simple_loss=0.3875, pruned_loss=0.1375, over 4281655.09 frames. ], batch size: 112, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:39:29,028 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:40:46,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=210810.0, ans=0.125 2023-06-18 12:40:56,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.810e+02 3.566e+02 4.300e+02 5.335e+02 1.700e+03, threshold=8.600e+02, percent-clipped=8.0 2023-06-18 12:41:02,509 INFO [train.py:996] (0/4) Epoch 2, batch 4650, loss[loss=0.272, simple_loss=0.327, pruned_loss=0.1085, over 21662.00 frames. ], tot_loss[loss=0.3275, simple_loss=0.3845, pruned_loss=0.1353, over 4283834.50 frames. ], batch size: 414, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:41:10,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=210870.0, ans=0.125 2023-06-18 12:41:26,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=210930.0, ans=0.125 2023-06-18 12:41:51,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=210990.0, ans=0.0 2023-06-18 12:42:04,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=211050.0, ans=0.125 2023-06-18 12:42:22,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=22.5 2023-06-18 12:42:25,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=211110.0, ans=0.0 2023-06-18 12:42:32,561 INFO [train.py:996] (0/4) Epoch 2, batch 4700, loss[loss=0.2918, simple_loss=0.3297, pruned_loss=0.127, over 21326.00 frames. ], tot_loss[loss=0.3189, simple_loss=0.3735, pruned_loss=0.1321, over 4290048.05 frames. ], batch size: 144, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:42:35,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=12.0 2023-06-18 12:42:36,649 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.96 vs. limit=15.0 2023-06-18 12:42:45,639 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.43 vs. 
limit=22.5 2023-06-18 12:43:20,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=211290.0, ans=0.2 2023-06-18 12:43:59,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=211410.0, ans=0.0 2023-06-18 12:44:00,735 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.242e+02 4.193e+02 5.636e+02 1.011e+03, threshold=8.385e+02, percent-clipped=2.0 2023-06-18 12:44:01,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=211410.0, ans=0.125 2023-06-18 12:44:06,712 INFO [train.py:996] (0/4) Epoch 2, batch 4750, loss[loss=0.3985, simple_loss=0.4077, pruned_loss=0.1946, over 21405.00 frames. ], tot_loss[loss=0.316, simple_loss=0.3676, pruned_loss=0.1322, over 4291421.05 frames. ], batch size: 473, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:44:35,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-18 12:45:25,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=211710.0, ans=0.125 2023-06-18 12:45:25,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=211710.0, ans=0.0 2023-06-18 12:45:33,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=211770.0, ans=0.125 2023-06-18 12:45:34,296 INFO [train.py:996] (0/4) Epoch 2, batch 4800, loss[loss=0.2992, simple_loss=0.3867, pruned_loss=0.1058, over 21704.00 frames. ], tot_loss[loss=0.3175, simple_loss=0.3694, pruned_loss=0.1328, over 4291391.36 frames. ], batch size: 389, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:46:12,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=22.5 2023-06-18 12:46:15,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=211890.0, ans=0.125 2023-06-18 12:46:36,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=211950.0, ans=0.125 2023-06-18 12:46:50,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=212010.0, ans=0.025 2023-06-18 12:47:01,227 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.590e+02 4.523e+02 5.544e+02 1.095e+03, threshold=9.046e+02, percent-clipped=1.0 2023-06-18 12:47:07,537 INFO [train.py:996] (0/4) Epoch 2, batch 4850, loss[loss=0.3923, simple_loss=0.4208, pruned_loss=0.1819, over 21756.00 frames. ], tot_loss[loss=0.3146, simple_loss=0.3661, pruned_loss=0.1316, over 4282872.14 frames. 
], batch size: 441, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:47:51,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=212190.0, ans=0.0 2023-06-18 12:48:12,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=212250.0, ans=0.0 2023-06-18 12:48:32,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=212310.0, ans=0.1 2023-06-18 12:48:34,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=212370.0, ans=0.125 2023-06-18 12:48:35,060 INFO [train.py:996] (0/4) Epoch 2, batch 4900, loss[loss=0.3372, simple_loss=0.3952, pruned_loss=0.1396, over 21318.00 frames. ], tot_loss[loss=0.3202, simple_loss=0.3719, pruned_loss=0.1343, over 4280529.44 frames. ], batch size: 548, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:48:45,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0 2023-06-18 12:50:06,445 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 3.496e+02 4.489e+02 5.539e+02 1.137e+03, threshold=8.978e+02, percent-clipped=3.0 2023-06-18 12:50:13,010 INFO [train.py:996] (0/4) Epoch 2, batch 4950, loss[loss=0.2764, simple_loss=0.3698, pruned_loss=0.09147, over 21618.00 frames. ], tot_loss[loss=0.3157, simple_loss=0.3716, pruned_loss=0.1299, over 4275908.64 frames. ], batch size: 389, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:50:59,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=212790.0, ans=0.125 2023-06-18 12:51:39,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=212910.0, ans=0.0 2023-06-18 12:51:46,444 INFO [train.py:996] (0/4) Epoch 2, batch 5000, loss[loss=0.3407, simple_loss=0.3887, pruned_loss=0.1464, over 21937.00 frames. ], tot_loss[loss=0.3095, simple_loss=0.3685, pruned_loss=0.1253, over 4275115.78 frames. ], batch size: 113, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:52:54,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=213150.0, ans=0.0 2023-06-18 12:53:06,328 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.969e+02 3.666e+02 4.897e+02 8.510e+02, threshold=7.332e+02, percent-clipped=0.0 2023-06-18 12:53:12,753 INFO [train.py:996] (0/4) Epoch 2, batch 5050, loss[loss=0.3339, simple_loss=0.3743, pruned_loss=0.1468, over 21854.00 frames. ], tot_loss[loss=0.314, simple_loss=0.3693, pruned_loss=0.1293, over 4282445.02 frames. ], batch size: 118, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:54:27,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=213510.0, ans=0.125 2023-06-18 12:54:39,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=213510.0, ans=0.05 2023-06-18 12:54:43,421 INFO [train.py:996] (0/4) Epoch 2, batch 5100, loss[loss=0.3187, simple_loss=0.3636, pruned_loss=0.1369, over 21843.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.367, pruned_loss=0.1297, over 4285389.51 frames. 
], batch size: 351, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 12:54:48,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=213570.0, ans=0.0 2023-06-18 12:56:13,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.389e+02 4.046e+02 5.054e+02 9.083e+02, threshold=8.093e+02, percent-clipped=6.0 2023-06-18 12:56:15,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=213810.0, ans=0.05 2023-06-18 12:56:17,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=213810.0, ans=0.125 2023-06-18 12:56:19,876 INFO [train.py:996] (0/4) Epoch 2, batch 5150, loss[loss=0.3206, simple_loss=0.3637, pruned_loss=0.1387, over 21892.00 frames. ], tot_loss[loss=0.3122, simple_loss=0.3645, pruned_loss=0.1299, over 4295915.77 frames. ], batch size: 316, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 12:56:23,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=213870.0, ans=0.0 2023-06-18 12:56:49,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=213930.0, ans=0.0 2023-06-18 12:57:33,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=214050.0, ans=0.125 2023-06-18 12:57:35,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=214050.0, ans=0.125 2023-06-18 12:57:55,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=214170.0, ans=0.04949747468305833 2023-06-18 12:57:56,078 INFO [train.py:996] (0/4) Epoch 2, batch 5200, loss[loss=0.3466, simple_loss=0.4197, pruned_loss=0.1367, over 21847.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.3642, pruned_loss=0.1293, over 4297255.08 frames. ], batch size: 316, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 12:58:24,514 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-18 12:58:25,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.73 vs. limit=6.0 2023-06-18 12:58:40,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-18 12:59:01,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=214350.0, ans=0.015 2023-06-18 12:59:25,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.469e+02 3.610e+02 4.791e+02 6.505e+02 1.223e+03, threshold=9.582e+02, percent-clipped=11.0 2023-06-18 12:59:26,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=214410.0, ans=0.125 2023-06-18 12:59:32,067 INFO [train.py:996] (0/4) Epoch 2, batch 5250, loss[loss=0.2712, simple_loss=0.3458, pruned_loss=0.09833, over 21389.00 frames. ], tot_loss[loss=0.3092, simple_loss=0.3657, pruned_loss=0.1264, over 4289541.20 frames. 
], batch size: 211, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 12:59:40,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=214470.0, ans=0.04949747468305833 2023-06-18 12:59:57,503 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=22.5 2023-06-18 13:00:07,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=214530.0, ans=0.02 2023-06-18 13:00:45,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=10.0 2023-06-18 13:00:53,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=214710.0, ans=0.0 2023-06-18 13:01:12,320 INFO [train.py:996] (0/4) Epoch 2, batch 5300, loss[loss=0.3547, simple_loss=0.3877, pruned_loss=0.1609, over 21933.00 frames. ], tot_loss[loss=0.3115, simple_loss=0.3661, pruned_loss=0.1285, over 4295014.17 frames. ], batch size: 414, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:02:00,650 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:02:11,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-18 13:02:20,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=214950.0, ans=0.0 2023-06-18 13:02:29,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=215010.0, ans=0.0 2023-06-18 13:02:35,407 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 3.046e+02 3.546e+02 4.539e+02 8.571e+02, threshold=7.092e+02, percent-clipped=0.0 2023-06-18 13:02:40,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=215070.0, ans=0.0 2023-06-18 13:02:41,317 INFO [train.py:996] (0/4) Epoch 2, batch 5350, loss[loss=0.3284, simple_loss=0.375, pruned_loss=0.1409, over 21780.00 frames. ], tot_loss[loss=0.312, simple_loss=0.3652, pruned_loss=0.1294, over 4298602.90 frames. ], batch size: 112, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:02:43,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=215070.0, ans=0.0 2023-06-18 13:03:36,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=215190.0, ans=0.0 2023-06-18 13:03:58,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=215250.0, ans=0.2 2023-06-18 13:04:17,006 INFO [train.py:996] (0/4) Epoch 2, batch 5400, loss[loss=0.2534, simple_loss=0.3178, pruned_loss=0.0945, over 21531.00 frames. ], tot_loss[loss=0.3118, simple_loss=0.3634, pruned_loss=0.13, over 4299111.67 frames. 
], batch size: 212, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:04:36,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=215370.0, ans=0.1 2023-06-18 13:05:15,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=215490.0, ans=0.125 2023-06-18 13:05:17,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=215490.0, ans=0.07 2023-06-18 13:05:20,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=215490.0, ans=0.0 2023-06-18 13:05:23,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=215550.0, ans=0.1 2023-06-18 13:05:31,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=215550.0, ans=0.0 2023-06-18 13:05:58,049 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 3.179e+02 4.117e+02 5.254e+02 8.433e+02, threshold=8.234e+02, percent-clipped=2.0 2023-06-18 13:06:04,241 INFO [train.py:996] (0/4) Epoch 2, batch 5450, loss[loss=0.3281, simple_loss=0.3979, pruned_loss=0.1292, over 21854.00 frames. ], tot_loss[loss=0.3106, simple_loss=0.3658, pruned_loss=0.1277, over 4295102.37 frames. ], batch size: 371, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:06:11,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=215670.0, ans=0.125 2023-06-18 13:07:36,811 INFO [train.py:996] (0/4) Epoch 2, batch 5500, loss[loss=0.2433, simple_loss=0.3225, pruned_loss=0.08204, over 21178.00 frames. ], tot_loss[loss=0.3072, simple_loss=0.3692, pruned_loss=0.1226, over 4294219.62 frames. ], batch size: 176, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:07:46,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=215970.0, ans=0.125 2023-06-18 13:07:47,965 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-36000.pt 2023-06-18 13:07:55,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=215970.0, ans=0.0 2023-06-18 13:08:25,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.58 vs. limit=15.0 2023-06-18 13:08:49,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=216150.0, ans=0.0 2023-06-18 13:09:13,845 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 3.111e+02 3.765e+02 4.593e+02 1.085e+03, threshold=7.530e+02, percent-clipped=3.0 2023-06-18 13:09:19,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=216270.0, ans=0.07 2023-06-18 13:09:20,402 INFO [train.py:996] (0/4) Epoch 2, batch 5550, loss[loss=0.2212, simple_loss=0.301, pruned_loss=0.07074, over 21430.00 frames. ], tot_loss[loss=0.3037, simple_loss=0.3687, pruned_loss=0.1193, over 4293963.38 frames. 
], batch size: 194, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:09:41,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=216330.0, ans=0.125 2023-06-18 13:10:36,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=216450.0, ans=0.2 2023-06-18 13:11:01,306 INFO [train.py:996] (0/4) Epoch 2, batch 5600, loss[loss=0.2784, simple_loss=0.3564, pruned_loss=0.1001, over 21398.00 frames. ], tot_loss[loss=0.296, simple_loss=0.3629, pruned_loss=0.1145, over 4285918.45 frames. ], batch size: 211, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:11:08,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=216570.0, ans=0.125 2023-06-18 13:11:20,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=216630.0, ans=0.1 2023-06-18 13:11:34,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=216630.0, ans=0.2 2023-06-18 13:12:00,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=216750.0, ans=0.0 2023-06-18 13:12:15,957 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-18 13:12:19,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=216810.0, ans=0.0 2023-06-18 13:12:25,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 3.142e+02 3.780e+02 5.340e+02 1.337e+03, threshold=7.560e+02, percent-clipped=11.0 2023-06-18 13:12:35,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=216870.0, ans=0.125 2023-06-18 13:12:35,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.02 vs. limit=15.0 2023-06-18 13:12:36,223 INFO [train.py:996] (0/4) Epoch 2, batch 5650, loss[loss=0.3393, simple_loss=0.3795, pruned_loss=0.1495, over 21868.00 frames. ], tot_loss[loss=0.3017, simple_loss=0.3678, pruned_loss=0.1178, over 4282876.90 frames. ], batch size: 282, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:14:01,763 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. limit=10.0 2023-06-18 13:14:11,971 INFO [train.py:996] (0/4) Epoch 2, batch 5700, loss[loss=0.2894, simple_loss=0.3602, pruned_loss=0.1093, over 21726.00 frames. ], tot_loss[loss=0.3053, simple_loss=0.3687, pruned_loss=0.1209, over 4289115.78 frames. ], batch size: 298, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:14:48,854 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=22.5 2023-06-18 13:15:43,851 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 3.148e+02 3.823e+02 4.974e+02 1.006e+03, threshold=7.646e+02, percent-clipped=5.0 2023-06-18 13:15:49,888 INFO [train.py:996] (0/4) Epoch 2, batch 5750, loss[loss=0.2527, simple_loss=0.3315, pruned_loss=0.08691, over 21409.00 frames. 
], tot_loss[loss=0.3021, simple_loss=0.3675, pruned_loss=0.1183, over 4287888.51 frames. ], batch size: 211, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:15:53,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=217470.0, ans=0.125 2023-06-18 13:16:07,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-18 13:16:30,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=217530.0, ans=15.0 2023-06-18 13:17:41,294 INFO [train.py:996] (0/4) Epoch 2, batch 5800, loss[loss=0.3664, simple_loss=0.4396, pruned_loss=0.1466, over 21512.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3654, pruned_loss=0.1172, over 4277246.77 frames. ], batch size: 471, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:18:22,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=217890.0, ans=0.125 2023-06-18 13:18:33,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=217890.0, ans=0.125 2023-06-18 13:18:49,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=217950.0, ans=0.05 2023-06-18 13:19:13,428 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 3.013e+02 4.071e+02 4.851e+02 8.760e+02, threshold=8.142e+02, percent-clipped=2.0 2023-06-18 13:19:19,690 INFO [train.py:996] (0/4) Epoch 2, batch 5850, loss[loss=0.3005, simple_loss=0.3951, pruned_loss=0.103, over 21248.00 frames. ], tot_loss[loss=0.2922, simple_loss=0.3619, pruned_loss=0.1113, over 4279839.78 frames. ], batch size: 548, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:19:53,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=218130.0, ans=0.125 2023-06-18 13:19:55,309 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-18 13:19:57,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=218130.0, ans=0.125 2023-06-18 13:20:03,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=218190.0, ans=0.2 2023-06-18 13:20:26,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=218250.0, ans=0.125 2023-06-18 13:20:50,780 INFO [train.py:996] (0/4) Epoch 2, batch 5900, loss[loss=0.2607, simple_loss=0.3561, pruned_loss=0.08262, over 21528.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3501, pruned_loss=0.1016, over 4276004.90 frames. ], batch size: 471, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:20:54,895 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.66 vs. limit=15.0 2023-06-18 13:21:03,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.96 vs. 
limit=15.0 2023-06-18 13:21:30,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=218490.0, ans=0.125 2023-06-18 13:22:19,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 3.163e+02 4.084e+02 5.462e+02 1.507e+03, threshold=8.168e+02, percent-clipped=5.0 2023-06-18 13:22:25,398 INFO [train.py:996] (0/4) Epoch 2, batch 5950, loss[loss=0.3313, simple_loss=0.3708, pruned_loss=0.1459, over 21882.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3528, pruned_loss=0.1079, over 4282052.12 frames. ], batch size: 351, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:22:34,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.99 vs. limit=8.0 2023-06-18 13:22:36,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=218670.0, ans=0.2 2023-06-18 13:24:04,257 INFO [train.py:996] (0/4) Epoch 2, batch 6000, loss[loss=0.2756, simple_loss=0.3184, pruned_loss=0.1164, over 20174.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3512, pruned_loss=0.1138, over 4263938.74 frames. ], batch size: 702, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:24:04,258 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 13:24:15,351 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.4098, 3.1213, 2.3961, 3.3821], device='cuda:0') 2023-06-18 13:24:20,117 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2916, simple_loss=0.3878, pruned_loss=0.09771, over 1796401.00 frames. 2023-06-18 13:24:20,118 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 13:24:45,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=219030.0, ans=0.2 2023-06-18 13:24:53,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=219030.0, ans=0.2 2023-06-18 13:25:24,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-18 13:25:51,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.782e+02 3.962e+02 4.700e+02 6.169e+02 1.115e+03, threshold=9.400e+02, percent-clipped=12.0 2023-06-18 13:25:55,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=219210.0, ans=0.02 2023-06-18 13:25:57,778 INFO [train.py:996] (0/4) Epoch 2, batch 6050, loss[loss=0.2574, simple_loss=0.313, pruned_loss=0.1009, over 21652.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3476, pruned_loss=0.1158, over 4267502.03 frames. ], batch size: 298, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:26:01,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=219270.0, ans=0.125 2023-06-18 13:26:05,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=219270.0, ans=0.07 2023-06-18 13:27:19,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=219510.0, ans=0.1 2023-06-18 13:27:33,006 INFO [train.py:996] (0/4) Epoch 2, batch 6100, loss[loss=0.2757, simple_loss=0.337, pruned_loss=0.1072, over 21489.00 frames. 
], tot_loss[loss=0.2868, simple_loss=0.3457, pruned_loss=0.1139, over 4263279.67 frames. ], batch size: 548, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:28:03,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=219630.0, ans=0.125 2023-06-18 13:28:19,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=219690.0, ans=0.0 2023-06-18 13:28:35,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=219750.0, ans=0.05 2023-06-18 13:29:03,476 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.959e+02 3.646e+02 4.733e+02 1.048e+03, threshold=7.291e+02, percent-clipped=1.0 2023-06-18 13:29:07,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=219810.0, ans=0.125 2023-06-18 13:29:09,589 INFO [train.py:996] (0/4) Epoch 2, batch 6150, loss[loss=0.3826, simple_loss=0.4047, pruned_loss=0.1803, over 21420.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3509, pruned_loss=0.119, over 4266038.12 frames. ], batch size: 507, lr: 1.84e-02, grad_scale: 64.0 2023-06-18 13:29:19,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=219870.0, ans=0.125 2023-06-18 13:29:36,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-06-18 13:29:47,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=219930.0, ans=0.1 2023-06-18 13:29:50,443 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:30:03,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=219990.0, ans=15.0 2023-06-18 13:30:40,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=220110.0, ans=0.125 2023-06-18 13:30:48,842 INFO [train.py:996] (0/4) Epoch 2, batch 6200, loss[loss=0.319, simple_loss=0.376, pruned_loss=0.131, over 21855.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3537, pruned_loss=0.1198, over 4262650.25 frames. ], batch size: 316, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:31:33,823 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.66 vs. limit=15.0 2023-06-18 13:31:57,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=220350.0, ans=0.125 2023-06-18 13:32:01,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=220350.0, ans=0.0 2023-06-18 13:32:04,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=220410.0, ans=0.0 2023-06-18 13:32:20,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.171e+02 4.018e+02 5.849e+02 1.001e+03, threshold=8.035e+02, percent-clipped=11.0 2023-06-18 13:32:25,584 INFO [train.py:996] (0/4) Epoch 2, batch 6250, loss[loss=0.3854, simple_loss=0.4521, pruned_loss=0.1593, over 21448.00 frames. 
], tot_loss[loss=0.2986, simple_loss=0.3585, pruned_loss=0.1193, over 4267551.69 frames. ], batch size: 507, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:32:30,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=220470.0, ans=0.125 2023-06-18 13:32:30,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220470.0, ans=0.1 2023-06-18 13:33:03,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=220590.0, ans=0.0 2023-06-18 13:33:08,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=220590.0, ans=0.0 2023-06-18 13:33:29,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=220650.0, ans=0.125 2023-06-18 13:33:40,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220710.0, ans=0.1 2023-06-18 13:33:59,284 INFO [train.py:996] (0/4) Epoch 2, batch 6300, loss[loss=0.3239, simple_loss=0.3681, pruned_loss=0.1399, over 21743.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.362, pruned_loss=0.1184, over 4278900.39 frames. ], batch size: 112, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:34:00,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.00 vs. limit=15.0 2023-06-18 13:34:23,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=220830.0, ans=0.2 2023-06-18 13:34:51,240 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:35:25,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=221010.0, ans=0.125 2023-06-18 13:35:28,794 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-18 13:35:28,885 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. limit=6.0 2023-06-18 13:35:30,780 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.155e+02 3.817e+02 5.474e+02 1.365e+03, threshold=7.634e+02, percent-clipped=9.0 2023-06-18 13:35:35,472 INFO [train.py:996] (0/4) Epoch 2, batch 6350, loss[loss=0.3373, simple_loss=0.3864, pruned_loss=0.1442, over 21943.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3685, pruned_loss=0.1246, over 4284787.31 frames. ], batch size: 372, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:35:46,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.16 vs. 
limit=15.0 2023-06-18 13:36:09,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=221130.0, ans=0.125 2023-06-18 13:36:12,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=221130.0, ans=0.0 2023-06-18 13:36:46,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=221250.0, ans=0.125 2023-06-18 13:36:58,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=221310.0, ans=0.0 2023-06-18 13:37:17,732 INFO [train.py:996] (0/4) Epoch 2, batch 6400, loss[loss=0.2317, simple_loss=0.3683, pruned_loss=0.04757, over 20801.00 frames. ], tot_loss[loss=0.315, simple_loss=0.3736, pruned_loss=0.1282, over 4277030.88 frames. ], batch size: 607, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:38:11,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=221490.0, ans=0.0 2023-06-18 13:38:23,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=221550.0, ans=0.125 2023-06-18 13:38:43,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=221610.0, ans=0.125 2023-06-18 13:38:43,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=221610.0, ans=0.0 2023-06-18 13:38:53,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.475e+02 3.333e+02 3.952e+02 5.090e+02 9.873e+02, threshold=7.903e+02, percent-clipped=3.0 2023-06-18 13:38:55,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-06-18 13:38:58,466 INFO [train.py:996] (0/4) Epoch 2, batch 6450, loss[loss=0.2927, simple_loss=0.3685, pruned_loss=0.1085, over 21746.00 frames. ], tot_loss[loss=0.3164, simple_loss=0.3758, pruned_loss=0.1285, over 4273767.35 frames. ], batch size: 332, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:39:01,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=221670.0, ans=0.04949747468305833 2023-06-18 13:39:06,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=221670.0, ans=0.0 2023-06-18 13:40:11,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=221850.0, ans=0.125 2023-06-18 13:40:35,156 INFO [train.py:996] (0/4) Epoch 2, batch 6500, loss[loss=0.2727, simple_loss=0.3208, pruned_loss=0.1123, over 21764.00 frames. ], tot_loss[loss=0.3083, simple_loss=0.3656, pruned_loss=0.1255, over 4266272.88 frames. 
], batch size: 124, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:40:52,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=221970.0, ans=0.125 2023-06-18 13:41:02,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=222030.0, ans=0.0 2023-06-18 13:41:22,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=222090.0, ans=0.125 2023-06-18 13:41:29,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=222150.0, ans=0.04949747468305833 2023-06-18 13:41:35,124 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:41:41,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=222150.0, ans=0.015 2023-06-18 13:42:06,233 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.480e+02 3.085e+02 3.485e+02 4.361e+02 6.672e+02, threshold=6.971e+02, percent-clipped=0.0 2023-06-18 13:42:10,626 INFO [train.py:996] (0/4) Epoch 2, batch 6550, loss[loss=0.2845, simple_loss=0.307, pruned_loss=0.131, over 20986.00 frames. ], tot_loss[loss=0.3047, simple_loss=0.3608, pruned_loss=0.1243, over 4268797.72 frames. ], batch size: 613, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:42:26,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=222270.0, ans=0.0 2023-06-18 13:43:03,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. limit=10.0 2023-06-18 13:43:09,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=222450.0, ans=0.0 2023-06-18 13:43:21,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2023-06-18 13:43:30,847 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-06-18 13:43:48,134 INFO [train.py:996] (0/4) Epoch 2, batch 6600, loss[loss=0.228, simple_loss=0.2815, pruned_loss=0.08726, over 21555.00 frames. ], tot_loss[loss=0.3027, simple_loss=0.3569, pruned_loss=0.1243, over 4268482.68 frames. 
], batch size: 230, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:43:58,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=222570.0, ans=0.125 2023-06-18 13:44:25,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=222630.0, ans=0.0 2023-06-18 13:44:52,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=222750.0, ans=0.0 2023-06-18 13:45:04,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=222750.0, ans=0.2 2023-06-18 13:45:08,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=222810.0, ans=0.125 2023-06-18 13:45:13,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222810.0, ans=0.1 2023-06-18 13:45:19,126 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 3.099e+02 3.990e+02 5.465e+02 1.147e+03, threshold=7.980e+02, percent-clipped=13.0 2023-06-18 13:45:28,383 INFO [train.py:996] (0/4) Epoch 2, batch 6650, loss[loss=0.2856, simple_loss=0.3383, pruned_loss=0.1164, over 21808.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3491, pruned_loss=0.12, over 4277679.28 frames. ], batch size: 352, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:46:08,114 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=22.5 2023-06-18 13:46:49,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=223110.0, ans=0.2 2023-06-18 13:47:02,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=223110.0, ans=0.1 2023-06-18 13:47:06,264 INFO [train.py:996] (0/4) Epoch 2, batch 6700, loss[loss=0.2791, simple_loss=0.3426, pruned_loss=0.1078, over 21740.00 frames. ], tot_loss[loss=0.2936, simple_loss=0.3464, pruned_loss=0.1203, over 4280419.16 frames. ], batch size: 333, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:47:33,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=223230.0, ans=0.0 2023-06-18 13:47:49,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.79 vs. limit=15.0 2023-06-18 13:48:07,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=223350.0, ans=0.2 2023-06-18 13:48:28,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=223410.0, ans=0.0 2023-06-18 13:48:32,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.452e+02 3.773e+02 4.498e+02 5.331e+02 9.291e+02, threshold=8.996e+02, percent-clipped=2.0 2023-06-18 13:48:41,539 INFO [train.py:996] (0/4) Epoch 2, batch 6750, loss[loss=0.3711, simple_loss=0.3792, pruned_loss=0.1815, over 21531.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3474, pruned_loss=0.1222, over 4281051.76 frames. 
], batch size: 508, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:49:19,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=223590.0, ans=0.5 2023-06-18 13:49:58,738 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.81 vs. limit=12.0 2023-06-18 13:49:59,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=223710.0, ans=0.0 2023-06-18 13:50:04,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=223710.0, ans=0.125 2023-06-18 13:50:17,684 INFO [train.py:996] (0/4) Epoch 2, batch 6800, loss[loss=0.2976, simple_loss=0.341, pruned_loss=0.1271, over 21659.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.3502, pruned_loss=0.125, over 4290514.86 frames. ], batch size: 247, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:50:35,231 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:51:07,165 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.87 vs. limit=22.5 2023-06-18 13:51:23,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=223950.0, ans=0.0 2023-06-18 13:51:43,144 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 3.005e+02 3.766e+02 4.478e+02 7.220e+02, threshold=7.533e+02, percent-clipped=0.0 2023-06-18 13:51:52,457 INFO [train.py:996] (0/4) Epoch 2, batch 6850, loss[loss=0.3029, simple_loss=0.3466, pruned_loss=0.1296, over 21861.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3476, pruned_loss=0.1262, over 4295340.14 frames. ], batch size: 351, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:51:54,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=224070.0, ans=0.0 2023-06-18 13:51:59,627 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=22.5 2023-06-18 13:52:11,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=224130.0, ans=0.2 2023-06-18 13:52:14,948 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-18 13:52:23,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224190.0, ans=0.1 2023-06-18 13:52:42,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=224250.0, ans=0.95 2023-06-18 13:52:42,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=224250.0, ans=0.125 2023-06-18 13:53:20,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=224310.0, ans=0.2 2023-06-18 13:53:28,622 INFO [train.py:996] (0/4) Epoch 2, batch 6900, loss[loss=0.2522, simple_loss=0.3271, pruned_loss=0.08866, over 21770.00 frames. 
], tot_loss[loss=0.3001, simple_loss=0.3482, pruned_loss=0.126, over 4298327.45 frames. ], batch size: 247, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:53:54,820 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-18 13:54:15,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.65 vs. limit=15.0 2023-06-18 13:55:00,627 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.42 vs. limit=6.0 2023-06-18 13:55:00,773 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 3.118e+02 3.769e+02 5.122e+02 8.656e+02, threshold=7.539e+02, percent-clipped=2.0 2023-06-18 13:55:03,641 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.14 vs. limit=15.0 2023-06-18 13:55:05,559 INFO [train.py:996] (0/4) Epoch 2, batch 6950, loss[loss=0.2569, simple_loss=0.3372, pruned_loss=0.08832, over 21641.00 frames. ], tot_loss[loss=0.2967, simple_loss=0.3493, pruned_loss=0.122, over 4297806.58 frames. ], batch size: 263, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:55:10,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=224670.0, ans=0.0 2023-06-18 13:55:55,309 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-18 13:56:25,079 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:56:35,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=224910.0, ans=0.125 2023-06-18 13:56:39,916 INFO [train.py:996] (0/4) Epoch 2, batch 7000, loss[loss=0.2882, simple_loss=0.3221, pruned_loss=0.1272, over 21206.00 frames. ], tot_loss[loss=0.3028, simple_loss=0.3534, pruned_loss=0.1261, over 4288549.09 frames. ], batch size: 176, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:56:47,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=224970.0, ans=0.025 2023-06-18 13:57:09,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-18 13:57:30,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=225090.0, ans=0.125 2023-06-18 13:57:41,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=225150.0, ans=0.125 2023-06-18 13:57:46,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=225150.0, ans=0.2 2023-06-18 13:58:12,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.411e+02 3.463e+02 4.256e+02 5.508e+02 8.252e+02, threshold=8.512e+02, percent-clipped=6.0 2023-06-18 13:58:17,011 INFO [train.py:996] (0/4) Epoch 2, batch 7050, loss[loss=0.205, simple_loss=0.2592, pruned_loss=0.07546, over 16072.00 frames. 
], tot_loss[loss=0.2992, simple_loss=0.3505, pruned_loss=0.124, over 4276767.40 frames. ], batch size: 60, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:58:48,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=225330.0, ans=0.0 2023-06-18 13:59:27,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=225450.0, ans=0.0 2023-06-18 13:59:37,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=225510.0, ans=0.125 2023-06-18 13:59:53,624 INFO [train.py:996] (0/4) Epoch 2, batch 7100, loss[loss=0.273, simple_loss=0.3444, pruned_loss=0.1008, over 21791.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.356, pruned_loss=0.1258, over 4280014.93 frames. ], batch size: 282, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 14:00:18,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=225630.0, ans=0.125 2023-06-18 14:00:54,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=225690.0, ans=0.2 2023-06-18 14:01:28,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.245e+02 4.248e+02 6.112e+02 1.073e+03, threshold=8.497e+02, percent-clipped=3.0 2023-06-18 14:01:31,281 INFO [train.py:996] (0/4) Epoch 2, batch 7150, loss[loss=0.3257, simple_loss=0.3817, pruned_loss=0.1348, over 21407.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.353, pruned_loss=0.1221, over 4274561.23 frames. ], batch size: 549, lr: 1.81e-02, grad_scale: 16.0 2023-06-18 14:01:41,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=225870.0, ans=0.0 2023-06-18 14:01:51,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=225870.0, ans=15.0 2023-06-18 14:02:43,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=226050.0, ans=0.025 2023-06-18 14:03:08,081 INFO [train.py:996] (0/4) Epoch 2, batch 7200, loss[loss=0.3537, simple_loss=0.3644, pruned_loss=0.1715, over 21456.00 frames. ], tot_loss[loss=0.3067, simple_loss=0.3585, pruned_loss=0.1274, over 4273760.36 frames. ], batch size: 510, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:03:09,056 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=22.5 2023-06-18 14:04:22,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=226350.0, ans=0.0 2023-06-18 14:04:40,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.552e+02 3.455e+02 4.315e+02 5.205e+02 7.912e+02, threshold=8.629e+02, percent-clipped=0.0 2023-06-18 14:04:40,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=226410.0, ans=0.125 2023-06-18 14:04:40,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=226410.0, ans=0.125 2023-06-18 14:04:47,740 INFO [train.py:996] (0/4) Epoch 2, batch 7250, loss[loss=0.3017, simple_loss=0.3445, pruned_loss=0.1294, over 21744.00 frames. 
], tot_loss[loss=0.3033, simple_loss=0.3528, pruned_loss=0.1269, over 4272860.70 frames. ], batch size: 112, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:04:55,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=226470.0, ans=0.125 2023-06-18 14:05:50,312 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.56 vs. limit=15.0 2023-06-18 14:05:57,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=226650.0, ans=0.1 2023-06-18 14:06:28,507 INFO [train.py:996] (0/4) Epoch 2, batch 7300, loss[loss=0.2486, simple_loss=0.3016, pruned_loss=0.09777, over 21817.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3461, pruned_loss=0.1248, over 4264249.92 frames. ], batch size: 318, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:07:05,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=226830.0, ans=0.0 2023-06-18 14:07:19,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=226890.0, ans=0.1 2023-06-18 14:07:26,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=226950.0, ans=0.2 2023-06-18 14:08:03,956 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 3.035e+02 3.518e+02 4.361e+02 7.798e+02, threshold=7.035e+02, percent-clipped=0.0 2023-06-18 14:08:07,008 INFO [train.py:996] (0/4) Epoch 2, batch 7350, loss[loss=0.3406, simple_loss=0.3825, pruned_loss=0.1494, over 21400.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.3446, pruned_loss=0.1258, over 4253945.18 frames. ], batch size: 131, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:09:05,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=227250.0, ans=0.015 2023-06-18 14:09:49,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=227370.0, ans=0.125 2023-06-18 14:09:50,821 INFO [train.py:996] (0/4) Epoch 2, batch 7400, loss[loss=0.28, simple_loss=0.3617, pruned_loss=0.09918, over 21692.00 frames. ], tot_loss[loss=0.3051, simple_loss=0.3528, pruned_loss=0.1287, over 4257709.92 frames. ], batch size: 351, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:10:25,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-18 14:11:18,773 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-18 14:11:25,767 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.736e+02 3.593e+02 4.529e+02 5.644e+02 1.003e+03, threshold=9.058e+02, percent-clipped=10.0 2023-06-18 14:11:29,128 INFO [train.py:996] (0/4) Epoch 2, batch 7450, loss[loss=0.2973, simple_loss=0.333, pruned_loss=0.1308, over 21248.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3508, pruned_loss=0.1258, over 4264208.45 frames. 
], batch size: 159, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:12:08,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=227790.0, ans=0.125 2023-06-18 14:12:15,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=227790.0, ans=0.125 2023-06-18 14:12:39,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=227850.0, ans=0.125 2023-06-18 14:13:07,161 INFO [train.py:996] (0/4) Epoch 2, batch 7500, loss[loss=0.3198, simple_loss=0.3945, pruned_loss=0.1226, over 21742.00 frames. ], tot_loss[loss=0.3074, simple_loss=0.3578, pruned_loss=0.1285, over 4261405.16 frames. ], batch size: 351, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:13:34,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=228030.0, ans=0.1 2023-06-18 14:14:09,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=228150.0, ans=0.2 2023-06-18 14:14:10,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=228150.0, ans=0.1 2023-06-18 14:14:15,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=228150.0, ans=0.125 2023-06-18 14:14:17,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=228150.0, ans=0.125 2023-06-18 14:14:42,783 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 3.252e+02 3.894e+02 4.815e+02 8.018e+02, threshold=7.787e+02, percent-clipped=0.0 2023-06-18 14:14:45,740 INFO [train.py:996] (0/4) Epoch 2, batch 7550, loss[loss=0.2963, simple_loss=0.3762, pruned_loss=0.1082, over 20669.00 frames. ], tot_loss[loss=0.3106, simple_loss=0.3663, pruned_loss=0.1274, over 4269881.36 frames. ], batch size: 608, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:14:58,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=228270.0, ans=0.125 2023-06-18 14:15:49,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-18 14:15:57,371 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.27 vs. limit=10.0 2023-06-18 14:16:22,914 INFO [train.py:996] (0/4) Epoch 2, batch 7600, loss[loss=0.3419, simple_loss=0.373, pruned_loss=0.1554, over 21297.00 frames. ], tot_loss[loss=0.3088, simple_loss=0.365, pruned_loss=0.1263, over 4276242.83 frames. 
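The optim.py:471 lines report five order statistics of recent gradient norms (presumably min, 25%, median, 75%, max); note that the printed threshold tracks Clipping_scale times the middle value (e.g. 2.0 x 3.894e+02 gives the threshold=7.787e+02 above), and percent-clipped appears to be the fraction of recent steps whose norm exceeded it. A hedged sketch of that kind of median-based clipping, with the window size and function name assumed for illustration:

import torch

def clip_by_median_norm(parameters, recent_norms, clipping_scale=2.0, window=100):
    """Clip this step's gradient norm to clipping_scale * median of recent norms.
    Returns (quartiles, threshold, clipped) in the shape of the optim.py log lines."""
    grads = [p.grad for p in parameters if p.grad is not None]
    norm = torch.norm(torch.stack([g.detach().norm() for g in grads]))
    recent_norms.append(norm.item())
    del recent_norms[:-window]                         # keep a bounded history of norms
    q = sorted(recent_norms)
    quartiles = [q[int(r * (len(q) - 1))] for r in (0.0, 0.25, 0.5, 0.75, 1.0)]
    threshold = clipping_scale * quartiles[2]          # e.g. 2.0 * 3.894e+02 = 7.788e+02
    clipped = norm.item() > threshold
    if clipped:
        for g in grads:
            g.mul_(threshold / (norm + 1e-20))
    return quartiles, threshold, clipped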
], batch size: 159, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:16:25,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=228570.0, ans=0.1 2023-06-18 14:16:47,119 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:17:22,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=228750.0, ans=0.0 2023-06-18 14:17:25,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=228750.0, ans=0.125 2023-06-18 14:17:51,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 3.725e+02 4.604e+02 5.632e+02 9.928e+02, threshold=9.208e+02, percent-clipped=8.0 2023-06-18 14:17:54,594 INFO [train.py:996] (0/4) Epoch 2, batch 7650, loss[loss=0.2978, simple_loss=0.3449, pruned_loss=0.1254, over 21932.00 frames. ], tot_loss[loss=0.3107, simple_loss=0.3643, pruned_loss=0.1286, over 4283032.89 frames. ], batch size: 351, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:18:03,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=228870.0, ans=0.125 2023-06-18 14:18:20,504 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-18 14:18:24,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=228990.0, ans=0.0 2023-06-18 14:18:43,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=228990.0, ans=0.0 2023-06-18 14:19:23,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=229110.0, ans=0.125 2023-06-18 14:19:27,543 INFO [train.py:996] (0/4) Epoch 2, batch 7700, loss[loss=0.32, simple_loss=0.3659, pruned_loss=0.137, over 21642.00 frames. ], tot_loss[loss=0.3145, simple_loss=0.3661, pruned_loss=0.1315, over 4284058.54 frames. ], batch size: 263, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:19:28,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=229170.0, ans=0.0 2023-06-18 14:19:30,088 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=12.0 2023-06-18 14:20:10,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=229290.0, ans=0.2 2023-06-18 14:20:30,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=229350.0, ans=0.2 2023-06-18 14:20:47,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=229410.0, ans=0.015 2023-06-18 14:20:59,790 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 3.706e+02 4.565e+02 6.512e+02 1.080e+03, threshold=9.129e+02, percent-clipped=5.0 2023-06-18 14:21:02,957 INFO [train.py:996] (0/4) Epoch 2, batch 7750, loss[loss=0.5077, simple_loss=0.5537, pruned_loss=0.2309, over 21402.00 frames. ], tot_loss[loss=0.3186, simple_loss=0.3727, pruned_loss=0.1322, over 4284387.94 frames. 
], batch size: 507, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:21:18,568 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.00 vs. limit=15.0 2023-06-18 14:22:06,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=229590.0, ans=0.0 2023-06-18 14:22:23,892 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.52 vs. limit=15.0 2023-06-18 14:22:29,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=229710.0, ans=0.0 2023-06-18 14:22:40,254 INFO [train.py:996] (0/4) Epoch 2, batch 7800, loss[loss=0.2604, simple_loss=0.3224, pruned_loss=0.09921, over 21623.00 frames. ], tot_loss[loss=0.3199, simple_loss=0.3747, pruned_loss=0.1325, over 4276700.98 frames. ], batch size: 263, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:22:41,273 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.00 vs. limit=6.0 2023-06-18 14:22:52,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.97 vs. limit=6.0 2023-06-18 14:23:07,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229830.0, ans=0.1 2023-06-18 14:23:19,970 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-18 14:23:21,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=229830.0, ans=0.0 2023-06-18 14:23:36,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=229890.0, ans=0.125 2023-06-18 14:24:02,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=230010.0, ans=0.125 2023-06-18 14:24:13,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.566e+02 4.138e+02 5.286e+02 1.209e+03, threshold=8.275e+02, percent-clipped=5.0 2023-06-18 14:24:13,505 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:24:16,392 INFO [train.py:996] (0/4) Epoch 2, batch 7850, loss[loss=0.3274, simple_loss=0.3594, pruned_loss=0.1477, over 21845.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3664, pruned_loss=0.1304, over 4269958.02 frames. ], batch size: 373, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:24:18,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=230070.0, ans=0.125 2023-06-18 14:25:08,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=230190.0, ans=0.0 2023-06-18 14:25:23,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. 
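The scaling.py:182 lines print the current value (ans) of a ScheduledFloat at the module's batch_count, e.g. feed_forward1.out_proj.dropout_p showing ans=0.1 at batch_count=229830. A minimal sketch of a piecewise-linear schedule of that general shape; the breakpoint API and the example breakpoints are assumptions for illustration, not scaling.py's exact interface:

class ScheduledFloatSketch:
    """Piecewise-linear float schedule over batch_count, e.g. a dropout_p
    that starts high and decays as training progresses."""
    def __init__(self, *points):
        # points: (batch_count, value) pairs; kept sorted by batch_count.
        self.points = sorted(points)

    def value_at(self, batch_count):
        (x0, y0), *rest = self.points
        if batch_count <= x0:
            return y0
        for x1, y1 in rest:
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
            x0, y0 = x1, y1
        return y0  # past the last breakpoint, hold the final value

# Example shaped like the dropout_p entries above (breakpoints assumed):
dropout_p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1), (50000.0, 0.1))
print(dropout_p.value_at(229830.0))  # -> 0.1, matching ans=0.1 at this batch_count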
limit=15.0 2023-06-18 14:25:30,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=230250.0, ans=0.0 2023-06-18 14:25:36,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=230250.0, ans=0.125 2023-06-18 14:25:52,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=230310.0, ans=0.125 2023-06-18 14:25:55,479 INFO [train.py:996] (0/4) Epoch 2, batch 7900, loss[loss=0.3811, simple_loss=0.4527, pruned_loss=0.1548, over 21586.00 frames. ], tot_loss[loss=0.3106, simple_loss=0.3624, pruned_loss=0.1294, over 4257825.99 frames. ], batch size: 441, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:26:40,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=230430.0, ans=0.0 2023-06-18 14:27:27,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=230610.0, ans=0.125 2023-06-18 14:27:29,827 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 3.545e+02 4.486e+02 5.981e+02 1.155e+03, threshold=8.972e+02, percent-clipped=9.0 2023-06-18 14:27:32,846 INFO [train.py:996] (0/4) Epoch 2, batch 7950, loss[loss=0.2887, simple_loss=0.3603, pruned_loss=0.1086, over 21642.00 frames. ], tot_loss[loss=0.3143, simple_loss=0.3696, pruned_loss=0.1295, over 4257851.85 frames. ], batch size: 263, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:28:57,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=230910.0, ans=0.5 2023-06-18 14:29:26,748 INFO [train.py:996] (0/4) Epoch 2, batch 8000, loss[loss=0.3228, simple_loss=0.3745, pruned_loss=0.1355, over 21332.00 frames. ], tot_loss[loss=0.3217, simple_loss=0.3758, pruned_loss=0.1338, over 4265088.88 frames. ], batch size: 176, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:29:42,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=231030.0, ans=0.0 2023-06-18 14:30:46,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=231210.0, ans=0.0 2023-06-18 14:30:58,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.385e+02 3.226e+02 3.981e+02 5.095e+02 8.184e+02, threshold=7.963e+02, percent-clipped=0.0 2023-06-18 14:31:02,184 INFO [train.py:996] (0/4) Epoch 2, batch 8050, loss[loss=0.2966, simple_loss=0.3375, pruned_loss=0.1278, over 20259.00 frames. ], tot_loss[loss=0.3237, simple_loss=0.3797, pruned_loss=0.1338, over 4263712.49 frames. ], batch size: 702, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:31:26,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=231330.0, ans=0.0 2023-06-18 14:31:40,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=231390.0, ans=0.125 2023-06-18 14:31:52,460 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=22.5 2023-06-18 14:32:43,073 INFO [train.py:996] (0/4) Epoch 2, batch 8100, loss[loss=0.3278, simple_loss=0.3778, pruned_loss=0.1389, over 21855.00 frames. 
], tot_loss[loss=0.3212, simple_loss=0.3759, pruned_loss=0.1332, over 4271486.60 frames. ], batch size: 118, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:32:43,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=231570.0, ans=0.0 2023-06-18 14:32:44,510 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.10 vs. limit=15.0 2023-06-18 14:33:14,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=231630.0, ans=0.125 2023-06-18 14:33:18,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=231630.0, ans=0.125 2023-06-18 14:33:20,320 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-18 14:33:59,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=231750.0, ans=0.125 2023-06-18 14:34:06,126 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:34:21,412 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.452e+02 3.933e+02 5.153e+02 6.580e+02 1.761e+03, threshold=1.031e+03, percent-clipped=12.0 2023-06-18 14:34:24,594 INFO [train.py:996] (0/4) Epoch 2, batch 8150, loss[loss=0.3965, simple_loss=0.4658, pruned_loss=0.1636, over 21513.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3805, pruned_loss=0.1336, over 4268744.39 frames. ], batch size: 507, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:34:42,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=231870.0, ans=0.0 2023-06-18 14:34:52,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=231930.0, ans=0.1 2023-06-18 14:35:23,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=231990.0, ans=0.125 2023-06-18 14:35:49,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232110.0, ans=0.1 2023-06-18 14:35:56,590 INFO [train.py:996] (0/4) Epoch 2, batch 8200, loss[loss=0.3057, simple_loss=0.3349, pruned_loss=0.1383, over 21569.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3736, pruned_loss=0.1295, over 4266646.94 frames. 
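The scaling.py:962 lines compare a per-module whitening metric against a limit (e.g. metric=15.10 vs. limit=15.0 above); when the metric exceeds the limit, the module pushes activations toward a whiter channel covariance. One common metric of this kind is the ratio of the mean squared eigenvalue of the channel covariance to the squared mean eigenvalue, which equals 1.0 for perfectly white features; treating this as scaling.py's exact formula is an assumption, so the sketch below is illustrative only:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels). Returns a scalar >= 1.0 that equals 1.0
    when each group's channel covariance is proportional to the identity."""
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)
    cov = torch.matmul(x.transpose(1, 2), x) / num_frames      # (groups, d, d)
    eigs = torch.linalg.eigvalsh(cov)                           # per-group eigenvalues
    metric = (eigs.pow(2).mean(dim=1) / eigs.mean(dim=1).pow(2)).mean()
    return metric.item()

print(whitening_metric(torch.randn(1000, 256)))                      # ~1 for white features
print(whitening_metric(torch.randn(1000, 1) * torch.ones(1, 256)))   # large for rank-1 features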
], batch size: 263, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:36:21,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232230.0, ans=0.1 2023-06-18 14:36:48,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=232290.0, ans=0.125 2023-06-18 14:36:59,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=232350.0, ans=0.0 2023-06-18 14:37:14,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=232410.0, ans=0.125 2023-06-18 14:37:16,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=232410.0, ans=0.0 2023-06-18 14:37:26,544 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.508e+02 4.438e+02 6.300e+02 1.246e+03, threshold=8.875e+02, percent-clipped=2.0 2023-06-18 14:37:29,936 INFO [train.py:996] (0/4) Epoch 2, batch 8250, loss[loss=0.2681, simple_loss=0.3169, pruned_loss=0.1096, over 21996.00 frames. ], tot_loss[loss=0.3168, simple_loss=0.3732, pruned_loss=0.1302, over 4267636.03 frames. ], batch size: 103, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:38:21,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=232590.0, ans=0.125 2023-06-18 14:39:05,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=232710.0, ans=0.0 2023-06-18 14:39:08,471 INFO [train.py:996] (0/4) Epoch 2, batch 8300, loss[loss=0.3961, simple_loss=0.4374, pruned_loss=0.1773, over 21607.00 frames. ], tot_loss[loss=0.3118, simple_loss=0.3701, pruned_loss=0.1267, over 4266529.29 frames. ], batch size: 441, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:39:30,309 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.59 vs. limit=15.0 2023-06-18 14:39:52,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=232890.0, ans=0.125 2023-06-18 14:40:08,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=232950.0, ans=0.125 2023-06-18 14:40:38,171 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 3.024e+02 4.156e+02 5.477e+02 9.498e+02, threshold=8.312e+02, percent-clipped=2.0 2023-06-18 14:40:46,307 INFO [train.py:996] (0/4) Epoch 2, batch 8350, loss[loss=0.3385, simple_loss=0.3874, pruned_loss=0.1448, over 21520.00 frames. ], tot_loss[loss=0.31, simple_loss=0.3695, pruned_loss=0.1253, over 4273024.35 frames. ], batch size: 389, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:41:07,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.15 vs. 
limit=12.0 2023-06-18 14:41:17,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=233130.0, ans=0.125 2023-06-18 14:41:34,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=233190.0, ans=0.0 2023-06-18 14:41:44,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=233250.0, ans=0.0 2023-06-18 14:42:18,640 INFO [train.py:996] (0/4) Epoch 2, batch 8400, loss[loss=0.2869, simple_loss=0.3734, pruned_loss=0.1002, over 21227.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3642, pruned_loss=0.1201, over 4278273.96 frames. ], batch size: 548, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:42:37,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=233370.0, ans=0.2 2023-06-18 14:42:46,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=233430.0, ans=0.125 2023-06-18 14:42:49,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-18 14:43:08,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=233490.0, ans=15.0 2023-06-18 14:43:38,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=233610.0, ans=0.0 2023-06-18 14:43:44,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=233610.0, ans=0.0 2023-06-18 14:43:47,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 3.200e+02 3.844e+02 5.205e+02 8.692e+02, threshold=7.689e+02, percent-clipped=1.0 2023-06-18 14:43:55,610 INFO [train.py:996] (0/4) Epoch 2, batch 8450, loss[loss=0.3139, simple_loss=0.3604, pruned_loss=0.1337, over 21751.00 frames. ], tot_loss[loss=0.3018, simple_loss=0.3622, pruned_loss=0.1207, over 4282072.47 frames. ], batch size: 389, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:44:20,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=233730.0, ans=0.1 2023-06-18 14:44:32,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=233790.0, ans=0.125 2023-06-18 14:44:54,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=233850.0, ans=0.125 2023-06-18 14:44:55,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.58 vs. limit=15.0 2023-06-18 14:45:21,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=233970.0, ans=0.2 2023-06-18 14:45:22,577 INFO [train.py:996] (0/4) Epoch 2, batch 8500, loss[loss=0.3411, simple_loss=0.4418, pruned_loss=0.1202, over 20731.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3591, pruned_loss=0.1226, over 4275615.53 frames. 
], batch size: 607, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:45:57,492 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:46:08,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=234090.0, ans=0.125 2023-06-18 14:46:19,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=234150.0, ans=0.025 2023-06-18 14:46:56,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.289e+02 3.316e+02 3.972e+02 4.532e+02 9.950e+02, threshold=7.945e+02, percent-clipped=2.0 2023-06-18 14:47:05,344 INFO [train.py:996] (0/4) Epoch 2, batch 8550, loss[loss=0.2988, simple_loss=0.3605, pruned_loss=0.1186, over 21289.00 frames. ], tot_loss[loss=0.3087, simple_loss=0.3643, pruned_loss=0.1266, over 4275537.19 frames. ], batch size: 159, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:47:41,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=234390.0, ans=0.0 2023-06-18 14:48:41,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=234510.0, ans=0.07 2023-06-18 14:48:41,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=234510.0, ans=0.125 2023-06-18 14:48:43,744 INFO [train.py:996] (0/4) Epoch 2, batch 8600, loss[loss=0.4079, simple_loss=0.442, pruned_loss=0.187, over 21290.00 frames. ], tot_loss[loss=0.3143, simple_loss=0.3702, pruned_loss=0.1292, over 4279345.67 frames. ], batch size: 143, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:48:53,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=234570.0, ans=0.125 2023-06-18 14:50:17,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.454e+02 4.150e+02 5.051e+02 9.343e+02, threshold=8.300e+02, percent-clipped=1.0 2023-06-18 14:50:20,580 INFO [train.py:996] (0/4) Epoch 2, batch 8650, loss[loss=0.3372, simple_loss=0.3915, pruned_loss=0.1414, over 21462.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3757, pruned_loss=0.1287, over 4268327.31 frames. ], batch size: 211, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:50:35,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=234930.0, ans=0.0 2023-06-18 14:50:40,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=234930.0, ans=0.0 2023-06-18 14:51:55,575 INFO [train.py:996] (0/4) Epoch 2, batch 8700, loss[loss=0.2919, simple_loss=0.3371, pruned_loss=0.1234, over 21669.00 frames. ], tot_loss[loss=0.3068, simple_loss=0.3664, pruned_loss=0.1236, over 4261873.34 frames. 
], batch size: 333, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:52:17,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=235230.0, ans=0.0 2023-06-18 14:52:33,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=235290.0, ans=0.025 2023-06-18 14:52:39,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=235290.0, ans=0.2 2023-06-18 14:52:44,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=235350.0, ans=0.0 2023-06-18 14:53:29,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 3.324e+02 3.894e+02 5.284e+02 1.235e+03, threshold=7.788e+02, percent-clipped=5.0 2023-06-18 14:53:32,123 INFO [train.py:996] (0/4) Epoch 2, batch 8750, loss[loss=0.3273, simple_loss=0.3621, pruned_loss=0.1462, over 21576.00 frames. ], tot_loss[loss=0.305, simple_loss=0.3616, pruned_loss=0.1242, over 4267665.37 frames. ], batch size: 391, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:53:34,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=235470.0, ans=0.05 2023-06-18 14:54:22,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=235650.0, ans=0.2 2023-06-18 14:55:09,963 INFO [train.py:996] (0/4) Epoch 2, batch 8800, loss[loss=0.3316, simple_loss=0.4221, pruned_loss=0.1205, over 19761.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3709, pruned_loss=0.1281, over 4272220.75 frames. ], batch size: 702, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:55:11,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-18 14:55:30,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2023-06-18 14:55:37,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=235830.0, ans=0.0 2023-06-18 14:56:45,186 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.886e+02 4.865e+02 6.860e+02 1.473e+03, threshold=9.729e+02, percent-clipped=14.0 2023-06-18 14:56:48,280 INFO [train.py:996] (0/4) Epoch 2, batch 8850, loss[loss=0.3182, simple_loss=0.3711, pruned_loss=0.1326, over 21462.00 frames. ], tot_loss[loss=0.3189, simple_loss=0.3777, pruned_loss=0.1301, over 4268856.67 frames. ], batch size: 389, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:57:42,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2023-06-18 14:57:46,640 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:58:26,546 INFO [train.py:996] (0/4) Epoch 2, batch 8900, loss[loss=0.2973, simple_loss=0.3656, pruned_loss=0.1145, over 21765.00 frames. ], tot_loss[loss=0.3152, simple_loss=0.3716, pruned_loss=0.1294, over 4268641.00 frames. 
], batch size: 351, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:59:14,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=236490.0, ans=0.025 2023-06-18 14:59:44,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=236550.0, ans=0.125 2023-06-18 15:00:02,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=236610.0, ans=0.125 2023-06-18 15:00:03,007 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 3.292e+02 4.166e+02 5.426e+02 1.146e+03, threshold=8.333e+02, percent-clipped=5.0 2023-06-18 15:00:04,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2023-06-18 15:00:05,992 INFO [train.py:996] (0/4) Epoch 2, batch 8950, loss[loss=0.2653, simple_loss=0.3672, pruned_loss=0.08175, over 20810.00 frames. ], tot_loss[loss=0.3146, simple_loss=0.3715, pruned_loss=0.1288, over 4267570.94 frames. ], batch size: 608, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:00:23,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236670.0, ans=0.1 2023-06-18 15:01:05,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=236790.0, ans=0.0 2023-06-18 15:01:21,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=236850.0, ans=0.125 2023-06-18 15:01:27,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-06-18 15:01:42,180 INFO [train.py:996] (0/4) Epoch 2, batch 9000, loss[loss=0.2785, simple_loss=0.3432, pruned_loss=0.1069, over 21726.00 frames. ], tot_loss[loss=0.3122, simple_loss=0.3675, pruned_loss=0.1285, over 4266801.80 frames. ], batch size: 282, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:01:42,181 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 15:02:02,155 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2979, simple_loss=0.3967, pruned_loss=0.09958, over 1796401.00 frames. 2023-06-18 15:02:02,156 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 15:02:08,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5 2023-06-18 15:03:10,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=12.0 2023-06-18 15:03:36,053 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 3.310e+02 4.099e+02 5.036e+02 9.465e+02, threshold=8.198e+02, percent-clipped=3.0 2023-06-18 15:03:39,424 INFO [train.py:996] (0/4) Epoch 2, batch 9050, loss[loss=0.3001, simple_loss=0.3491, pruned_loss=0.1255, over 21547.00 frames. ], tot_loss[loss=0.3057, simple_loss=0.3632, pruned_loss=0.1241, over 4256693.86 frames. 
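At each validation point the log switches to "Computing validation loss", reports frame-weighted losses over the dev set (here over 1796401 frames), and then prints the peak GPU memory (24320MB above). A minimal sketch of such a loop; compute_loss and the dataloader are stand-ins, not the recipe's actual functions:

import torch

def run_validation(model, valid_dl, compute_loss, device="cuda:0"):
    """Return frame-weighted average losses over the dev set and log peak memory."""
    model.eval()
    totals, frames = {}, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            losses, num_frames = compute_loss(model, batch)  # dict of floats, frame count
            frames += num_frames
            for name, value in losses.items():
                totals[name] = totals.get(name, 0.0) + value * num_frames
    model.train()
    avg = {name: value / frames for name, value in totals.items()}
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: {avg}, over {frames} frames; max memory {max_mb}MB")
    return avg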
], batch size: 230, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:03:57,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=237270.0, ans=0.125 2023-06-18 15:04:36,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=237390.0, ans=0.125 2023-06-18 15:04:39,319 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-18 15:05:07,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.35 vs. limit=6.0 2023-06-18 15:05:22,428 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:05:23,374 INFO [train.py:996] (0/4) Epoch 2, batch 9100, loss[loss=0.342, simple_loss=0.3862, pruned_loss=0.1489, over 19912.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.368, pruned_loss=0.1274, over 4255873.65 frames. ], batch size: 703, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:05:25,886 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.78 vs. limit=15.0 2023-06-18 15:05:47,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=237630.0, ans=0.125 2023-06-18 15:05:56,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=237630.0, ans=0.125 2023-06-18 15:06:30,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=237750.0, ans=0.0 2023-06-18 15:06:38,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=237810.0, ans=0.2 2023-06-18 15:06:58,941 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 3.002e+02 3.899e+02 5.912e+02 1.285e+03, threshold=7.799e+02, percent-clipped=7.0 2023-06-18 15:07:05,468 INFO [train.py:996] (0/4) Epoch 2, batch 9150, loss[loss=0.3349, simple_loss=0.4027, pruned_loss=0.1336, over 21801.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.3732, pruned_loss=0.1248, over 4256416.54 frames. ], batch size: 351, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:07:10,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=237870.0, ans=0.125 2023-06-18 15:08:15,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=238050.0, ans=0.0 2023-06-18 15:08:43,032 INFO [train.py:996] (0/4) Epoch 2, batch 9200, loss[loss=0.4601, simple_loss=0.4793, pruned_loss=0.2205, over 21486.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.375, pruned_loss=0.1236, over 4267168.45 frames. ], batch size: 471, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:09:04,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.96 vs. 
limit=22.5 2023-06-18 15:09:27,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=238290.0, ans=0.04949747468305833 2023-06-18 15:10:08,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=238410.0, ans=0.125 2023-06-18 15:10:17,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 3.265e+02 3.893e+02 4.706e+02 1.094e+03, threshold=7.786e+02, percent-clipped=2.0 2023-06-18 15:10:18,933 INFO [train.py:996] (0/4) Epoch 2, batch 9250, loss[loss=0.302, simple_loss=0.3377, pruned_loss=0.1332, over 21625.00 frames. ], tot_loss[loss=0.3195, simple_loss=0.3794, pruned_loss=0.1298, over 4269809.74 frames. ], batch size: 298, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:10:47,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=238530.0, ans=0.1 2023-06-18 15:11:44,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=238710.0, ans=0.2 2023-06-18 15:11:52,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=238710.0, ans=0.0 2023-06-18 15:11:59,740 INFO [train.py:996] (0/4) Epoch 2, batch 9300, loss[loss=0.3244, simple_loss=0.3788, pruned_loss=0.135, over 21657.00 frames. ], tot_loss[loss=0.3171, simple_loss=0.3728, pruned_loss=0.1307, over 4251211.74 frames. ], batch size: 332, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:12:24,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=238830.0, ans=0.1 2023-06-18 15:13:15,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-18 15:13:16,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=238950.0, ans=0.1 2023-06-18 15:13:37,188 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.699e+02 3.726e+02 4.567e+02 5.377e+02 1.117e+03, threshold=9.135e+02, percent-clipped=5.0 2023-06-18 15:13:38,735 INFO [train.py:996] (0/4) Epoch 2, batch 9350, loss[loss=0.399, simple_loss=0.4381, pruned_loss=0.1799, over 21556.00 frames. ], tot_loss[loss=0.324, simple_loss=0.382, pruned_loss=0.133, over 4258179.03 frames. ], batch size: 389, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:13:53,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=239130.0, ans=0.0 2023-06-18 15:14:38,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=239190.0, ans=0.125 2023-06-18 15:14:40,865 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. 
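The grad_scale: 32.0 printed with every training entry is most likely the current loss-scaling factor used for mixed-precision (fp16) training; a stable power-of-two value means the scaler has not recently seen inf/NaN gradients. A standard torch.cuda.amp sketch of that mechanism, where the model, optimizer, and loss are placeholders rather than the recipe's objects:

import torch

model = torch.nn.Linear(80, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1.77e-02)
scaler = torch.cuda.amp.GradScaler()          # scaler.get_scale() plays the role of grad_scale

def train_step(features, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(features), targets)
    scaler.scale(loss).backward()             # backprop on the scaled loss
    scaler.step(optimizer)                    # unscale grads; skip the step on overflow
    scaler.update()                           # grow or shrink the scale factor
    return loss.item(), scaler.get_scale()

loss, grad_scale = train_step(torch.randn(8, 80).cuda(), torch.randn(8, 512).cuda())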
limit=15.0 2023-06-18 15:14:50,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=239250.0, ans=0.125 2023-06-18 15:14:57,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=239250.0, ans=0.0 2023-06-18 15:15:17,462 INFO [train.py:996] (0/4) Epoch 2, batch 9400, loss[loss=0.3031, simple_loss=0.348, pruned_loss=0.1291, over 21567.00 frames. ], tot_loss[loss=0.3266, simple_loss=0.3843, pruned_loss=0.1344, over 4252676.98 frames. ], batch size: 414, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:15:17,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=239370.0, ans=0.125 2023-06-18 15:15:38,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=239430.0, ans=0.0 2023-06-18 15:15:52,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=239430.0, ans=0.2 2023-06-18 15:16:22,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=239550.0, ans=0.125 2023-06-18 15:16:27,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=239550.0, ans=0.05 2023-06-18 15:16:33,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=239550.0, ans=0.125 2023-06-18 15:16:33,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=239550.0, ans=0.2 2023-06-18 15:16:38,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=22.5 2023-06-18 15:16:53,062 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.296e+02 4.208e+02 5.207e+02 1.060e+03, threshold=8.416e+02, percent-clipped=2.0 2023-06-18 15:16:54,644 INFO [train.py:996] (0/4) Epoch 2, batch 9450, loss[loss=0.2733, simple_loss=0.3169, pruned_loss=0.1148, over 21775.00 frames. ], tot_loss[loss=0.3196, simple_loss=0.3752, pruned_loss=0.132, over 4250131.17 frames. ], batch size: 124, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:17:11,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=239670.0, ans=0.2 2023-06-18 15:18:03,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=239850.0, ans=0.0 2023-06-18 15:18:10,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=239850.0, ans=0.2 2023-06-18 15:18:27,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=239910.0, ans=0.0 2023-06-18 15:18:29,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-18 15:18:31,547 INFO [train.py:996] (0/4) Epoch 2, batch 9500, loss[loss=0.3539, simple_loss=0.4038, pruned_loss=0.152, over 21364.00 frames. ], tot_loss[loss=0.3118, simple_loss=0.3668, pruned_loss=0.1283, over 4253353.92 frames. 
], batch size: 131, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:18:36,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=239970.0, ans=0.125 2023-06-18 15:18:38,127 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-40000.pt 2023-06-18 15:19:39,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=240150.0, ans=0.1 2023-06-18 15:20:01,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=240210.0, ans=0.125 2023-06-18 15:20:02,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.526e+02 3.477e+02 4.438e+02 5.411e+02 9.373e+02, threshold=8.876e+02, percent-clipped=3.0 2023-06-18 15:20:04,094 INFO [train.py:996] (0/4) Epoch 2, batch 9550, loss[loss=0.3199, simple_loss=0.389, pruned_loss=0.1254, over 21732.00 frames. ], tot_loss[loss=0.3173, simple_loss=0.3719, pruned_loss=0.1314, over 4256659.45 frames. ], batch size: 332, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:20:06,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=240270.0, ans=0.125 2023-06-18 15:20:12,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=240270.0, ans=0.0 2023-06-18 15:20:19,411 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:20:31,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=240330.0, ans=0.125 2023-06-18 15:20:53,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240390.0, ans=0.1 2023-06-18 15:20:58,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=240390.0, ans=0.07 2023-06-18 15:21:03,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=240390.0, ans=0.125 2023-06-18 15:21:21,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-18 15:21:40,140 INFO [train.py:996] (0/4) Epoch 2, batch 9600, loss[loss=0.3175, simple_loss=0.37, pruned_loss=0.1326, over 21737.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.372, pruned_loss=0.1318, over 4267275.59 frames. ], batch size: 389, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:22:15,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=240630.0, ans=0.0 2023-06-18 15:22:15,991 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.38 vs. 
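The checkpoint.py:75 line above saves a batch-indexed checkpoint into the experiment directory (checkpoint-40000.pt under zipformer/exp_L_small_causal). A minimal sketch of periodic, batch-indexed checkpointing; the saving interval and the set of stored fields are assumptions for illustration:

from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, scaler, batch_idx_train, save_every_n,
                          exp_dir=Path("zipformer/exp_L_small_causal")):
    """Save a checkpoint named by the training batch index every save_every_n batches."""
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    filename = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "grad_scaler": scaler.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        filename,
    )
    print(f"Saving checkpoint to {filename}")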
limit=22.5 2023-06-18 15:22:17,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=240630.0, ans=0.125 2023-06-18 15:22:23,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=240630.0, ans=0.125 2023-06-18 15:22:31,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=240690.0, ans=0.125 2023-06-18 15:22:34,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=240690.0, ans=0.07 2023-06-18 15:23:16,551 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.171e+02 3.689e+02 4.506e+02 8.293e+02, threshold=7.377e+02, percent-clipped=0.0 2023-06-18 15:23:18,130 INFO [train.py:996] (0/4) Epoch 2, batch 9650, loss[loss=0.3015, simple_loss=0.3526, pruned_loss=0.1251, over 21838.00 frames. ], tot_loss[loss=0.3173, simple_loss=0.3711, pruned_loss=0.1317, over 4275186.30 frames. ], batch size: 247, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:23:20,094 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:23:53,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=240930.0, ans=0.0 2023-06-18 15:24:04,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=240990.0, ans=0.125 2023-06-18 15:24:11,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=240990.0, ans=0.125 2023-06-18 15:24:14,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-18 15:24:56,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=241110.0, ans=0.125 2023-06-18 15:25:00,607 INFO [train.py:996] (0/4) Epoch 2, batch 9700, loss[loss=0.316, simple_loss=0.3707, pruned_loss=0.1307, over 21477.00 frames. ], tot_loss[loss=0.3197, simple_loss=0.3749, pruned_loss=0.1322, over 4278317.34 frames. ], batch size: 548, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:25:38,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=241230.0, ans=0.125 2023-06-18 15:26:02,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=241350.0, ans=0.125 2023-06-18 15:26:23,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=241410.0, ans=0.125 2023-06-18 15:26:35,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.207e+02 3.701e+02 4.556e+02 8.027e+02, threshold=7.401e+02, percent-clipped=3.0 2023-06-18 15:26:37,174 INFO [train.py:996] (0/4) Epoch 2, batch 9750, loss[loss=0.3497, simple_loss=0.3574, pruned_loss=0.171, over 21358.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.3665, pruned_loss=0.1301, over 4277634.21 frames. 
], batch size: 508, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:26:42,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=241470.0, ans=0.025 2023-06-18 15:26:57,439 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.81 vs. limit=10.0 2023-06-18 15:27:47,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-06-18 15:28:04,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=241710.0, ans=0.125 2023-06-18 15:28:08,547 INFO [train.py:996] (0/4) Epoch 2, batch 9800, loss[loss=0.3908, simple_loss=0.4129, pruned_loss=0.1844, over 21583.00 frames. ], tot_loss[loss=0.3132, simple_loss=0.3652, pruned_loss=0.1305, over 4271553.88 frames. ], batch size: 471, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:29:00,704 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=22.5 2023-06-18 15:29:43,740 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 3.313e+02 4.009e+02 5.228e+02 9.511e+02, threshold=8.018e+02, percent-clipped=4.0 2023-06-18 15:29:44,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=242070.0, ans=0.125 2023-06-18 15:29:45,214 INFO [train.py:996] (0/4) Epoch 2, batch 9850, loss[loss=0.2555, simple_loss=0.3045, pruned_loss=0.1033, over 20750.00 frames. ], tot_loss[loss=0.3112, simple_loss=0.3618, pruned_loss=0.1303, over 4276700.36 frames. ], batch size: 607, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:30:55,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=242250.0, ans=0.125 2023-06-18 15:30:57,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242250.0, ans=0.1 2023-06-18 15:31:22,237 INFO [train.py:996] (0/4) Epoch 2, batch 9900, loss[loss=0.3445, simple_loss=0.3683, pruned_loss=0.1604, over 21256.00 frames. ], tot_loss[loss=0.3106, simple_loss=0.3602, pruned_loss=0.1305, over 4275229.55 frames. ], batch size: 471, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:31:35,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=242370.0, ans=0.125 2023-06-18 15:32:11,908 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-06-18 15:32:34,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.29 vs. limit=15.0 2023-06-18 15:33:02,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 3.487e+02 4.462e+02 5.702e+02 1.060e+03, threshold=8.923e+02, percent-clipped=2.0 2023-06-18 15:33:03,944 INFO [train.py:996] (0/4) Epoch 2, batch 9950, loss[loss=0.3264, simple_loss=0.3535, pruned_loss=0.1496, over 21650.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3629, pruned_loss=0.133, over 4269471.19 frames. 
], batch size: 282, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:33:53,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=242790.0, ans=0.125 2023-06-18 15:33:56,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=242790.0, ans=0.2 2023-06-18 15:33:58,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-18 15:34:41,502 INFO [train.py:996] (0/4) Epoch 2, batch 10000, loss[loss=0.2697, simple_loss=0.3275, pruned_loss=0.1059, over 21412.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3575, pruned_loss=0.1302, over 4266243.97 frames. ], batch size: 211, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:34:56,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.11 vs. limit=6.0 2023-06-18 15:35:02,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=243030.0, ans=0.125 2023-06-18 15:35:11,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=243030.0, ans=0.0 2023-06-18 15:35:12,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=243030.0, ans=0.0 2023-06-18 15:35:15,930 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=15.0 2023-06-18 15:36:14,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.362e+02 4.103e+02 5.165e+02 9.257e+02, threshold=8.205e+02, percent-clipped=2.0 2023-06-18 15:36:16,265 INFO [train.py:996] (0/4) Epoch 2, batch 10050, loss[loss=0.2932, simple_loss=0.3472, pruned_loss=0.1196, over 21591.00 frames. ], tot_loss[loss=0.3115, simple_loss=0.3608, pruned_loss=0.1311, over 4266227.81 frames. ], batch size: 441, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:36:59,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=243390.0, ans=0.125 2023-06-18 15:37:07,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-18 15:37:26,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=243450.0, ans=0.125 2023-06-18 15:38:03,650 INFO [train.py:996] (0/4) Epoch 2, batch 10100, loss[loss=0.3817, simple_loss=0.4257, pruned_loss=0.1688, over 21478.00 frames. ], tot_loss[loss=0.3055, simple_loss=0.3562, pruned_loss=0.1274, over 4254268.46 frames. ], batch size: 471, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:38:18,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-18 15:38:43,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=243690.0, ans=0.1 2023-06-18 15:38:59,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.28 vs. 
limit=15.0 2023-06-18 15:39:39,573 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.265e+02 3.952e+02 5.116e+02 8.346e+02, threshold=7.904e+02, percent-clipped=1.0 2023-06-18 15:39:41,261 INFO [train.py:996] (0/4) Epoch 2, batch 10150, loss[loss=0.3162, simple_loss=0.3644, pruned_loss=0.134, over 21378.00 frames. ], tot_loss[loss=0.3124, simple_loss=0.3637, pruned_loss=0.1306, over 4259361.36 frames. ], batch size: 144, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:39:59,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=15.0 2023-06-18 15:40:01,252 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-18 15:40:19,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=243990.0, ans=0.0 2023-06-18 15:40:21,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=243990.0, ans=0.2 2023-06-18 15:40:33,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=244050.0, ans=0.0 2023-06-18 15:40:53,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244050.0, ans=0.1 2023-06-18 15:41:19,210 INFO [train.py:996] (0/4) Epoch 2, batch 10200, loss[loss=0.2824, simple_loss=0.3364, pruned_loss=0.1143, over 21749.00 frames. ], tot_loss[loss=0.309, simple_loss=0.3623, pruned_loss=0.1279, over 4251620.55 frames. ], batch size: 112, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:42:03,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=244290.0, ans=0.125 2023-06-18 15:42:13,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=244290.0, ans=0.125 2023-06-18 15:42:37,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=244350.0, ans=0.125 2023-06-18 15:42:52,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=244410.0, ans=0.0 2023-06-18 15:42:54,651 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.941e+02 3.489e+02 4.418e+02 6.706e+02, threshold=6.977e+02, percent-clipped=0.0 2023-06-18 15:42:56,140 INFO [train.py:996] (0/4) Epoch 2, batch 10250, loss[loss=0.2221, simple_loss=0.3091, pruned_loss=0.06753, over 21792.00 frames. ], tot_loss[loss=0.2975, simple_loss=0.3555, pruned_loss=0.1197, over 4253716.87 frames. ], batch size: 282, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:43:23,296 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:44:03,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=244650.0, ans=0.125 2023-06-18 15:44:11,673 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.32 vs. 
limit=15.0 2023-06-18 15:44:18,168 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-18 15:44:23,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=244710.0, ans=0.0 2023-06-18 15:44:34,318 INFO [train.py:996] (0/4) Epoch 2, batch 10300, loss[loss=0.2255, simple_loss=0.3022, pruned_loss=0.07438, over 21877.00 frames. ], tot_loss[loss=0.2967, simple_loss=0.3557, pruned_loss=0.1189, over 4262405.56 frames. ], batch size: 107, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:44:38,920 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.04 vs. limit=22.5 2023-06-18 15:45:34,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=244890.0, ans=0.1 2023-06-18 15:45:41,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=244950.0, ans=0.1 2023-06-18 15:46:17,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.346e+02 4.331e+02 5.577e+02 1.197e+03, threshold=8.662e+02, percent-clipped=10.0 2023-06-18 15:46:18,823 INFO [train.py:996] (0/4) Epoch 2, batch 10350, loss[loss=0.2698, simple_loss=0.3403, pruned_loss=0.09964, over 21262.00 frames. ], tot_loss[loss=0.2991, simple_loss=0.3584, pruned_loss=0.1198, over 4262583.94 frames. ], batch size: 176, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:46:33,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=245130.0, ans=0.125 2023-06-18 15:46:37,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=245130.0, ans=0.02 2023-06-18 15:47:45,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=245310.0, ans=0.2 2023-06-18 15:48:00,738 INFO [train.py:996] (0/4) Epoch 2, batch 10400, loss[loss=0.3155, simple_loss=0.3688, pruned_loss=0.1311, over 21705.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3505, pruned_loss=0.1171, over 4262984.21 frames. ], batch size: 391, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:48:45,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=245490.0, ans=0.125 2023-06-18 15:49:11,169 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.93 vs. limit=6.0 2023-06-18 15:49:36,737 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.15 vs. 
limit=22.5 2023-06-18 15:49:37,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=245610.0, ans=0.0 2023-06-18 15:49:40,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 3.447e+02 4.106e+02 4.896e+02 8.870e+02, threshold=8.213e+02, percent-clipped=2.0 2023-06-18 15:49:41,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=245670.0, ans=15.0 2023-06-18 15:49:42,081 INFO [train.py:996] (0/4) Epoch 2, batch 10450, loss[loss=0.3207, simple_loss=0.3768, pruned_loss=0.1323, over 21463.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3565, pruned_loss=0.1232, over 4263918.74 frames. ], batch size: 194, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:50:29,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=245790.0, ans=0.125 2023-06-18 15:51:18,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=245970.0, ans=0.125 2023-06-18 15:51:19,850 INFO [train.py:996] (0/4) Epoch 2, batch 10500, loss[loss=0.3592, simple_loss=0.3981, pruned_loss=0.1602, over 20685.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3551, pruned_loss=0.1219, over 4262375.03 frames. ], batch size: 607, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:51:59,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=246030.0, ans=0.2 2023-06-18 15:52:18,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-18 15:52:19,561 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.851e-01 2023-06-18 15:52:30,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=246150.0, ans=0.0 2023-06-18 15:52:54,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.201e+02 3.705e+02 4.440e+02 6.098e+02, threshold=7.409e+02, percent-clipped=0.0 2023-06-18 15:52:55,955 INFO [train.py:996] (0/4) Epoch 2, batch 10550, loss[loss=0.2861, simple_loss=0.3339, pruned_loss=0.1192, over 21883.00 frames. ], tot_loss[loss=0.2967, simple_loss=0.3498, pruned_loss=0.1218, over 4247907.07 frames. ], batch size: 373, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:53:19,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=246330.0, ans=0.125 2023-06-18 15:54:33,575 INFO [train.py:996] (0/4) Epoch 2, batch 10600, loss[loss=0.231, simple_loss=0.2925, pruned_loss=0.08475, over 15418.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3467, pruned_loss=0.1195, over 4242196.77 frames. 
], batch size: 60, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:54:37,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=246570.0, ans=0.125 2023-06-18 15:55:08,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=246630.0, ans=0.0 2023-06-18 15:55:57,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=246750.0, ans=0.05 2023-06-18 15:55:59,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=246750.0, ans=0.2 2023-06-18 15:56:08,798 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-18 15:56:22,347 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 3.234e+02 3.580e+02 4.539e+02 8.323e+02, threshold=7.159e+02, percent-clipped=4.0 2023-06-18 15:56:23,897 INFO [train.py:996] (0/4) Epoch 2, batch 10650, loss[loss=0.2193, simple_loss=0.3242, pruned_loss=0.05721, over 20796.00 frames. ], tot_loss[loss=0.2909, simple_loss=0.3478, pruned_loss=0.117, over 4244455.61 frames. ], batch size: 607, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:56:52,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=246930.0, ans=0.1 2023-06-18 15:57:20,763 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.10 vs. limit=22.5 2023-06-18 15:57:25,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=247050.0, ans=0.2 2023-06-18 15:57:38,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=247110.0, ans=0.0 2023-06-18 15:58:01,630 INFO [train.py:996] (0/4) Epoch 2, batch 10700, loss[loss=0.3257, simple_loss=0.3955, pruned_loss=0.128, over 21768.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3475, pruned_loss=0.1177, over 4251017.07 frames. ], batch size: 124, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:58:17,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=247170.0, ans=0.0 2023-06-18 15:58:35,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=247230.0, ans=0.125 2023-06-18 15:59:01,078 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:59:43,243 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.410e+02 4.130e+02 4.973e+02 8.640e+02, threshold=8.260e+02, percent-clipped=4.0 2023-06-18 15:59:44,846 INFO [train.py:996] (0/4) Epoch 2, batch 10750, loss[loss=0.3247, simple_loss=0.4091, pruned_loss=0.1201, over 21754.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.3613, pruned_loss=0.1245, over 4255301.36 frames. 
], batch size: 351, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 16:00:06,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=247530.0, ans=0.125 2023-06-18 16:00:09,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=247530.0, ans=0.0 2023-06-18 16:00:58,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=247650.0, ans=0.2 2023-06-18 16:01:30,543 INFO [train.py:996] (0/4) Epoch 2, batch 10800, loss[loss=0.3496, simple_loss=0.3961, pruned_loss=0.1515, over 21445.00 frames. ], tot_loss[loss=0.3104, simple_loss=0.3687, pruned_loss=0.1261, over 4258481.21 frames. ], batch size: 194, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 16:01:31,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=247770.0, ans=0.125 2023-06-18 16:02:05,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=247890.0, ans=0.0 2023-06-18 16:02:48,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=248010.0, ans=0.1 2023-06-18 16:02:53,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=248010.0, ans=0.2 2023-06-18 16:03:08,334 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 3.160e+02 3.815e+02 4.913e+02 8.496e+02, threshold=7.629e+02, percent-clipped=1.0 2023-06-18 16:03:08,363 INFO [train.py:996] (0/4) Epoch 2, batch 10850, loss[loss=0.3416, simple_loss=0.389, pruned_loss=0.1471, over 20744.00 frames. ], tot_loss[loss=0.3117, simple_loss=0.3695, pruned_loss=0.127, over 4256936.58 frames. ], batch size: 607, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:04:05,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=248250.0, ans=10.0 2023-06-18 16:04:12,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=248250.0, ans=0.125 2023-06-18 16:04:36,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=248310.0, ans=0.1 2023-06-18 16:04:46,956 INFO [train.py:996] (0/4) Epoch 2, batch 10900, loss[loss=0.2417, simple_loss=0.2789, pruned_loss=0.1022, over 19926.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.3629, pruned_loss=0.125, over 4252984.56 frames. ], batch size: 702, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:05:00,295 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.47 vs. 
limit=15.0 2023-06-18 16:05:07,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=248430.0, ans=0.125 2023-06-18 16:05:32,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248490.0, ans=0.1 2023-06-18 16:06:09,920 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:06:11,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=248610.0, ans=0.0 2023-06-18 16:06:18,630 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.191e+02 2.990e+02 3.670e+02 4.688e+02 1.000e+03, threshold=7.341e+02, percent-clipped=2.0 2023-06-18 16:06:18,659 INFO [train.py:996] (0/4) Epoch 2, batch 10950, loss[loss=0.2898, simple_loss=0.3471, pruned_loss=0.1163, over 21744.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3576, pruned_loss=0.122, over 4241484.52 frames. ], batch size: 124, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:07:37,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=248850.0, ans=0.2 2023-06-18 16:07:42,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=248910.0, ans=0.125 2023-06-18 16:07:46,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=248910.0, ans=0.125 2023-06-18 16:07:55,484 INFO [train.py:996] (0/4) Epoch 2, batch 11000, loss[loss=0.3364, simple_loss=0.3707, pruned_loss=0.1511, over 21750.00 frames. ], tot_loss[loss=0.3005, simple_loss=0.3554, pruned_loss=0.1228, over 4248256.26 frames. ], batch size: 508, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:07:57,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=248970.0, ans=0.125 2023-06-18 16:08:14,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=249030.0, ans=0.05 2023-06-18 16:08:31,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=249090.0, ans=0.125 2023-06-18 16:08:40,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=249090.0, ans=0.0 2023-06-18 16:09:32,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.446e+02 4.232e+02 5.447e+02 9.802e+02, threshold=8.463e+02, percent-clipped=9.0 2023-06-18 16:09:32,820 INFO [train.py:996] (0/4) Epoch 2, batch 11050, loss[loss=0.2547, simple_loss=0.2916, pruned_loss=0.1089, over 21272.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3541, pruned_loss=0.1244, over 4245390.85 frames. ], batch size: 548, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:09:33,834 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. 
limit=10.0 2023-06-18 16:09:50,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=249330.0, ans=0.2 2023-06-18 16:09:56,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=249330.0, ans=0.125 2023-06-18 16:10:09,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-18 16:11:10,290 INFO [train.py:996] (0/4) Epoch 2, batch 11100, loss[loss=0.2672, simple_loss=0.3259, pruned_loss=0.1043, over 21291.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3516, pruned_loss=0.1242, over 4244285.41 frames. ], batch size: 194, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:11:26,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=249630.0, ans=0.125 2023-06-18 16:12:10,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=249750.0, ans=0.125 2023-06-18 16:12:29,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=249750.0, ans=0.125 2023-06-18 16:12:32,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=249810.0, ans=0.1 2023-06-18 16:12:40,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=249810.0, ans=0.1 2023-06-18 16:12:47,656 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.019e+02 3.669e+02 4.475e+02 9.197e+02, threshold=7.338e+02, percent-clipped=1.0 2023-06-18 16:12:47,686 INFO [train.py:996] (0/4) Epoch 2, batch 11150, loss[loss=0.2704, simple_loss=0.3197, pruned_loss=0.1106, over 21793.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3495, pruned_loss=0.1231, over 4256627.14 frames. ], batch size: 98, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:13:12,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=249930.0, ans=0.025 2023-06-18 16:13:22,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=249990.0, ans=0.0 2023-06-18 16:13:43,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=249990.0, ans=0.2 2023-06-18 16:14:18,034 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=4.705e-01 2023-06-18 16:14:21,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=250110.0, ans=0.125 2023-06-18 16:14:23,570 INFO [train.py:996] (0/4) Epoch 2, batch 11200, loss[loss=0.2767, simple_loss=0.3146, pruned_loss=0.1194, over 21563.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3466, pruned_loss=0.122, over 4263530.37 frames. ], batch size: 263, lr: 1.73e-02, grad_scale: 32.0 2023-06-18 16:14:52,155 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. 
limit=10.0 2023-06-18 16:14:53,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=250290.0, ans=0.04949747468305833 2023-06-18 16:15:59,454 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 3.566e+02 4.194e+02 5.517e+02 1.156e+03, threshold=8.389e+02, percent-clipped=11.0 2023-06-18 16:15:59,483 INFO [train.py:996] (0/4) Epoch 2, batch 11250, loss[loss=0.2888, simple_loss=0.3513, pruned_loss=0.1132, over 21649.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.3458, pruned_loss=0.1217, over 4275237.76 frames. ], batch size: 263, lr: 1.73e-02, grad_scale: 32.0 2023-06-18 16:16:04,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250470.0, ans=0.1 2023-06-18 16:17:32,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250710.0, ans=0.1 2023-06-18 16:17:36,327 INFO [train.py:996] (0/4) Epoch 2, batch 11300, loss[loss=0.3254, simple_loss=0.3737, pruned_loss=0.1386, over 21814.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3478, pruned_loss=0.1218, over 4278638.42 frames. ], batch size: 332, lr: 1.73e-02, grad_scale: 32.0 2023-06-18 16:17:36,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=250770.0, ans=0.2 2023-06-18 16:18:11,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=250890.0, ans=0.0 2023-06-18 16:18:36,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=250950.0, ans=0.0 2023-06-18 16:19:13,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 3.183e+02 3.739e+02 4.623e+02 9.049e+02, threshold=7.478e+02, percent-clipped=1.0 2023-06-18 16:19:13,189 INFO [train.py:996] (0/4) Epoch 2, batch 11350, loss[loss=0.2548, simple_loss=0.3185, pruned_loss=0.09562, over 21510.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3467, pruned_loss=0.1199, over 4268801.95 frames. ], batch size: 212, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:19:26,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=251070.0, ans=0.0 2023-06-18 16:19:26,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=251070.0, ans=0.125 2023-06-18 16:20:32,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.98 vs. limit=22.5 2023-06-18 16:20:36,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=251310.0, ans=8.0 2023-06-18 16:20:45,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=251310.0, ans=0.2 2023-06-18 16:20:53,114 INFO [train.py:996] (0/4) Epoch 2, batch 11400, loss[loss=0.3313, simple_loss=0.3787, pruned_loss=0.142, over 21592.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3534, pruned_loss=0.1228, over 4269113.18 frames. 
], batch size: 263, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:21:21,041 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:21:30,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=251430.0, ans=0.0 2023-06-18 16:21:54,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=251490.0, ans=0.125 2023-06-18 16:22:34,136 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.522e+02 4.244e+02 5.675e+02 1.170e+03, threshold=8.488e+02, percent-clipped=5.0 2023-06-18 16:22:34,157 INFO [train.py:996] (0/4) Epoch 2, batch 11450, loss[loss=0.3578, simple_loss=0.3939, pruned_loss=0.1609, over 21417.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3551, pruned_loss=0.122, over 4270347.20 frames. ], batch size: 131, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:23:22,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251730.0, ans=0.1 2023-06-18 16:23:22,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=251730.0, ans=0.125 2023-06-18 16:23:37,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=251790.0, ans=0.125 2023-06-18 16:24:13,577 INFO [train.py:996] (0/4) Epoch 2, batch 11500, loss[loss=0.303, simple_loss=0.3645, pruned_loss=0.1208, over 21334.00 frames. ], tot_loss[loss=0.3033, simple_loss=0.3595, pruned_loss=0.1235, over 4271582.30 frames. ], batch size: 131, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:24:14,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=251970.0, ans=0.04949747468305833 2023-06-18 16:24:38,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=252030.0, ans=0.1 2023-06-18 16:24:56,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.29 vs. limit=15.0 2023-06-18 16:25:02,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=252090.0, ans=0.1 2023-06-18 16:25:52,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.222e+02 3.283e+02 4.041e+02 4.776e+02 1.091e+03, threshold=8.082e+02, percent-clipped=3.0 2023-06-18 16:25:52,877 INFO [train.py:996] (0/4) Epoch 2, batch 11550, loss[loss=0.285, simple_loss=0.3653, pruned_loss=0.1023, over 21619.00 frames. ], tot_loss[loss=0.3068, simple_loss=0.3658, pruned_loss=0.1238, over 4275616.88 frames. ], batch size: 230, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:26:17,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=252270.0, ans=0.125 2023-06-18 16:26:57,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=252450.0, ans=0.0 2023-06-18 16:27:48,290 INFO [train.py:996] (0/4) Epoch 2, batch 11600, loss[loss=0.2828, simple_loss=0.359, pruned_loss=0.1033, over 21822.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.3775, pruned_loss=0.1253, over 4277259.47 frames. 
], batch size: 124, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:27:55,491 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-18 16:28:35,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=252750.0, ans=0.125 2023-06-18 16:29:14,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=252810.0, ans=0.125 2023-06-18 16:29:25,172 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.182e+02 3.666e+02 4.957e+02 6.331e+02 1.126e+03, threshold=9.914e+02, percent-clipped=8.0 2023-06-18 16:29:25,193 INFO [train.py:996] (0/4) Epoch 2, batch 11650, loss[loss=0.299, simple_loss=0.3891, pruned_loss=0.1045, over 21458.00 frames. ], tot_loss[loss=0.316, simple_loss=0.3827, pruned_loss=0.1246, over 4278048.62 frames. ], batch size: 194, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:29:34,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2023-06-18 16:29:48,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=252930.0, ans=0.125 2023-06-18 16:29:50,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. limit=6.0 2023-06-18 16:30:30,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=253110.0, ans=0.1 2023-06-18 16:31:02,398 INFO [train.py:996] (0/4) Epoch 2, batch 11700, loss[loss=0.2904, simple_loss=0.3318, pruned_loss=0.1245, over 21825.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3761, pruned_loss=0.1263, over 4286293.67 frames. ], batch size: 372, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:31:20,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-18 16:31:24,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=253230.0, ans=0.2 2023-06-18 16:31:43,560 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.03 vs. limit=12.0 2023-06-18 16:31:44,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=253290.0, ans=0.2 2023-06-18 16:31:56,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=253350.0, ans=0.95 2023-06-18 16:32:38,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.450e+02 3.237e+02 3.960e+02 5.340e+02 1.578e+03, threshold=7.920e+02, percent-clipped=3.0 2023-06-18 16:32:38,138 INFO [train.py:996] (0/4) Epoch 2, batch 11750, loss[loss=0.2667, simple_loss=0.3087, pruned_loss=0.1123, over 21253.00 frames. ], tot_loss[loss=0.3083, simple_loss=0.3659, pruned_loss=0.1253, over 4277701.35 frames. 
], batch size: 159, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:33:03,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=253530.0, ans=0.125 2023-06-18 16:33:22,231 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.28 vs. limit=6.0 2023-06-18 16:33:32,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-18 16:33:45,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.78 vs. limit=10.0 2023-06-18 16:33:48,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=253650.0, ans=0.1 2023-06-18 16:34:03,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=253710.0, ans=0.2 2023-06-18 16:34:11,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=253710.0, ans=0.0 2023-06-18 16:34:17,519 INFO [train.py:996] (0/4) Epoch 2, batch 11800, loss[loss=0.3512, simple_loss=0.3969, pruned_loss=0.1527, over 21432.00 frames. ], tot_loss[loss=0.3134, simple_loss=0.3692, pruned_loss=0.1289, over 4280990.03 frames. ], batch size: 131, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:34:19,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=253770.0, ans=0.1 2023-06-18 16:34:27,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=253770.0, ans=10.0 2023-06-18 16:34:42,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=253830.0, ans=0.125 2023-06-18 16:34:51,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.03 vs. limit=10.0 2023-06-18 16:34:59,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=253890.0, ans=0.1 2023-06-18 16:34:59,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=253890.0, ans=0.1 2023-06-18 16:35:57,248 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.227e+02 4.040e+02 5.078e+02 7.033e+02, threshold=8.080e+02, percent-clipped=0.0 2023-06-18 16:35:57,277 INFO [train.py:996] (0/4) Epoch 2, batch 11850, loss[loss=0.2936, simple_loss=0.358, pruned_loss=0.1146, over 21893.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3724, pruned_loss=0.1274, over 4278875.96 frames. ], batch size: 316, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:36:07,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=254070.0, ans=0.0 2023-06-18 16:36:09,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=254070.0, ans=0.0 2023-06-18 16:36:11,080 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=14.24 vs. 
limit=12.0 2023-06-18 16:36:16,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=254130.0, ans=0.125 2023-06-18 16:36:48,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=254250.0, ans=0.05 2023-06-18 16:37:35,375 INFO [train.py:996] (0/4) Epoch 2, batch 11900, loss[loss=0.2741, simple_loss=0.3341, pruned_loss=0.107, over 21664.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.371, pruned_loss=0.1246, over 4273053.94 frames. ], batch size: 247, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:37:46,427 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.09 vs. limit=15.0 2023-06-18 16:37:57,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-06-18 16:38:13,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=254490.0, ans=0.0 2023-06-18 16:38:23,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=254490.0, ans=0.5 2023-06-18 16:38:55,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=254550.0, ans=0.125 2023-06-18 16:39:13,284 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 3.213e+02 3.818e+02 4.903e+02 8.116e+02, threshold=7.635e+02, percent-clipped=1.0 2023-06-18 16:39:13,314 INFO [train.py:996] (0/4) Epoch 2, batch 11950, loss[loss=0.2433, simple_loss=0.3075, pruned_loss=0.08957, over 21257.00 frames. ], tot_loss[loss=0.3069, simple_loss=0.3713, pruned_loss=0.1212, over 4270363.06 frames. ], batch size: 176, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:39:35,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=254730.0, ans=0.2 2023-06-18 16:40:03,017 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-06-18 16:40:14,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=254790.0, ans=0.125 2023-06-18 16:40:22,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=254850.0, ans=0.125 2023-06-18 16:40:39,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=254910.0, ans=0.125 2023-06-18 16:40:49,307 INFO [train.py:996] (0/4) Epoch 2, batch 12000, loss[loss=0.2398, simple_loss=0.3132, pruned_loss=0.08318, over 21570.00 frames. ], tot_loss[loss=0.3019, simple_loss=0.3675, pruned_loss=0.1182, over 4268411.41 frames. ], batch size: 230, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:40:49,308 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 16:40:59,333 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.2388, 2.8706, 1.3533, 1.4620], device='cuda:0') 2023-06-18 16:41:05,153 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2926, simple_loss=0.3848, pruned_loss=0.1002, over 1796401.00 frames. 
2023-06-18 16:41:05,154 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 16:41:48,204 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:42:41,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=255270.0, ans=0.0 2023-06-18 16:42:42,377 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 3.604e+02 5.059e+02 6.079e+02 1.381e+03, threshold=1.012e+03, percent-clipped=10.0 2023-06-18 16:42:42,398 INFO [train.py:996] (0/4) Epoch 2, batch 12050, loss[loss=0.3079, simple_loss=0.3552, pruned_loss=0.1303, over 21167.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.3639, pruned_loss=0.1211, over 4264589.38 frames. ], batch size: 159, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:42:45,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.13 vs. limit=22.5 2023-06-18 16:42:46,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=255270.0, ans=0.125 2023-06-18 16:43:04,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=255330.0, ans=0.0 2023-06-18 16:43:21,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=255330.0, ans=0.125 2023-06-18 16:43:25,955 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.591e-02 2023-06-18 16:43:33,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=255390.0, ans=0.125 2023-06-18 16:43:38,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=255390.0, ans=0.125 2023-06-18 16:43:47,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=255450.0, ans=0.125 2023-06-18 16:43:55,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=255450.0, ans=0.0 2023-06-18 16:44:00,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=255510.0, ans=0.07 2023-06-18 16:44:05,544 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0 2023-06-18 16:44:15,722 INFO [train.py:996] (0/4) Epoch 2, batch 12100, loss[loss=0.3346, simple_loss=0.3858, pruned_loss=0.1417, over 21370.00 frames. ], tot_loss[loss=0.3094, simple_loss=0.3682, pruned_loss=0.1253, over 4263421.92 frames. ], batch size: 159, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:44:41,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=255630.0, ans=0.125 2023-06-18 16:44:51,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=255630.0, ans=0.0 2023-06-18 16:45:10,854 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.36 vs. 
limit=15.0 2023-06-18 16:46:01,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.581e+02 3.947e+02 4.820e+02 5.748e+02 9.180e+02, threshold=9.640e+02, percent-clipped=0.0 2023-06-18 16:46:01,058 INFO [train.py:996] (0/4) Epoch 2, batch 12150, loss[loss=0.3013, simple_loss=0.3532, pruned_loss=0.1247, over 20781.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.3714, pruned_loss=0.1257, over 4261726.10 frames. ], batch size: 611, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:46:06,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=255870.0, ans=0.125 2023-06-18 16:46:56,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=256050.0, ans=0.1 2023-06-18 16:47:00,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=256050.0, ans=0.125 2023-06-18 16:47:47,406 INFO [train.py:996] (0/4) Epoch 2, batch 12200, loss[loss=0.3053, simple_loss=0.3922, pruned_loss=0.1092, over 21192.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.3675, pruned_loss=0.1246, over 4259263.30 frames. ], batch size: 548, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:47:59,494 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.72 vs. limit=15.0 2023-06-18 16:48:05,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256230.0, ans=0.1 2023-06-18 16:48:21,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=256290.0, ans=0.125 2023-06-18 16:49:24,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.962e+02 3.747e+02 4.961e+02 1.098e+03, threshold=7.494e+02, percent-clipped=1.0 2023-06-18 16:49:24,928 INFO [train.py:996] (0/4) Epoch 2, batch 12250, loss[loss=0.2311, simple_loss=0.2954, pruned_loss=0.08333, over 21755.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3581, pruned_loss=0.1203, over 4255462.39 frames. ], batch size: 124, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:49:42,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256530.0, ans=0.1 2023-06-18 16:49:48,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=256530.0, ans=0.125 2023-06-18 16:49:48,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=256530.0, ans=0.2 2023-06-18 16:50:00,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=256590.0, ans=0.1 2023-06-18 16:50:05,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=256590.0, ans=0.125 2023-06-18 16:50:09,443 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.59 vs. 
limit=5.0 2023-06-18 16:50:37,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=256710.0, ans=0.125 2023-06-18 16:51:02,002 INFO [train.py:996] (0/4) Epoch 2, batch 12300, loss[loss=0.2261, simple_loss=0.2869, pruned_loss=0.08261, over 21163.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3458, pruned_loss=0.1104, over 4255195.14 frames. ], batch size: 143, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:51:39,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=256890.0, ans=0.0 2023-06-18 16:51:50,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=256950.0, ans=0.035 2023-06-18 16:51:53,491 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-06-18 16:52:15,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=257010.0, ans=0.0 2023-06-18 16:52:30,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-18 16:52:37,997 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.926e+02 3.592e+02 4.510e+02 1.066e+03, threshold=7.183e+02, percent-clipped=4.0 2023-06-18 16:52:38,026 INFO [train.py:996] (0/4) Epoch 2, batch 12350, loss[loss=0.2852, simple_loss=0.3481, pruned_loss=0.1111, over 21616.00 frames. ], tot_loss[loss=0.2889, simple_loss=0.3515, pruned_loss=0.1131, over 4258716.55 frames. ], batch size: 263, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:52:46,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=257070.0, ans=0.1 2023-06-18 16:52:47,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=257070.0, ans=0.0 2023-06-18 16:52:59,328 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-18 16:53:23,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=257250.0, ans=0.0 2023-06-18 16:53:33,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257250.0, ans=0.1 2023-06-18 16:53:42,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=257310.0, ans=0.125 2023-06-18 16:54:00,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=257310.0, ans=0.125 2023-06-18 16:54:07,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.92 vs. limit=10.0 2023-06-18 16:54:09,378 INFO [train.py:996] (0/4) Epoch 2, batch 12400, loss[loss=0.388, simple_loss=0.4121, pruned_loss=0.182, over 21797.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3559, pruned_loss=0.1186, over 4267538.76 frames. 
], batch size: 441, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:54:30,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=257430.0, ans=0.2 2023-06-18 16:54:30,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=257430.0, ans=0.125 2023-06-18 16:54:45,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-18 16:54:55,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=257550.0, ans=0.0 2023-06-18 16:55:31,704 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.79 vs. limit=15.0 2023-06-18 16:55:43,214 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 3.570e+02 4.078e+02 4.958e+02 7.763e+02, threshold=8.156e+02, percent-clipped=1.0 2023-06-18 16:55:43,244 INFO [train.py:996] (0/4) Epoch 2, batch 12450, loss[loss=0.3865, simple_loss=0.427, pruned_loss=0.173, over 21411.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3613, pruned_loss=0.124, over 4271394.36 frames. ], batch size: 471, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:56:09,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=257730.0, ans=0.125 2023-06-18 16:56:23,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=257790.0, ans=0.0 2023-06-18 16:56:29,942 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.78 vs. limit=10.0 2023-06-18 16:57:10,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=257910.0, ans=0.035 2023-06-18 16:57:16,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=257910.0, ans=0.0 2023-06-18 16:57:18,776 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.62 vs. limit=10.0 2023-06-18 16:57:20,743 INFO [train.py:996] (0/4) Epoch 2, batch 12500, loss[loss=0.3417, simple_loss=0.4135, pruned_loss=0.135, over 21275.00 frames. ], tot_loss[loss=0.3148, simple_loss=0.3724, pruned_loss=0.1286, over 4275904.75 frames. ], batch size: 176, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:58:04,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=258090.0, ans=0.04949747468305833 2023-06-18 16:58:58,483 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 3.350e+02 3.939e+02 4.917e+02 9.519e+02, threshold=7.878e+02, percent-clipped=2.0 2023-06-18 16:58:58,504 INFO [train.py:996] (0/4) Epoch 2, batch 12550, loss[loss=0.3036, simple_loss=0.4154, pruned_loss=0.09594, over 20801.00 frames. ], tot_loss[loss=0.3219, simple_loss=0.3796, pruned_loss=0.132, over 4278063.21 frames. 
], batch size: 607, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:00:05,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=258450.0, ans=0.125 2023-06-18 17:00:06,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=258450.0, ans=15.0 2023-06-18 17:00:07,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=258450.0, ans=0.125 2023-06-18 17:00:18,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=258450.0, ans=0.125 2023-06-18 17:00:36,889 INFO [train.py:996] (0/4) Epoch 2, batch 12600, loss[loss=0.2062, simple_loss=0.2823, pruned_loss=0.06508, over 21278.00 frames. ], tot_loss[loss=0.3156, simple_loss=0.3756, pruned_loss=0.1278, over 4279675.37 frames. ], batch size: 176, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:00:43,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=258570.0, ans=0.0 2023-06-18 17:01:16,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=258630.0, ans=0.1 2023-06-18 17:01:47,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258750.0, ans=0.1 2023-06-18 17:02:12,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.876e+02 3.494e+02 4.502e+02 7.452e+02, threshold=6.987e+02, percent-clipped=0.0 2023-06-18 17:02:12,938 INFO [train.py:996] (0/4) Epoch 2, batch 12650, loss[loss=0.2953, simple_loss=0.3515, pruned_loss=0.1195, over 21494.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.3662, pruned_loss=0.1221, over 4282713.99 frames. ], batch size: 131, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:02:59,393 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=22.5 2023-06-18 17:03:22,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-18 17:03:49,459 INFO [train.py:996] (0/4) Epoch 2, batch 12700, loss[loss=0.3926, simple_loss=0.4097, pruned_loss=0.1877, over 21573.00 frames. ], tot_loss[loss=0.3097, simple_loss=0.3673, pruned_loss=0.1261, over 4287947.76 frames. ], batch size: 507, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:03:51,494 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:04:04,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=259170.0, ans=0.0 2023-06-18 17:04:34,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.47 vs. limit=15.0 2023-06-18 17:04:38,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.75 vs. 
limit=22.5 2023-06-18 17:04:58,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=259350.0, ans=0.2 2023-06-18 17:05:25,060 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.415e+02 4.071e+02 4.986e+02 8.988e+02, threshold=8.142e+02, percent-clipped=6.0 2023-06-18 17:05:25,082 INFO [train.py:996] (0/4) Epoch 2, batch 12750, loss[loss=0.2827, simple_loss=0.3616, pruned_loss=0.1019, over 21696.00 frames. ], tot_loss[loss=0.3093, simple_loss=0.3675, pruned_loss=0.1256, over 4282117.67 frames. ], batch size: 298, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:05:40,198 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-18 17:05:42,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=259470.0, ans=0.025 2023-06-18 17:05:53,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=259530.0, ans=0.0 2023-06-18 17:06:08,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=259530.0, ans=0.125 2023-06-18 17:06:09,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=259530.0, ans=0.0 2023-06-18 17:06:20,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.88 vs. limit=10.0 2023-06-18 17:06:29,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=259650.0, ans=0.125 2023-06-18 17:07:08,582 INFO [train.py:996] (0/4) Epoch 2, batch 12800, loss[loss=0.299, simple_loss=0.3594, pruned_loss=0.1193, over 21652.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.3672, pruned_loss=0.1265, over 4283249.78 frames. ], batch size: 263, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:07:12,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=259770.0, ans=0.125 2023-06-18 17:07:13,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=259770.0, ans=0.0 2023-06-18 17:07:24,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=259770.0, ans=0.2 2023-06-18 17:07:34,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=259830.0, ans=0.2 2023-06-18 17:08:08,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-18 17:08:46,752 INFO [train.py:996] (0/4) Epoch 2, batch 12850, loss[loss=0.3069, simple_loss=0.3783, pruned_loss=0.1177, over 21768.00 frames. ], tot_loss[loss=0.3152, simple_loss=0.3711, pruned_loss=0.1297, over 4285437.17 frames. 
], batch size: 351, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:08:48,204 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.193e+02 3.179e+02 3.774e+02 4.668e+02 7.829e+02, threshold=7.547e+02, percent-clipped=0.0 2023-06-18 17:09:08,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=15.0 2023-06-18 17:10:15,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=260310.0, ans=0.5 2023-06-18 17:10:37,525 INFO [train.py:996] (0/4) Epoch 2, batch 12900, loss[loss=0.2174, simple_loss=0.2889, pruned_loss=0.073, over 21083.00 frames. ], tot_loss[loss=0.3094, simple_loss=0.3688, pruned_loss=0.1251, over 4279635.15 frames. ], batch size: 143, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:10:38,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=260370.0, ans=0.125 2023-06-18 17:11:23,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=260490.0, ans=0.0 2023-06-18 17:12:14,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=260670.0, ans=0.125 2023-06-18 17:12:15,684 INFO [train.py:996] (0/4) Epoch 2, batch 12950, loss[loss=0.3186, simple_loss=0.3699, pruned_loss=0.1337, over 21809.00 frames. ], tot_loss[loss=0.3064, simple_loss=0.3679, pruned_loss=0.1225, over 4280644.21 frames. ], batch size: 282, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:12:16,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=260670.0, ans=0.0 2023-06-18 17:12:17,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 3.080e+02 3.556e+02 4.378e+02 7.837e+02, threshold=7.111e+02, percent-clipped=1.0 2023-06-18 17:12:39,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=260730.0, ans=0.125 2023-06-18 17:13:11,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.23 vs. limit=15.0 2023-06-18 17:13:13,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=260850.0, ans=0.1 2023-06-18 17:13:16,862 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:13:51,822 INFO [train.py:996] (0/4) Epoch 2, batch 13000, loss[loss=0.2782, simple_loss=0.3502, pruned_loss=0.1031, over 21751.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3706, pruned_loss=0.1236, over 4273001.11 frames. ], batch size: 391, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:14:03,540 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-18 17:14:03,632 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.50 vs. 
limit=22.5 2023-06-18 17:14:12,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=261030.0, ans=0.125 2023-06-18 17:14:21,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=261090.0, ans=0.035 2023-06-18 17:14:48,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=261150.0, ans=0.2 2023-06-18 17:14:55,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=261150.0, ans=0.0 2023-06-18 17:15:27,591 INFO [train.py:996] (0/4) Epoch 2, batch 13050, loss[loss=0.3319, simple_loss=0.3809, pruned_loss=0.1415, over 21847.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.3637, pruned_loss=0.1202, over 4279761.17 frames. ], batch size: 107, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:15:29,092 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.088e+02 4.287e+02 5.215e+02 1.044e+03, threshold=8.575e+02, percent-clipped=6.0 2023-06-18 17:15:34,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=261270.0, ans=0.1 2023-06-18 17:15:52,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=261330.0, ans=0.0 2023-06-18 17:16:21,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=261390.0, ans=0.1 2023-06-18 17:16:24,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=261450.0, ans=0.125 2023-06-18 17:16:34,487 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-18 17:17:04,249 INFO [train.py:996] (0/4) Epoch 2, batch 13100, loss[loss=0.2795, simple_loss=0.3522, pruned_loss=0.1034, over 21776.00 frames. ], tot_loss[loss=0.3018, simple_loss=0.3632, pruned_loss=0.1202, over 4278346.88 frames. ], batch size: 298, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:17:41,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=261630.0, ans=0.0 2023-06-18 17:18:17,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=261750.0, ans=0.2 2023-06-18 17:18:17,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=261750.0, ans=0.0 2023-06-18 17:18:24,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=261750.0, ans=0.125 2023-06-18 17:18:30,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=15.0 2023-06-18 17:18:44,175 INFO [train.py:996] (0/4) Epoch 2, batch 13150, loss[loss=0.2571, simple_loss=0.3179, pruned_loss=0.09812, over 21568.00 frames. ], tot_loss[loss=0.307, simple_loss=0.3655, pruned_loss=0.1243, over 4277074.84 frames. 
], batch size: 230, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:18:45,943 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.361e+02 3.617e+02 4.529e+02 5.724e+02 9.376e+02, threshold=9.058e+02, percent-clipped=0.0 2023-06-18 17:19:04,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=261930.0, ans=0.125 2023-06-18 17:19:17,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=22.5 2023-06-18 17:19:36,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=261990.0, ans=0.0 2023-06-18 17:19:39,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=261990.0, ans=0.125 2023-06-18 17:19:58,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=262050.0, ans=0.125 2023-06-18 17:20:18,156 INFO [train.py:996] (0/4) Epoch 2, batch 13200, loss[loss=0.3148, simple_loss=0.3712, pruned_loss=0.1292, over 21491.00 frames. ], tot_loss[loss=0.3073, simple_loss=0.3653, pruned_loss=0.1247, over 4280855.62 frames. ], batch size: 131, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:21:23,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.76 vs. limit=6.0 2023-06-18 17:21:59,606 INFO [train.py:996] (0/4) Epoch 2, batch 13250, loss[loss=0.3132, simple_loss=0.3565, pruned_loss=0.1349, over 21281.00 frames. ], tot_loss[loss=0.3082, simple_loss=0.365, pruned_loss=0.1257, over 4274979.87 frames. ], batch size: 143, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:22:06,469 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.187e+02 3.809e+02 4.597e+02 7.682e+02, threshold=7.618e+02, percent-clipped=1.0 2023-06-18 17:22:46,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=262590.0, ans=0.2 2023-06-18 17:23:01,806 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-18 17:23:09,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=262650.0, ans=0.0 2023-06-18 17:23:43,660 INFO [train.py:996] (0/4) Epoch 2, batch 13300, loss[loss=0.3365, simple_loss=0.3863, pruned_loss=0.1433, over 21539.00 frames. ], tot_loss[loss=0.3095, simple_loss=0.3679, pruned_loss=0.1256, over 4280040.60 frames. ], batch size: 194, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:24:19,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5 2023-06-18 17:24:23,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=262890.0, ans=0.05 2023-06-18 17:24:37,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.31 vs. 
limit=15.0 2023-06-18 17:25:25,055 INFO [train.py:996] (0/4) Epoch 2, batch 13350, loss[loss=0.3323, simple_loss=0.3908, pruned_loss=0.137, over 21595.00 frames. ], tot_loss[loss=0.3175, simple_loss=0.3745, pruned_loss=0.1303, over 4285205.69 frames. ], batch size: 389, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:25:26,619 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.186e+02 3.887e+02 4.954e+02 1.112e+03, threshold=7.774e+02, percent-clipped=6.0 2023-06-18 17:25:39,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=263070.0, ans=0.125 2023-06-18 17:25:48,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=263130.0, ans=0.125 2023-06-18 17:26:27,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=263250.0, ans=0.025 2023-06-18 17:27:08,164 INFO [train.py:996] (0/4) Epoch 2, batch 13400, loss[loss=0.3257, simple_loss=0.3734, pruned_loss=0.139, over 21722.00 frames. ], tot_loss[loss=0.322, simple_loss=0.3772, pruned_loss=0.1334, over 4284775.01 frames. ], batch size: 112, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:27:10,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2023-06-18 17:27:32,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=263430.0, ans=0.0 2023-06-18 17:27:36,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=263430.0, ans=0.0 2023-06-18 17:27:39,389 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.46 vs. limit=15.0 2023-06-18 17:28:10,307 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=22.5 2023-06-18 17:28:20,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=263610.0, ans=0.025 2023-06-18 17:28:45,371 INFO [train.py:996] (0/4) Epoch 2, batch 13450, loss[loss=0.2945, simple_loss=0.3428, pruned_loss=0.1231, over 21760.00 frames. ], tot_loss[loss=0.3265, simple_loss=0.3796, pruned_loss=0.1367, over 4275514.45 frames. 
], batch size: 282, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:28:46,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.450e+02 3.497e+02 3.954e+02 4.827e+02 1.042e+03, threshold=7.908e+02, percent-clipped=7.0 2023-06-18 17:28:55,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=263670.0, ans=0.2 2023-06-18 17:29:10,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=263730.0, ans=0.125 2023-06-18 17:29:32,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=263790.0, ans=0.09899494936611666 2023-06-18 17:29:47,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=263850.0, ans=0.1 2023-06-18 17:29:53,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=263850.0, ans=0.2 2023-06-18 17:30:12,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=263910.0, ans=0.125 2023-06-18 17:30:23,558 INFO [train.py:996] (0/4) Epoch 2, batch 13500, loss[loss=0.3124, simple_loss=0.3662, pruned_loss=0.1294, over 21720.00 frames. ], tot_loss[loss=0.3164, simple_loss=0.3707, pruned_loss=0.131, over 4273841.48 frames. ], batch size: 298, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:30:29,823 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-44000.pt 2023-06-18 17:30:36,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=263970.0, ans=0.025 2023-06-18 17:30:38,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0 2023-06-18 17:30:55,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=264030.0, ans=0.1 2023-06-18 17:31:12,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=264090.0, ans=0.125 2023-06-18 17:31:51,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=264210.0, ans=0.04949747468305833 2023-06-18 17:32:03,919 INFO [train.py:996] (0/4) Epoch 2, batch 13550, loss[loss=0.3507, simple_loss=0.4267, pruned_loss=0.1373, over 21854.00 frames. ], tot_loss[loss=0.3152, simple_loss=0.3724, pruned_loss=0.129, over 4265964.57 frames. 
], batch size: 371, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:32:05,765 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.329e+02 4.198e+02 5.480e+02 1.124e+03, threshold=8.396e+02, percent-clipped=8.0 2023-06-18 17:32:15,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=264270.0, ans=0.125 2023-06-18 17:32:16,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=264270.0, ans=0.0 2023-06-18 17:32:18,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=264330.0, ans=0.125 2023-06-18 17:32:21,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=264330.0, ans=0.125 2023-06-18 17:32:27,421 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:33:04,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=264450.0, ans=0.125 2023-06-18 17:33:42,107 INFO [train.py:996] (0/4) Epoch 2, batch 13600, loss[loss=0.3288, simple_loss=0.3759, pruned_loss=0.1409, over 21883.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3732, pruned_loss=0.1298, over 4275373.04 frames. ], batch size: 414, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:33:50,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=264570.0, ans=0.0 2023-06-18 17:34:05,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=264630.0, ans=0.125 2023-06-18 17:34:15,863 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-18 17:34:27,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264690.0, ans=0.1 2023-06-18 17:34:49,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=264750.0, ans=0.125 2023-06-18 17:35:19,303 INFO [train.py:996] (0/4) Epoch 2, batch 13650, loss[loss=0.2905, simple_loss=0.3269, pruned_loss=0.127, over 21258.00 frames. ], tot_loss[loss=0.3091, simple_loss=0.3675, pruned_loss=0.1253, over 4271025.26 frames. ], batch size: 548, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:35:20,645 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 3.001e+02 3.620e+02 4.450e+02 8.511e+02, threshold=7.240e+02, percent-clipped=1.0 2023-06-18 17:36:17,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264990.0, ans=0.1 2023-06-18 17:36:59,387 INFO [train.py:996] (0/4) Epoch 2, batch 13700, loss[loss=0.2811, simple_loss=0.3343, pruned_loss=0.1139, over 21762.00 frames. ], tot_loss[loss=0.3066, simple_loss=0.3624, pruned_loss=0.1254, over 4272539.03 frames. ], batch size: 282, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:37:01,948 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.44 vs. 
limit=15.0 2023-06-18 17:37:29,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=265230.0, ans=0.125 2023-06-18 17:37:37,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=265230.0, ans=0.2 2023-06-18 17:37:46,690 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:38:13,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=265350.0, ans=0.125 2023-06-18 17:38:28,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=265410.0, ans=0.125 2023-06-18 17:38:39,027 INFO [train.py:996] (0/4) Epoch 2, batch 13750, loss[loss=0.299, simple_loss=0.3724, pruned_loss=0.1128, over 21559.00 frames. ], tot_loss[loss=0.3019, simple_loss=0.3573, pruned_loss=0.1232, over 4267580.96 frames. ], batch size: 441, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:38:45,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.304e+02 3.651e+02 4.578e+02 5.768e+02 1.165e+03, threshold=9.156e+02, percent-clipped=11.0 2023-06-18 17:38:45,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=265470.0, ans=0.125 2023-06-18 17:38:54,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=265470.0, ans=0.025 2023-06-18 17:39:06,727 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-18 17:39:35,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265590.0, ans=0.1 2023-06-18 17:40:32,086 INFO [train.py:996] (0/4) Epoch 2, batch 13800, loss[loss=0.3435, simple_loss=0.4203, pruned_loss=0.1334, over 21678.00 frames. ], tot_loss[loss=0.304, simple_loss=0.3633, pruned_loss=0.1224, over 4268842.26 frames. ], batch size: 298, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:41:02,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265830.0, ans=0.1 2023-06-18 17:41:22,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. limit=6.0 2023-06-18 17:41:53,430 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:42:14,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=266070.0, ans=0.1 2023-06-18 17:42:15,753 INFO [train.py:996] (0/4) Epoch 2, batch 13850, loss[loss=0.3352, simple_loss=0.3824, pruned_loss=0.144, over 21384.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3692, pruned_loss=0.1243, over 4271960.56 frames. 
], batch size: 211, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:42:17,198 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 3.095e+02 3.814e+02 4.955e+02 1.017e+03, threshold=7.628e+02, percent-clipped=1.0 2023-06-18 17:42:27,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=266070.0, ans=0.0 2023-06-18 17:42:38,942 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=15.0 2023-06-18 17:43:01,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=266190.0, ans=0.125 2023-06-18 17:43:21,588 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:43:49,734 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.84 vs. limit=22.5 2023-06-18 17:43:53,116 INFO [train.py:996] (0/4) Epoch 2, batch 13900, loss[loss=0.2924, simple_loss=0.346, pruned_loss=0.1194, over 21783.00 frames. ], tot_loss[loss=0.3155, simple_loss=0.3741, pruned_loss=0.1284, over 4271518.69 frames. ], batch size: 298, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:43:58,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=266370.0, ans=0.0 2023-06-18 17:44:08,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=266370.0, ans=0.125 2023-06-18 17:44:30,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=266490.0, ans=0.125 2023-06-18 17:45:35,913 INFO [train.py:996] (0/4) Epoch 2, batch 13950, loss[loss=0.3058, simple_loss=0.3581, pruned_loss=0.1268, over 21832.00 frames. ], tot_loss[loss=0.3176, simple_loss=0.374, pruned_loss=0.1306, over 4281272.29 frames. ], batch size: 298, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:45:37,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.843e+02 4.662e+02 6.006e+02 1.294e+03, threshold=9.323e+02, percent-clipped=7.0 2023-06-18 17:45:49,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266670.0, ans=0.1 2023-06-18 17:45:50,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266730.0, ans=0.1 2023-06-18 17:46:25,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=22.5 2023-06-18 17:46:43,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=266850.0, ans=0.2 2023-06-18 17:47:14,080 INFO [train.py:996] (0/4) Epoch 2, batch 14000, loss[loss=0.2391, simple_loss=0.2986, pruned_loss=0.08975, over 21723.00 frames. ], tot_loss[loss=0.3086, simple_loss=0.3657, pruned_loss=0.1257, over 4260094.60 frames. ], batch size: 264, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:48:48,752 INFO [train.py:996] (0/4) Epoch 2, batch 14050, loss[loss=0.2801, simple_loss=0.3529, pruned_loss=0.1037, over 21863.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3583, pruned_loss=0.1204, over 4262074.34 frames. 
], batch size: 351, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:48:50,169 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 2.887e+02 3.656e+02 4.389e+02 9.715e+02, threshold=7.312e+02, percent-clipped=1.0 2023-06-18 17:48:50,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=267270.0, ans=0.125 2023-06-18 17:48:52,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=267270.0, ans=0.07 2023-06-18 17:50:03,218 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=12.0 2023-06-18 17:50:16,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=267510.0, ans=0.0 2023-06-18 17:50:23,577 INFO [train.py:996] (0/4) Epoch 2, batch 14100, loss[loss=0.2653, simple_loss=0.3059, pruned_loss=0.1124, over 20798.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3516, pruned_loss=0.1199, over 4262905.35 frames. ], batch size: 608, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:50:44,592 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-18 17:51:16,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-18 17:51:37,339 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-18 17:51:52,576 INFO [train.py:996] (0/4) Epoch 2, batch 14150, loss[loss=0.2682, simple_loss=0.3462, pruned_loss=0.09505, over 21819.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3584, pruned_loss=0.1231, over 4249941.19 frames. ], batch size: 102, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:51:59,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.602e+02 4.448e+02 5.500e+02 9.616e+02, threshold=8.896e+02, percent-clipped=7.0 2023-06-18 17:52:26,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-18 17:53:17,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=268110.0, ans=0.0 2023-06-18 17:53:21,316 INFO [train.py:996] (0/4) Epoch 2, batch 14200, loss[loss=0.307, simple_loss=0.3394, pruned_loss=0.1373, over 21561.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3557, pruned_loss=0.12, over 4258143.01 frames. ], batch size: 263, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:53:22,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.36 vs. 
limit=22.5 2023-06-18 17:53:46,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=268230.0, ans=0.125 2023-06-18 17:53:53,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=268230.0, ans=0.2 2023-06-18 17:54:23,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=268350.0, ans=0.125 2023-06-18 17:54:54,797 INFO [train.py:996] (0/4) Epoch 2, batch 14250, loss[loss=0.3487, simple_loss=0.4369, pruned_loss=0.1302, over 19701.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3516, pruned_loss=0.1205, over 4262752.86 frames. ], batch size: 703, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:54:56,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 3.115e+02 4.292e+02 5.783e+02 1.043e+03, threshold=8.584e+02, percent-clipped=1.0 2023-06-18 17:55:33,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.69 vs. limit=15.0 2023-06-18 17:56:30,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=268710.0, ans=0.125 2023-06-18 17:56:33,168 INFO [train.py:996] (0/4) Epoch 2, batch 14300, loss[loss=0.3959, simple_loss=0.4613, pruned_loss=0.1652, over 21763.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3552, pruned_loss=0.1196, over 4256960.01 frames. ], batch size: 332, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:56:50,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=268770.0, ans=0.2 2023-06-18 17:56:53,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.61 vs. limit=10.0 2023-06-18 17:56:54,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.81 vs. limit=6.0 2023-06-18 17:57:51,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=269010.0, ans=0.125 2023-06-18 17:58:09,494 INFO [train.py:996] (0/4) Epoch 2, batch 14350, loss[loss=0.3236, simple_loss=0.3721, pruned_loss=0.1376, over 20008.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3624, pruned_loss=0.1202, over 4258276.57 frames. ], batch size: 702, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:58:11,151 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.215e+02 4.287e+02 5.391e+02 1.265e+03, threshold=8.575e+02, percent-clipped=5.0 2023-06-18 17:59:48,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=269310.0, ans=0.025 2023-06-18 17:59:48,776 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-18 17:59:50,474 INFO [train.py:996] (0/4) Epoch 2, batch 14400, loss[loss=0.3314, simple_loss=0.3632, pruned_loss=0.1498, over 22021.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3603, pruned_loss=0.1231, over 4262709.82 frames. 
], batch size: 103, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 18:00:15,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269430.0, ans=0.1 2023-06-18 18:00:43,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=269550.0, ans=0.125 2023-06-18 18:01:13,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=269610.0, ans=0.0 2023-06-18 18:01:25,596 INFO [train.py:996] (0/4) Epoch 2, batch 14450, loss[loss=0.3127, simple_loss=0.3497, pruned_loss=0.1378, over 21589.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3536, pruned_loss=0.1226, over 4262494.55 frames. ], batch size: 441, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 18:01:26,959 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 3.300e+02 3.933e+02 4.836e+02 8.413e+02, threshold=7.867e+02, percent-clipped=0.0 2023-06-18 18:01:44,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=269730.0, ans=0.125 2023-06-18 18:02:02,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=269790.0, ans=0.125 2023-06-18 18:02:13,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=22.5 2023-06-18 18:02:14,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=269790.0, ans=0.1 2023-06-18 18:02:49,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=269910.0, ans=0.0 2023-06-18 18:03:01,258 INFO [train.py:996] (0/4) Epoch 2, batch 14500, loss[loss=0.338, simple_loss=0.4087, pruned_loss=0.1337, over 20927.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3506, pruned_loss=0.1224, over 4257536.23 frames. ], batch size: 608, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:03:23,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=270030.0, ans=0.125 2023-06-18 18:03:39,936 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-06-18 18:04:17,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=270210.0, ans=0.1 2023-06-18 18:04:43,607 INFO [train.py:996] (0/4) Epoch 2, batch 14550, loss[loss=0.2936, simple_loss=0.3343, pruned_loss=0.1265, over 20688.00 frames. ], tot_loss[loss=0.3016, simple_loss=0.3555, pruned_loss=0.1239, over 4253393.32 frames. ], batch size: 607, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:04:45,447 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.923e+02 3.262e+02 3.772e+02 5.838e+02, threshold=6.523e+02, percent-clipped=0.0 2023-06-18 18:06:09,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=270510.0, ans=0.0 2023-06-18 18:06:21,037 INFO [train.py:996] (0/4) Epoch 2, batch 14600, loss[loss=0.3376, simple_loss=0.3965, pruned_loss=0.1393, over 21783.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.366, pruned_loss=0.1303, over 4262819.28 frames. 
], batch size: 124, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:06:42,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-18 18:06:55,931 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-18 18:07:14,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-06-18 18:07:58,430 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.67 vs. limit=10.0 2023-06-18 18:07:58,650 INFO [train.py:996] (0/4) Epoch 2, batch 14650, loss[loss=0.2172, simple_loss=0.2986, pruned_loss=0.06785, over 21623.00 frames. ], tot_loss[loss=0.3127, simple_loss=0.3685, pruned_loss=0.1285, over 4268856.97 frames. ], batch size: 263, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:08:00,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.284e+02 3.921e+02 4.845e+02 9.187e+02, threshold=7.842e+02, percent-clipped=12.0 2023-06-18 18:08:09,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=270870.0, ans=0.125 2023-06-18 18:08:25,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=270930.0, ans=0.2 2023-06-18 18:08:32,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=270990.0, ans=0.2 2023-06-18 18:09:13,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=271050.0, ans=0.0 2023-06-18 18:09:34,801 INFO [train.py:996] (0/4) Epoch 2, batch 14700, loss[loss=0.3074, simple_loss=0.3762, pruned_loss=0.1193, over 21649.00 frames. ], tot_loss[loss=0.3005, simple_loss=0.3595, pruned_loss=0.1207, over 4256958.90 frames. ], batch size: 263, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:09:47,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=271170.0, ans=0.125 2023-06-18 18:10:00,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-18 18:10:03,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=12.0 2023-06-18 18:10:47,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.50 vs. limit=15.0 2023-06-18 18:11:07,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=271410.0, ans=0.0 2023-06-18 18:11:13,588 INFO [train.py:996] (0/4) Epoch 2, batch 14750, loss[loss=0.3209, simple_loss=0.3603, pruned_loss=0.1407, over 21192.00 frames. ], tot_loss[loss=0.3078, simple_loss=0.367, pruned_loss=0.1243, over 4264429.54 frames. 
], batch size: 608, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:11:15,475 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 3.023e+02 4.317e+02 6.398e+02 9.994e+02, threshold=8.633e+02, percent-clipped=9.0 2023-06-18 18:11:33,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=271530.0, ans=0.125 2023-06-18 18:12:11,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=271590.0, ans=0.0 2023-06-18 18:12:15,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-18 18:12:20,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=271650.0, ans=0.0 2023-06-18 18:12:24,266 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-18 18:12:51,295 INFO [train.py:996] (0/4) Epoch 2, batch 14800, loss[loss=0.3537, simple_loss=0.4065, pruned_loss=0.1504, over 21554.00 frames. ], tot_loss[loss=0.3196, simple_loss=0.3787, pruned_loss=0.1303, over 4264412.60 frames. ], batch size: 414, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:13:49,711 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.79 vs. limit=15.0 2023-06-18 18:14:03,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-18 18:14:17,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=272010.0, ans=0.0 2023-06-18 18:14:31,716 INFO [train.py:996] (0/4) Epoch 2, batch 14850, loss[loss=0.3585, simple_loss=0.4128, pruned_loss=0.1521, over 21623.00 frames. ], tot_loss[loss=0.3159, simple_loss=0.3715, pruned_loss=0.1302, over 4266605.69 frames. ], batch size: 389, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:14:33,121 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.362e+02 3.948e+02 5.141e+02 8.278e+02, threshold=7.896e+02, percent-clipped=0.0 2023-06-18 18:14:44,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=272070.0, ans=0.125 2023-06-18 18:14:44,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=272070.0, ans=0.1 2023-06-18 18:15:15,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=272130.0, ans=0.0 2023-06-18 18:15:19,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=272130.0, ans=0.125 2023-06-18 18:15:37,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=272250.0, ans=0.0 2023-06-18 18:15:37,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. 
limit=15.0 2023-06-18 18:15:54,457 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.34 vs. limit=15.0 2023-06-18 18:16:08,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=272310.0, ans=0.125 2023-06-18 18:16:11,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-18 18:16:13,873 INFO [train.py:996] (0/4) Epoch 2, batch 14900, loss[loss=0.3407, simple_loss=0.3866, pruned_loss=0.1474, over 21978.00 frames. ], tot_loss[loss=0.3202, simple_loss=0.375, pruned_loss=0.1327, over 4269404.79 frames. ], batch size: 317, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:16:15,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=272370.0, ans=0.1 2023-06-18 18:16:33,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=272370.0, ans=0.1 2023-06-18 18:16:39,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=272370.0, ans=0.0 2023-06-18 18:16:56,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=272430.0, ans=0.2 2023-06-18 18:17:13,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=272490.0, ans=0.125 2023-06-18 18:17:26,820 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-18 18:18:07,145 INFO [train.py:996] (0/4) Epoch 2, batch 14950, loss[loss=0.3256, simple_loss=0.3827, pruned_loss=0.1343, over 21966.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3743, pruned_loss=0.1306, over 4274103.22 frames. ], batch size: 317, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:18:08,887 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.099e+02 3.898e+02 5.275e+02 1.469e+03, threshold=7.796e+02, percent-clipped=9.0 2023-06-18 18:18:49,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=272790.0, ans=0.125 2023-06-18 18:19:44,473 INFO [train.py:996] (0/4) Epoch 2, batch 15000, loss[loss=0.312, simple_loss=0.3878, pruned_loss=0.118, over 20757.00 frames. ], tot_loss[loss=0.322, simple_loss=0.3778, pruned_loss=0.1331, over 4276339.52 frames. ], batch size: 607, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:19:44,474 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 18:19:59,944 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2784, simple_loss=0.3732, pruned_loss=0.09186, over 1796401.00 frames. 2023-06-18 18:19:59,944 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 18:20:04,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=272970.0, ans=0.025 2023-06-18 18:21:39,236 INFO [train.py:996] (0/4) Epoch 2, batch 15050, loss[loss=0.4062, simple_loss=0.48, pruned_loss=0.1662, over 20863.00 frames. ], tot_loss[loss=0.325, simple_loss=0.3803, pruned_loss=0.1348, over 4275124.40 frames. 
], batch size: 607, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:21:42,435 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 3.625e+02 4.289e+02 5.445e+02 1.034e+03, threshold=8.577e+02, percent-clipped=3.0 2023-06-18 18:21:50,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=273270.0, ans=0.0 2023-06-18 18:23:16,040 INFO [train.py:996] (0/4) Epoch 2, batch 15100, loss[loss=0.377, simple_loss=0.4638, pruned_loss=0.1451, over 19751.00 frames. ], tot_loss[loss=0.3249, simple_loss=0.3824, pruned_loss=0.1337, over 4267000.35 frames. ], batch size: 702, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:23:28,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=273570.0, ans=0.0 2023-06-18 18:23:29,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-18 18:23:46,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=273630.0, ans=0.0 2023-06-18 18:23:48,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=273630.0, ans=0.0 2023-06-18 18:23:48,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=273630.0, ans=0.125 2023-06-18 18:23:49,486 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.54 vs. limit=10.0 2023-06-18 18:23:57,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273690.0, ans=0.1 2023-06-18 18:24:23,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=273750.0, ans=0.125 2023-06-18 18:24:29,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273750.0, ans=0.1 2023-06-18 18:24:41,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=273810.0, ans=0.0 2023-06-18 18:24:47,568 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-18 18:24:52,916 INFO [train.py:996] (0/4) Epoch 2, batch 15150, loss[loss=0.3544, simple_loss=0.3768, pruned_loss=0.166, over 21351.00 frames. ], tot_loss[loss=0.3217, simple_loss=0.3765, pruned_loss=0.1334, over 4266971.47 frames. 
], batch size: 473, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:24:56,355 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 3.235e+02 3.924e+02 4.419e+02 1.242e+03, threshold=7.848e+02, percent-clipped=3.0 2023-06-18 18:25:01,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273870.0, ans=0.1 2023-06-18 18:25:32,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273990.0, ans=0.1 2023-06-18 18:25:48,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=273990.0, ans=0.1 2023-06-18 18:25:54,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=273990.0, ans=0.125 2023-06-18 18:25:56,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=273990.0, ans=0.035 2023-06-18 18:26:11,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=274050.0, ans=0.0 2023-06-18 18:26:25,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=274110.0, ans=0.125 2023-06-18 18:26:25,723 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.25 vs. limit=22.5 2023-06-18 18:26:29,421 INFO [train.py:996] (0/4) Epoch 2, batch 15200, loss[loss=0.2584, simple_loss=0.3451, pruned_loss=0.08585, over 21581.00 frames. ], tot_loss[loss=0.3097, simple_loss=0.3652, pruned_loss=0.1271, over 4261319.49 frames. ], batch size: 389, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:27:05,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=274230.0, ans=0.125 2023-06-18 18:27:28,998 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-18 18:27:33,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=274350.0, ans=0.125 2023-06-18 18:27:37,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=274350.0, ans=10.0 2023-06-18 18:28:05,745 INFO [train.py:996] (0/4) Epoch 2, batch 15250, loss[loss=0.3307, simple_loss=0.3649, pruned_loss=0.1483, over 21589.00 frames. ], tot_loss[loss=0.3058, simple_loss=0.36, pruned_loss=0.1259, over 4254638.24 frames. ], batch size: 441, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:28:08,693 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.957e+02 3.425e+02 4.072e+02 6.895e+02, threshold=6.850e+02, percent-clipped=0.0 2023-06-18 18:28:26,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=274530.0, ans=0.5 2023-06-18 18:29:06,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=12.0 2023-06-18 18:29:43,133 INFO [train.py:996] (0/4) Epoch 2, batch 15300, loss[loss=0.3364, simple_loss=0.3893, pruned_loss=0.1418, over 21641.00 frames. 
], tot_loss[loss=0.3124, simple_loss=0.3645, pruned_loss=0.1301, over 4254817.28 frames. ], batch size: 113, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:29:53,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=274770.0, ans=0.1 2023-06-18 18:31:19,251 INFO [train.py:996] (0/4) Epoch 2, batch 15350, loss[loss=0.2921, simple_loss=0.3611, pruned_loss=0.1115, over 21784.00 frames. ], tot_loss[loss=0.3182, simple_loss=0.3715, pruned_loss=0.1325, over 4260808.71 frames. ], batch size: 282, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:31:22,366 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.404e+02 4.041e+02 5.108e+02 1.058e+03, threshold=8.082e+02, percent-clipped=7.0 2023-06-18 18:31:43,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=275130.0, ans=0.125 2023-06-18 18:31:46,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275130.0, ans=0.1 2023-06-18 18:32:08,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=275190.0, ans=0.125 2023-06-18 18:32:18,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=275250.0, ans=0.05 2023-06-18 18:32:32,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-18 18:32:33,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275310.0, ans=0.1 2023-06-18 18:32:49,577 INFO [train.py:996] (0/4) Epoch 2, batch 15400, loss[loss=0.3126, simple_loss=0.3621, pruned_loss=0.1315, over 21821.00 frames. ], tot_loss[loss=0.3182, simple_loss=0.3741, pruned_loss=0.1311, over 4264218.57 frames. ], batch size: 414, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:33:07,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-18 18:33:11,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=275430.0, ans=0.2 2023-06-18 18:33:15,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-18 18:34:02,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=275550.0, ans=0.125 2023-06-18 18:34:26,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=275670.0, ans=0.125 2023-06-18 18:34:27,598 INFO [train.py:996] (0/4) Epoch 2, batch 15450, loss[loss=0.2887, simple_loss=0.3444, pruned_loss=0.1165, over 21851.00 frames. ], tot_loss[loss=0.3142, simple_loss=0.3703, pruned_loss=0.129, over 4271032.67 frames. 
], batch size: 107, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:34:28,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=275670.0, ans=0.125 2023-06-18 18:34:30,684 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.287e+02 3.877e+02 4.804e+02 8.434e+02, threshold=7.754e+02, percent-clipped=1.0 2023-06-18 18:34:43,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=275670.0, ans=0.125 2023-06-18 18:35:33,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=275850.0, ans=0.0 2023-06-18 18:35:48,306 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:36:05,302 INFO [train.py:996] (0/4) Epoch 2, batch 15500, loss[loss=0.4204, simple_loss=0.4447, pruned_loss=0.198, over 21354.00 frames. ], tot_loss[loss=0.3161, simple_loss=0.3729, pruned_loss=0.1297, over 4268070.80 frames. ], batch size: 507, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:36:47,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=276030.0, ans=0.125 2023-06-18 18:37:01,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=276090.0, ans=0.125 2023-06-18 18:37:33,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=276210.0, ans=0.2 2023-06-18 18:37:41,714 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-18 18:37:47,908 INFO [train.py:996] (0/4) Epoch 2, batch 15550, loss[loss=0.2518, simple_loss=0.3136, pruned_loss=0.09502, over 21343.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.3687, pruned_loss=0.127, over 4261266.14 frames. ], batch size: 194, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:37:51,082 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.141e+02 3.857e+02 4.872e+02 7.208e+02, threshold=7.715e+02, percent-clipped=0.0 2023-06-18 18:38:40,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=276390.0, ans=0.2 2023-06-18 18:38:53,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=276450.0, ans=0.07 2023-06-18 18:38:57,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=276450.0, ans=0.0 2023-06-18 18:39:03,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=276510.0, ans=0.125 2023-06-18 18:39:11,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=276510.0, ans=0.125 2023-06-18 18:39:24,631 INFO [train.py:996] (0/4) Epoch 2, batch 15600, loss[loss=0.2724, simple_loss=0.3434, pruned_loss=0.1007, over 21601.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.3614, pruned_loss=0.1249, over 4261610.31 frames. 
], batch size: 263, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:40:15,927 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:40:17,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=276690.0, ans=0.2 2023-06-18 18:40:54,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=276810.0, ans=0.0 2023-06-18 18:41:13,313 INFO [train.py:996] (0/4) Epoch 2, batch 15650, loss[loss=0.3048, simple_loss=0.3547, pruned_loss=0.1275, over 21629.00 frames. ], tot_loss[loss=0.3029, simple_loss=0.3585, pruned_loss=0.1237, over 4257410.95 frames. ], batch size: 332, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:41:16,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 3.146e+02 3.937e+02 5.420e+02 1.080e+03, threshold=7.874e+02, percent-clipped=10.0 2023-06-18 18:42:03,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=277050.0, ans=0.125 2023-06-18 18:42:26,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=277110.0, ans=0.125 2023-06-18 18:42:49,656 INFO [train.py:996] (0/4) Epoch 2, batch 15700, loss[loss=0.2945, simple_loss=0.3418, pruned_loss=0.1236, over 21822.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3529, pruned_loss=0.1212, over 4262493.89 frames. ], batch size: 352, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:43:23,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=277290.0, ans=0.125 2023-06-18 18:43:23,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=277290.0, ans=0.2 2023-06-18 18:43:23,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277290.0, ans=0.1 2023-06-18 18:43:25,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=22.5 2023-06-18 18:43:50,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=277350.0, ans=0.5 2023-06-18 18:44:04,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277410.0, ans=0.1 2023-06-18 18:44:19,895 INFO [train.py:996] (0/4) Epoch 2, batch 15750, loss[loss=0.2851, simple_loss=0.3285, pruned_loss=0.1208, over 21539.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3478, pruned_loss=0.1203, over 4258165.64 frames. ], batch size: 263, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:44:28,131 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 3.151e+02 3.926e+02 5.162e+02 7.477e+02, threshold=7.853e+02, percent-clipped=0.0 2023-06-18 18:44:29,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. 
limit=15.0 2023-06-18 18:44:33,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=277470.0, ans=0.125 2023-06-18 18:44:51,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=277530.0, ans=0.2 2023-06-18 18:44:59,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277590.0, ans=0.1 2023-06-18 18:45:17,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=277650.0, ans=0.125 2023-06-18 18:45:45,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=277710.0, ans=0.125 2023-06-18 18:46:01,542 INFO [train.py:996] (0/4) Epoch 2, batch 15800, loss[loss=0.248, simple_loss=0.3012, pruned_loss=0.0974, over 21674.00 frames. ], tot_loss[loss=0.2922, simple_loss=0.3442, pruned_loss=0.1201, over 4261656.42 frames. ], batch size: 298, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:47:34,332 INFO [train.py:996] (0/4) Epoch 2, batch 15850, loss[loss=0.3357, simple_loss=0.3853, pruned_loss=0.1431, over 21570.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3487, pruned_loss=0.1244, over 4258691.50 frames. ], batch size: 415, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:47:37,257 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.427e+02 3.232e+02 3.999e+02 5.094e+02 1.481e+03, threshold=7.998e+02, percent-clipped=6.0 2023-06-18 18:48:31,088 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-18 18:49:10,700 INFO [train.py:996] (0/4) Epoch 2, batch 15900, loss[loss=0.2757, simple_loss=0.354, pruned_loss=0.09872, over 21665.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3503, pruned_loss=0.1249, over 4258794.13 frames. ], batch size: 298, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:49:29,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=278430.0, ans=0.0 2023-06-18 18:49:36,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-06-18 18:50:09,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-18 18:50:19,606 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-18 18:50:25,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=15.0 2023-06-18 18:50:41,670 INFO [train.py:996] (0/4) Epoch 2, batch 15950, loss[loss=0.2887, simple_loss=0.3449, pruned_loss=0.1162, over 21463.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3511, pruned_loss=0.1223, over 4246282.74 frames. 
], batch size: 131, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:50:49,861 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 3.268e+02 3.930e+02 5.229e+02 1.203e+03, threshold=7.860e+02, percent-clipped=8.0 2023-06-18 18:51:23,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=15.0 2023-06-18 18:52:06,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=278910.0, ans=0.0 2023-06-18 18:52:18,714 INFO [train.py:996] (0/4) Epoch 2, batch 16000, loss[loss=0.2828, simple_loss=0.3633, pruned_loss=0.1012, over 21852.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3514, pruned_loss=0.1191, over 4249454.77 frames. ], batch size: 371, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:52:52,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0 2023-06-18 18:52:58,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279090.0, ans=0.1 2023-06-18 18:53:04,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=279090.0, ans=0.2 2023-06-18 18:53:24,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=279150.0, ans=0.125 2023-06-18 18:53:47,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=279210.0, ans=0.02 2023-06-18 18:53:48,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=279270.0, ans=0.0 2023-06-18 18:53:49,674 INFO [train.py:996] (0/4) Epoch 2, batch 16050, loss[loss=0.264, simple_loss=0.342, pruned_loss=0.09302, over 21617.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3538, pruned_loss=0.1171, over 4260180.88 frames. ], batch size: 263, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:53:57,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 3.161e+02 3.861e+02 4.688e+02 7.896e+02, threshold=7.722e+02, percent-clipped=1.0 2023-06-18 18:54:08,504 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.97 vs. limit=22.5 2023-06-18 18:54:16,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=279270.0, ans=0.0 2023-06-18 18:54:23,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=279330.0, ans=0.1 2023-06-18 18:54:29,870 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.63 vs. limit=10.0 2023-06-18 18:55:23,320 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=22.5 2023-06-18 18:55:25,229 INFO [train.py:996] (0/4) Epoch 2, batch 16100, loss[loss=0.3227, simple_loss=0.3716, pruned_loss=0.1369, over 21734.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3552, pruned_loss=0.1177, over 4261524.27 frames. 
], batch size: 389, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:55:36,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=279570.0, ans=0.0 2023-06-18 18:55:45,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=279570.0, ans=0.125 2023-06-18 18:56:15,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=22.5 2023-06-18 18:56:32,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=279750.0, ans=0.0 2023-06-18 18:56:35,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=279750.0, ans=0.0 2023-06-18 18:56:39,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=279810.0, ans=0.125 2023-06-18 18:56:52,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=279810.0, ans=0.125 2023-06-18 18:56:55,122 INFO [train.py:996] (0/4) Epoch 2, batch 16150, loss[loss=0.284, simple_loss=0.3349, pruned_loss=0.1165, over 21438.00 frames. ], tot_loss[loss=0.2988, simple_loss=0.3551, pruned_loss=0.1213, over 4276317.53 frames. ], batch size: 211, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:56:55,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=279870.0, ans=0.0 2023-06-18 18:57:02,512 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 3.360e+02 4.498e+02 6.275e+02 1.287e+03, threshold=8.996e+02, percent-clipped=10.0 2023-06-18 18:57:49,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=279990.0, ans=0.125 2023-06-18 18:58:08,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=280050.0, ans=0.0 2023-06-18 18:58:21,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-18 18:58:22,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=280110.0, ans=0.125 2023-06-18 18:58:24,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=280110.0, ans=0.125 2023-06-18 18:58:24,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280110.0, ans=0.1 2023-06-18 18:58:34,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=280170.0, ans=0.125 2023-06-18 18:58:35,870 INFO [train.py:996] (0/4) Epoch 2, batch 16200, loss[loss=0.4013, simple_loss=0.4382, pruned_loss=0.1822, over 21439.00 frames. ], tot_loss[loss=0.3029, simple_loss=0.3595, pruned_loss=0.1231, over 4274943.44 frames. ], batch size: 471, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 18:58:43,862 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.30 vs. 
limit=22.5 2023-06-18 18:59:11,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=280230.0, ans=0.0 2023-06-18 19:00:17,023 INFO [train.py:996] (0/4) Epoch 2, batch 16250, loss[loss=0.1991, simple_loss=0.2636, pruned_loss=0.0673, over 21744.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3584, pruned_loss=0.1221, over 4274465.39 frames. ], batch size: 112, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:00:19,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=15.0 2023-06-18 19:00:19,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.939e+02 3.512e+02 5.146e+02 1.306e+03, threshold=7.023e+02, percent-clipped=4.0 2023-06-18 19:00:39,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=280530.0, ans=0.125 2023-06-18 19:01:16,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=280650.0, ans=0.0 2023-06-18 19:01:48,540 INFO [train.py:996] (0/4) Epoch 2, batch 16300, loss[loss=0.2576, simple_loss=0.3093, pruned_loss=0.103, over 21776.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3507, pruned_loss=0.1163, over 4275197.99 frames. ], batch size: 112, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:03:31,623 INFO [train.py:996] (0/4) Epoch 2, batch 16350, loss[loss=0.3919, simple_loss=0.4419, pruned_loss=0.171, over 21855.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3549, pruned_loss=0.119, over 4272801.55 frames. ], batch size: 124, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:03:34,729 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 3.295e+02 4.517e+02 5.318e+02 1.033e+03, threshold=9.034e+02, percent-clipped=9.0 2023-06-18 19:04:22,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-18 19:04:32,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.35 vs. limit=22.5 2023-06-18 19:05:06,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=281310.0, ans=0.125 2023-06-18 19:05:08,354 INFO [train.py:996] (0/4) Epoch 2, batch 16400, loss[loss=0.3277, simple_loss=0.3696, pruned_loss=0.1429, over 21406.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.36, pruned_loss=0.1213, over 4278002.90 frames. ], batch size: 144, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:05:31,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=281430.0, ans=0.1 2023-06-18 19:05:53,263 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:06:26,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=281610.0, ans=0.125 2023-06-18 19:06:43,721 INFO [train.py:996] (0/4) Epoch 2, batch 16450, loss[loss=0.2847, simple_loss=0.3305, pruned_loss=0.1194, over 21560.00 frames. ], tot_loss[loss=0.3025, simple_loss=0.3593, pruned_loss=0.1228, over 4286437.97 frames. 
], batch size: 212, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:06:46,792 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.379e+02 3.250e+02 3.733e+02 4.517e+02 7.172e+02, threshold=7.466e+02, percent-clipped=0.0 2023-06-18 19:07:12,395 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:07:14,355 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-18 19:07:17,753 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0 2023-06-18 19:08:06,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=281910.0, ans=0.04949747468305833 2023-06-18 19:08:18,624 INFO [train.py:996] (0/4) Epoch 2, batch 16500, loss[loss=0.2353, simple_loss=0.2729, pruned_loss=0.0988, over 21206.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.36, pruned_loss=0.1232, over 4285876.50 frames. ], batch size: 159, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:08:53,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=15.0 2023-06-18 19:09:02,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=282090.0, ans=0.125 2023-06-18 19:09:34,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=282150.0, ans=0.0 2023-06-18 19:09:57,182 INFO [train.py:996] (0/4) Epoch 2, batch 16550, loss[loss=0.286, simple_loss=0.3505, pruned_loss=0.1107, over 21807.00 frames. ], tot_loss[loss=0.2947, simple_loss=0.3536, pruned_loss=0.118, over 4281375.15 frames. ], batch size: 282, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:10:00,343 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 3.451e+02 4.279e+02 5.005e+02 9.425e+02, threshold=8.558e+02, percent-clipped=2.0 2023-06-18 19:10:15,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=282330.0, ans=0.125 2023-06-18 19:10:16,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=282330.0, ans=0.0 2023-06-18 19:10:28,778 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-18 19:11:08,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=282450.0, ans=0.0 2023-06-18 19:11:31,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=282510.0, ans=0.125 2023-06-18 19:11:36,202 INFO [train.py:996] (0/4) Epoch 2, batch 16600, loss[loss=0.3432, simple_loss=0.403, pruned_loss=0.1417, over 21824.00 frames. ], tot_loss[loss=0.3042, simple_loss=0.3634, pruned_loss=0.1225, over 4277530.67 frames. 
], batch size: 118, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:12:43,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=282750.0, ans=0.125 2023-06-18 19:13:01,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=282810.0, ans=0.125 2023-06-18 19:13:03,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=282810.0, ans=0.0 2023-06-18 19:13:13,194 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5 2023-06-18 19:13:15,423 INFO [train.py:996] (0/4) Epoch 2, batch 16650, loss[loss=0.3325, simple_loss=0.3927, pruned_loss=0.1362, over 21996.00 frames. ], tot_loss[loss=0.313, simple_loss=0.3731, pruned_loss=0.1265, over 4272638.04 frames. ], batch size: 317, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:13:15,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=282870.0, ans=0.0 2023-06-18 19:13:18,773 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.486e+02 3.999e+02 5.162e+02 7.260e+02, threshold=7.998e+02, percent-clipped=0.0 2023-06-18 19:14:00,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=282930.0, ans=0.125 2023-06-18 19:15:10,031 INFO [train.py:996] (0/4) Epoch 2, batch 16700, loss[loss=0.2387, simple_loss=0.2871, pruned_loss=0.09516, over 21885.00 frames. ], tot_loss[loss=0.312, simple_loss=0.3721, pruned_loss=0.1259, over 4269770.85 frames. ], batch size: 98, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:15:34,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=283230.0, ans=0.125 2023-06-18 19:15:49,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=283290.0, ans=0.0 2023-06-18 19:15:53,497 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. limit=10.0 2023-06-18 19:16:16,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=283350.0, ans=0.0 2023-06-18 19:16:51,999 INFO [train.py:996] (0/4) Epoch 2, batch 16750, loss[loss=0.4001, simple_loss=0.4514, pruned_loss=0.1743, over 21594.00 frames. ], tot_loss[loss=0.3185, simple_loss=0.3783, pruned_loss=0.1293, over 4263255.67 frames. ], batch size: 414, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:16:55,210 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.344e+02 3.957e+02 4.930e+02 8.837e+02, threshold=7.914e+02, percent-clipped=3.0 2023-06-18 19:18:07,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=283650.0, ans=0.04949747468305833 2023-06-18 19:18:30,455 INFO [train.py:996] (0/4) Epoch 2, batch 16800, loss[loss=0.3056, simple_loss=0.3561, pruned_loss=0.1275, over 21966.00 frames. ], tot_loss[loss=0.3189, simple_loss=0.38, pruned_loss=0.1289, over 4253154.44 frames. 
], batch size: 113, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:18:35,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=283770.0, ans=0.035 2023-06-18 19:18:35,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=283770.0, ans=0.02 2023-06-18 19:19:07,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=283890.0, ans=0.125 2023-06-18 19:19:21,253 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-18 19:19:38,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=283950.0, ans=0.1 2023-06-18 19:19:43,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=283950.0, ans=0.125 2023-06-18 19:19:44,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=283950.0, ans=0.1 2023-06-18 19:19:57,833 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.39 vs. limit=22.5 2023-06-18 19:20:03,627 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.997e-02 2023-06-18 19:20:06,235 INFO [train.py:996] (0/4) Epoch 2, batch 16850, loss[loss=0.3264, simple_loss=0.3645, pruned_loss=0.1442, over 21350.00 frames. ], tot_loss[loss=0.317, simple_loss=0.3755, pruned_loss=0.1293, over 4264701.45 frames. ], batch size: 159, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:20:09,191 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.662e+02 3.797e+02 4.349e+02 5.361e+02 8.347e+02, threshold=8.698e+02, percent-clipped=3.0 2023-06-18 19:20:31,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=284130.0, ans=0.125 2023-06-18 19:21:12,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=284250.0, ans=0.0 2023-06-18 19:21:37,639 INFO [train.py:996] (0/4) Epoch 2, batch 16900, loss[loss=0.3185, simple_loss=0.3886, pruned_loss=0.1242, over 20665.00 frames. ], tot_loss[loss=0.3121, simple_loss=0.3696, pruned_loss=0.1272, over 4273089.85 frames. ], batch size: 607, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:22:55,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=284550.0, ans=0.0 2023-06-18 19:23:13,346 INFO [train.py:996] (0/4) Epoch 2, batch 16950, loss[loss=0.2861, simple_loss=0.336, pruned_loss=0.1181, over 21910.00 frames. ], tot_loss[loss=0.3078, simple_loss=0.3647, pruned_loss=0.1255, over 4271413.72 frames. 
], batch size: 351, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:23:16,633 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.116e+02 4.142e+02 5.213e+02 8.477e+02, threshold=8.284e+02, percent-clipped=0.0 2023-06-18 19:23:23,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=284670.0, ans=0.2 2023-06-18 19:23:37,821 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-18 19:24:00,990 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:24:06,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=284790.0, ans=0.0 2023-06-18 19:24:45,723 INFO [train.py:996] (0/4) Epoch 2, batch 17000, loss[loss=0.3501, simple_loss=0.3806, pruned_loss=0.1598, over 21632.00 frames. ], tot_loss[loss=0.3069, simple_loss=0.3627, pruned_loss=0.1256, over 4282604.79 frames. ], batch size: 471, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:24:49,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-18 19:24:52,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=284970.0, ans=0.125 2023-06-18 19:25:34,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=285090.0, ans=0.0 2023-06-18 19:25:48,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=285090.0, ans=0.125 2023-06-18 19:26:05,393 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:26:17,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=285210.0, ans=0.0 2023-06-18 19:26:22,077 INFO [train.py:996] (0/4) Epoch 2, batch 17050, loss[loss=0.3495, simple_loss=0.4113, pruned_loss=0.1439, over 21801.00 frames. ], tot_loss[loss=0.3134, simple_loss=0.3704, pruned_loss=0.1282, over 4286415.00 frames. ], batch size: 414, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:26:25,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 3.160e+02 3.853e+02 4.663e+02 1.166e+03, threshold=7.706e+02, percent-clipped=2.0 2023-06-18 19:26:25,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=285270.0, ans=0.125 2023-06-18 19:27:26,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=285450.0, ans=0.2 2023-06-18 19:27:28,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=285450.0, ans=0.125 2023-06-18 19:27:37,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-18 19:27:56,609 INFO [train.py:996] (0/4) Epoch 2, batch 17100, loss[loss=0.3162, simple_loss=0.3513, pruned_loss=0.1405, over 21443.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.369, pruned_loss=0.128, over 4292513.56 frames. 
], batch size: 211, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:28:53,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=285690.0, ans=0.125 2023-06-18 19:29:00,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=285750.0, ans=0.1 2023-06-18 19:29:25,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285870.0, ans=0.1 2023-06-18 19:29:26,176 INFO [train.py:996] (0/4) Epoch 2, batch 17150, loss[loss=0.2947, simple_loss=0.3391, pruned_loss=0.1252, over 21581.00 frames. ], tot_loss[loss=0.31, simple_loss=0.365, pruned_loss=0.1275, over 4294753.09 frames. ], batch size: 548, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:29:27,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0 2023-06-18 19:29:29,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 3.459e+02 4.285e+02 5.134e+02 9.644e+02, threshold=8.570e+02, percent-clipped=5.0 2023-06-18 19:30:32,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=286050.0, ans=0.125 2023-06-18 19:30:54,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=286110.0, ans=0.2 2023-06-18 19:31:03,092 INFO [train.py:996] (0/4) Epoch 2, batch 17200, loss[loss=0.386, simple_loss=0.4713, pruned_loss=0.1504, over 20885.00 frames. ], tot_loss[loss=0.3107, simple_loss=0.3656, pruned_loss=0.128, over 4292600.22 frames. ], batch size: 607, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:31:35,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=286230.0, ans=0.95 2023-06-18 19:32:20,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=286350.0, ans=0.0 2023-06-18 19:32:24,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-18 19:32:33,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286410.0, ans=0.1 2023-06-18 19:32:50,100 INFO [train.py:996] (0/4) Epoch 2, batch 17250, loss[loss=0.3358, simple_loss=0.3878, pruned_loss=0.1419, over 21803.00 frames. ], tot_loss[loss=0.3185, simple_loss=0.3723, pruned_loss=0.1324, over 4292340.61 frames. ], batch size: 282, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:32:53,222 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.381e+02 4.035e+02 5.054e+02 8.566e+02, threshold=8.070e+02, percent-clipped=0.0 2023-06-18 19:33:47,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=286650.0, ans=0.0 2023-06-18 19:34:31,240 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:34:33,932 INFO [train.py:996] (0/4) Epoch 2, batch 17300, loss[loss=0.3044, simple_loss=0.3595, pruned_loss=0.1247, over 21735.00 frames. ], tot_loss[loss=0.3243, simple_loss=0.3787, pruned_loss=0.1349, over 4293444.22 frames. 
], batch size: 247, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:35:19,324 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.29 vs. limit=10.0 2023-06-18 19:35:58,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=287010.0, ans=0.0 2023-06-18 19:36:14,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=287010.0, ans=0.125 2023-06-18 19:36:18,575 INFO [train.py:996] (0/4) Epoch 2, batch 17350, loss[loss=0.279, simple_loss=0.3774, pruned_loss=0.09035, over 21255.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3792, pruned_loss=0.1343, over 4293611.16 frames. ], batch size: 548, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:36:23,409 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 3.256e+02 3.949e+02 5.005e+02 1.104e+03, threshold=7.898e+02, percent-clipped=4.0 2023-06-18 19:36:53,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=287190.0, ans=0.0 2023-06-18 19:37:26,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=287250.0, ans=0.0 2023-06-18 19:37:50,481 INFO [train.py:996] (0/4) Epoch 2, batch 17400, loss[loss=0.3072, simple_loss=0.4017, pruned_loss=0.1064, over 19832.00 frames. ], tot_loss[loss=0.3173, simple_loss=0.3748, pruned_loss=0.1298, over 4287552.46 frames. ], batch size: 702, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:37:52,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=287370.0, ans=0.1 2023-06-18 19:38:07,359 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:38:22,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=287490.0, ans=0.125 2023-06-18 19:38:43,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=287490.0, ans=0.2 2023-06-18 19:38:49,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=287550.0, ans=0.125 2023-06-18 19:38:53,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=287550.0, ans=0.125 2023-06-18 19:39:24,247 INFO [train.py:996] (0/4) Epoch 2, batch 17450, loss[loss=0.2354, simple_loss=0.3205, pruned_loss=0.07517, over 21786.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.3697, pruned_loss=0.1252, over 4281405.41 frames. 
], batch size: 333, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:39:28,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 2.947e+02 3.682e+02 4.725e+02 7.588e+02, threshold=7.364e+02, percent-clipped=0.0 2023-06-18 19:39:35,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=287670.0, ans=0.0 2023-06-18 19:39:38,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=287730.0, ans=0.125 2023-06-18 19:39:47,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=287730.0, ans=0.1 2023-06-18 19:40:12,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=287790.0, ans=0.125 2023-06-18 19:40:54,708 INFO [train.py:996] (0/4) Epoch 2, batch 17500, loss[loss=0.3026, simple_loss=0.3994, pruned_loss=0.1029, over 20805.00 frames. ], tot_loss[loss=0.3055, simple_loss=0.3656, pruned_loss=0.1227, over 4287108.30 frames. ], batch size: 608, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:41:01,174 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-48000.pt 2023-06-18 19:42:25,492 INFO [train.py:996] (0/4) Epoch 2, batch 17550, loss[loss=0.2653, simple_loss=0.3492, pruned_loss=0.09071, over 21773.00 frames. ], tot_loss[loss=0.3045, simple_loss=0.3662, pruned_loss=0.1214, over 4284654.39 frames. ], batch size: 124, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:42:30,239 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 3.260e+02 4.297e+02 5.739e+02 1.320e+03, threshold=8.594e+02, percent-clipped=8.0 2023-06-18 19:42:39,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=288330.0, ans=0.125 2023-06-18 19:43:33,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=288450.0, ans=0.0 2023-06-18 19:44:01,367 INFO [train.py:996] (0/4) Epoch 2, batch 17600, loss[loss=0.2884, simple_loss=0.3643, pruned_loss=0.1063, over 21589.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3655, pruned_loss=0.1205, over 4269921.48 frames. ], batch size: 112, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:44:39,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=288630.0, ans=0.07 2023-06-18 19:44:47,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=288690.0, ans=0.2 2023-06-18 19:45:31,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.22 vs. limit=22.5 2023-06-18 19:45:39,885 INFO [train.py:996] (0/4) Epoch 2, batch 17650, loss[loss=0.3029, simple_loss=0.3455, pruned_loss=0.1302, over 20143.00 frames. ], tot_loss[loss=0.302, simple_loss=0.3634, pruned_loss=0.1203, over 4270181.77 frames. 
], batch size: 702, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:45:42,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=288870.0, ans=0.2 2023-06-18 19:45:44,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.915e+02 4.145e+02 5.677e+02 8.803e+02, threshold=8.289e+02, percent-clipped=2.0 2023-06-18 19:46:43,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=288990.0, ans=0.2 2023-06-18 19:46:53,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.30 vs. limit=10.0 2023-06-18 19:47:05,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=289110.0, ans=0.125 2023-06-18 19:47:17,250 INFO [train.py:996] (0/4) Epoch 2, batch 17700, loss[loss=0.3183, simple_loss=0.3771, pruned_loss=0.1298, over 21336.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.3582, pruned_loss=0.1173, over 4260269.19 frames. ], batch size: 159, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:47:22,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=289170.0, ans=0.1 2023-06-18 19:48:04,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=289290.0, ans=0.125 2023-06-18 19:48:21,141 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.41 vs. limit=15.0 2023-06-18 19:48:55,997 INFO [train.py:996] (0/4) Epoch 2, batch 17750, loss[loss=0.3844, simple_loss=0.4299, pruned_loss=0.1695, over 21561.00 frames. ], tot_loss[loss=0.3081, simple_loss=0.3688, pruned_loss=0.1237, over 4266874.48 frames. ], batch size: 414, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:49:10,431 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.081e+02 4.126e+02 5.116e+02 1.162e+03, threshold=8.253e+02, percent-clipped=4.0 2023-06-18 19:49:12,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=289470.0, ans=0.035 2023-06-18 19:49:47,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=289590.0, ans=0.125 2023-06-18 19:49:50,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=289590.0, ans=0.125 2023-06-18 19:50:14,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=289710.0, ans=0.0 2023-06-18 19:50:21,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=289710.0, ans=0.125 2023-06-18 19:50:23,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=289710.0, ans=0.125 2023-06-18 19:50:29,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=289710.0, ans=0.0 2023-06-18 19:50:41,025 INFO [train.py:996] (0/4) Epoch 2, batch 17800, loss[loss=0.2195, simple_loss=0.2632, pruned_loss=0.08795, over 16940.00 frames. 
], tot_loss[loss=0.3055, simple_loss=0.3668, pruned_loss=0.1221, over 4257806.23 frames. ], batch size: 60, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:51:02,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=289770.0, ans=0.0 2023-06-18 19:51:08,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289830.0, ans=0.1 2023-06-18 19:51:41,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=289950.0, ans=0.1 2023-06-18 19:52:06,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=290010.0, ans=0.125 2023-06-18 19:52:17,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=290010.0, ans=0.07 2023-06-18 19:52:17,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-18 19:52:30,858 INFO [train.py:996] (0/4) Epoch 2, batch 17850, loss[loss=0.3446, simple_loss=0.3915, pruned_loss=0.1489, over 21979.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.3655, pruned_loss=0.122, over 4261315.40 frames. ], batch size: 317, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:52:35,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.113e+02 3.565e+02 4.214e+02 1.146e+03, threshold=7.130e+02, percent-clipped=1.0 2023-06-18 19:52:56,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=290130.0, ans=0.0 2023-06-18 19:53:06,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=290190.0, ans=0.1 2023-06-18 19:53:42,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=290310.0, ans=0.125 2023-06-18 19:53:52,462 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=15.0 2023-06-18 19:54:06,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=290310.0, ans=0.125 2023-06-18 19:54:09,667 INFO [train.py:996] (0/4) Epoch 2, batch 17900, loss[loss=0.3379, simple_loss=0.4117, pruned_loss=0.1321, over 21615.00 frames. ], tot_loss[loss=0.3108, simple_loss=0.3713, pruned_loss=0.1252, over 4268831.88 frames. ], batch size: 389, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:54:41,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=290490.0, ans=0.0 2023-06-18 19:55:02,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=290490.0, ans=0.04949747468305833 2023-06-18 19:55:29,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=290610.0, ans=0.125 2023-06-18 19:55:47,651 INFO [train.py:996] (0/4) Epoch 2, batch 17950, loss[loss=0.2683, simple_loss=0.3377, pruned_loss=0.09947, over 21421.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.371, pruned_loss=0.121, over 4259572.29 frames. 
], batch size: 211, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:55:52,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.329e+02 4.288e+02 5.696e+02 7.966e+02, threshold=8.576e+02, percent-clipped=5.0 2023-06-18 19:56:03,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=290730.0, ans=0.125 2023-06-18 19:56:05,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-18 19:56:23,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.91 vs. limit=15.0 2023-06-18 19:56:45,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=290850.0, ans=0.125 2023-06-18 19:57:20,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=290910.0, ans=0.0 2023-06-18 19:57:22,929 INFO [train.py:996] (0/4) Epoch 2, batch 18000, loss[loss=0.2984, simple_loss=0.3347, pruned_loss=0.131, over 21592.00 frames. ], tot_loss[loss=0.3011, simple_loss=0.3636, pruned_loss=0.1193, over 4260726.87 frames. ], batch size: 263, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 19:57:22,930 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 19:57:38,933 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2951, simple_loss=0.3927, pruned_loss=0.09871, over 1796401.00 frames. 2023-06-18 19:57:38,934 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 19:57:45,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=290970.0, ans=0.125 2023-06-18 19:57:53,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=290970.0, ans=0.2 2023-06-18 19:57:59,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=291030.0, ans=0.125 2023-06-18 19:58:09,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=291030.0, ans=0.0 2023-06-18 19:58:12,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=291030.0, ans=0.0 2023-06-18 19:58:15,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=291030.0, ans=0.1 2023-06-18 19:58:32,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=291090.0, ans=0.1 2023-06-18 19:59:15,546 INFO [train.py:996] (0/4) Epoch 2, batch 18050, loss[loss=0.2548, simple_loss=0.3183, pruned_loss=0.09565, over 21225.00 frames. ], tot_loss[loss=0.2975, simple_loss=0.3577, pruned_loss=0.1186, over 4262890.74 frames. 
], batch size: 176, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 19:59:24,570 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 3.442e+02 4.328e+02 5.219e+02 8.565e+02, threshold=8.656e+02, percent-clipped=0.0 2023-06-18 19:59:26,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=291270.0, ans=0.05 2023-06-18 19:59:36,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=291330.0, ans=0.0 2023-06-18 19:59:58,946 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-18 20:00:55,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=291510.0, ans=15.0 2023-06-18 20:00:57,650 INFO [train.py:996] (0/4) Epoch 2, batch 18100, loss[loss=0.3296, simple_loss=0.3722, pruned_loss=0.1436, over 21709.00 frames. ], tot_loss[loss=0.3029, simple_loss=0.3629, pruned_loss=0.1215, over 4263684.95 frames. ], batch size: 351, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:01:28,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=291630.0, ans=0.125 2023-06-18 20:02:34,135 INFO [train.py:996] (0/4) Epoch 2, batch 18150, loss[loss=0.2807, simple_loss=0.3374, pruned_loss=0.112, over 21818.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3625, pruned_loss=0.1209, over 4259574.36 frames. ], batch size: 317, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:02:36,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.84 vs. limit=22.5 2023-06-18 20:02:38,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.253e+02 3.964e+02 4.730e+02 7.645e+02, threshold=7.929e+02, percent-clipped=0.0 2023-06-18 20:03:02,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=291930.0, ans=0.125 2023-06-18 20:03:32,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=291990.0, ans=22.5 2023-06-18 20:03:58,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=292110.0, ans=10.0 2023-06-18 20:04:09,557 INFO [train.py:996] (0/4) Epoch 2, batch 18200, loss[loss=0.2433, simple_loss=0.2979, pruned_loss=0.09437, over 21569.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3578, pruned_loss=0.1213, over 4261314.18 frames. ], batch size: 263, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:04:42,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=292230.0, ans=0.0 2023-06-18 20:05:39,496 INFO [train.py:996] (0/4) Epoch 2, batch 18250, loss[loss=0.3194, simple_loss=0.3618, pruned_loss=0.1385, over 21926.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3476, pruned_loss=0.1166, over 4263727.90 frames. 
], batch size: 333, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:05:44,123 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.938e+02 3.797e+02 4.811e+02 7.621e+02, threshold=7.594e+02, percent-clipped=0.0 2023-06-18 20:06:29,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=292590.0, ans=0.95 2023-06-18 20:06:35,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=292590.0, ans=0.0 2023-06-18 20:07:15,032 INFO [train.py:996] (0/4) Epoch 2, batch 18300, loss[loss=0.3798, simple_loss=0.4544, pruned_loss=0.1526, over 21539.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3467, pruned_loss=0.1168, over 4267749.40 frames. ], batch size: 471, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:07:51,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2023-06-18 20:08:12,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=292890.0, ans=0.0 2023-06-18 20:08:50,551 INFO [train.py:996] (0/4) Epoch 2, batch 18350, loss[loss=0.3012, simple_loss=0.352, pruned_loss=0.1252, over 21744.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3502, pruned_loss=0.1163, over 4278916.38 frames. ], batch size: 351, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:08:55,236 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.344e+02 4.081e+02 5.632e+02 1.157e+03, threshold=8.162e+02, percent-clipped=13.0 2023-06-18 20:09:11,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=293130.0, ans=0.125 2023-06-18 20:09:26,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=293130.0, ans=0.0 2023-06-18 20:09:32,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=293130.0, ans=0.0 2023-06-18 20:09:57,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=293250.0, ans=0.0 2023-06-18 20:10:09,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=293250.0, ans=0.125 2023-06-18 20:10:10,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=293310.0, ans=0.0 2023-06-18 20:10:27,242 INFO [train.py:996] (0/4) Epoch 2, batch 18400, loss[loss=0.2407, simple_loss=0.3152, pruned_loss=0.08303, over 21716.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3461, pruned_loss=0.1141, over 4274101.73 frames. ], batch size: 247, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:11:56,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=293610.0, ans=0.0 2023-06-18 20:12:08,226 INFO [train.py:996] (0/4) Epoch 2, batch 18450, loss[loss=0.2131, simple_loss=0.2803, pruned_loss=0.07293, over 21234.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3426, pruned_loss=0.1092, over 4278070.83 frames. 
], batch size: 159, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:12:12,710 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.811e+02 3.741e+02 4.962e+02 8.715e+02, threshold=7.483e+02, percent-clipped=1.0 2023-06-18 20:13:22,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=293850.0, ans=0.1 2023-06-18 20:13:45,352 INFO [train.py:996] (0/4) Epoch 2, batch 18500, loss[loss=0.286, simple_loss=0.3481, pruned_loss=0.1119, over 21488.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3393, pruned_loss=0.1084, over 4275619.17 frames. ], batch size: 389, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:14:24,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=294090.0, ans=0.2 2023-06-18 20:14:47,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=294150.0, ans=0.125 2023-06-18 20:14:52,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=294150.0, ans=0.0 2023-06-18 20:14:54,746 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=22.5 2023-06-18 20:14:55,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=294150.0, ans=0.1 2023-06-18 20:15:05,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=294210.0, ans=0.1 2023-06-18 20:15:17,120 INFO [train.py:996] (0/4) Epoch 2, batch 18550, loss[loss=0.2941, simple_loss=0.3386, pruned_loss=0.1248, over 21773.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3363, pruned_loss=0.1077, over 4263336.45 frames. ], batch size: 351, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:15:26,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.027e+02 3.740e+02 5.224e+02 1.354e+03, threshold=7.479e+02, percent-clipped=4.0 2023-06-18 20:15:30,029 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:16:04,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294390.0, ans=0.1 2023-06-18 20:16:25,594 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.55 vs. limit=15.0 2023-06-18 20:16:34,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=294450.0, ans=0.125 2023-06-18 20:16:42,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=294510.0, ans=0.125 2023-06-18 20:16:42,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=294510.0, ans=0.125 2023-06-18 20:16:58,792 INFO [train.py:996] (0/4) Epoch 2, batch 18600, loss[loss=0.273, simple_loss=0.3388, pruned_loss=0.1036, over 21688.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.335, pruned_loss=0.1087, over 4256048.48 frames. 
], batch size: 298, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:17:26,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-18 20:17:54,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=15.0 2023-06-18 20:18:12,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294750.0, ans=0.1 2023-06-18 20:18:15,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=294810.0, ans=0.0 2023-06-18 20:18:35,048 INFO [train.py:996] (0/4) Epoch 2, batch 18650, loss[loss=0.3017, simple_loss=0.3398, pruned_loss=0.1318, over 21365.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3349, pruned_loss=0.1091, over 4264798.11 frames. ], batch size: 473, lr: 1.59e-02, grad_scale: 16.0 2023-06-18 20:18:40,797 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 3.101e+02 3.636e+02 4.483e+02 8.727e+02, threshold=7.272e+02, percent-clipped=3.0 2023-06-18 20:19:55,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=295110.0, ans=0.125 2023-06-18 20:20:06,195 INFO [train.py:996] (0/4) Epoch 2, batch 18700, loss[loss=0.2903, simple_loss=0.3321, pruned_loss=0.1242, over 21699.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3335, pruned_loss=0.111, over 4265358.56 frames. ], batch size: 231, lr: 1.59e-02, grad_scale: 16.0 2023-06-18 20:20:53,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=12.0 2023-06-18 20:21:33,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=295410.0, ans=0.0 2023-06-18 20:21:34,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=295410.0, ans=0.0 2023-06-18 20:21:39,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=295410.0, ans=0.1 2023-06-18 20:21:41,694 INFO [train.py:996] (0/4) Epoch 2, batch 18750, loss[loss=0.3651, simple_loss=0.4186, pruned_loss=0.1558, over 21607.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3382, pruned_loss=0.1153, over 4265101.44 frames. ], batch size: 414, lr: 1.59e-02, grad_scale: 16.0 2023-06-18 20:21:52,361 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.155e+02 3.822e+02 4.527e+02 9.030e+02, threshold=7.645e+02, percent-clipped=1.0 2023-06-18 20:22:05,305 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=8.301e-03 2023-06-18 20:23:09,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=295710.0, ans=0.2 2023-06-18 20:23:17,287 INFO [train.py:996] (0/4) Epoch 2, batch 18800, loss[loss=0.3374, simple_loss=0.4023, pruned_loss=0.1362, over 21577.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3447, pruned_loss=0.1168, over 4260289.48 frames. 
], batch size: 508, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:23:39,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=295830.0, ans=0.125 2023-06-18 20:24:32,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=295950.0, ans=0.125 2023-06-18 20:24:57,826 INFO [train.py:996] (0/4) Epoch 2, batch 18850, loss[loss=0.2573, simple_loss=0.3223, pruned_loss=0.09612, over 21538.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3383, pruned_loss=0.1094, over 4257625.66 frames. ], batch size: 441, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:25:03,594 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.896e+02 3.612e+02 4.924e+02 1.009e+03, threshold=7.223e+02, percent-clipped=2.0 2023-06-18 20:25:44,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=296190.0, ans=0.0 2023-06-18 20:26:30,515 INFO [train.py:996] (0/4) Epoch 2, batch 18900, loss[loss=0.3464, simple_loss=0.3779, pruned_loss=0.1574, over 15336.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3357, pruned_loss=0.1105, over 4239136.38 frames. ], batch size: 63, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:26:37,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=296370.0, ans=0.1 2023-06-18 20:26:55,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=296430.0, ans=0.125 2023-06-18 20:27:56,329 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-18 20:28:07,310 INFO [train.py:996] (0/4) Epoch 2, batch 18950, loss[loss=0.2571, simple_loss=0.3075, pruned_loss=0.1034, over 21165.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3365, pruned_loss=0.1135, over 4235732.10 frames. ], batch size: 608, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:28:18,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 3.002e+02 3.745e+02 4.553e+02 7.623e+02, threshold=7.489e+02, percent-clipped=1.0 2023-06-18 20:29:11,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=296790.0, ans=0.2 2023-06-18 20:29:30,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=296910.0, ans=0.125 2023-06-18 20:29:31,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=296910.0, ans=0.0 2023-06-18 20:29:42,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=296910.0, ans=0.0 2023-06-18 20:29:44,728 INFO [train.py:996] (0/4) Epoch 2, batch 19000, loss[loss=0.3337, simple_loss=0.384, pruned_loss=0.1417, over 21427.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3475, pruned_loss=0.1161, over 4232052.91 frames. 
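
The [scaling.py:962] Whitening entries compare a per-module statistic ("metric") against a limit; when the metric exceeds the limit, the module nudges its activations back towards a whiter, more isotropic channel covariance. A sketch of one such metric, equal to 1.0 when the covariance eigenvalues are all equal and growing as they spread out; treat the exact formula as an assumption rather than a copy of scaling.py:

import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations for one module."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]                   # (C, C) channel covariance
    mean_eig = cov.diagonal().mean()                 # mean eigenvalue (= mean of diag)
    mean_eig_sq = (cov * cov).sum() / cov.shape[0]   # mean squared eigenvalue, via trace(cov^2)/C
    return (mean_eig_sq / (mean_eig ** 2 + 1e-20)).item()

x = torch.randn(1000, 256) @ torch.randn(256, 256)   # deliberately correlated features
print(f"num_channels=256, metric={whitening_metric(x):.2f} vs. limit=15.0")
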
], batch size: 211, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:31:13,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=297210.0, ans=0.125 2023-06-18 20:31:28,459 INFO [train.py:996] (0/4) Epoch 2, batch 19050, loss[loss=0.295, simple_loss=0.3433, pruned_loss=0.1233, over 21438.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3525, pruned_loss=0.1208, over 4230070.10 frames. ], batch size: 194, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:31:34,523 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.373e+02 3.128e+02 4.146e+02 5.526e+02 1.033e+03, threshold=8.291e+02, percent-clipped=8.0 2023-06-18 20:32:34,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=297450.0, ans=0.125 2023-06-18 20:33:05,259 INFO [train.py:996] (0/4) Epoch 2, batch 19100, loss[loss=0.268, simple_loss=0.3195, pruned_loss=0.1082, over 21593.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.35, pruned_loss=0.1217, over 4243583.54 frames. ], batch size: 414, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:33:36,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=297630.0, ans=0.09899494936611666 2023-06-18 20:33:55,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=297690.0, ans=0.2 2023-06-18 20:34:21,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=297750.0, ans=0.07 2023-06-18 20:34:21,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=297750.0, ans=0.125 2023-06-18 20:34:24,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=297810.0, ans=0.125 2023-06-18 20:34:43,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=297870.0, ans=0.1 2023-06-18 20:34:44,325 INFO [train.py:996] (0/4) Epoch 2, batch 19150, loss[loss=0.3008, simple_loss=0.3535, pruned_loss=0.1241, over 20093.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3532, pruned_loss=0.1226, over 4254373.55 frames. ], batch size: 707, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:34:51,112 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.456e+02 4.497e+02 5.703e+02 1.044e+03, threshold=8.993e+02, percent-clipped=4.0 2023-06-18 20:35:17,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=297930.0, ans=0.1 2023-06-18 20:35:30,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=297990.0, ans=0.125 2023-06-18 20:35:37,221 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=22.5 2023-06-18 20:36:03,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-18 20:36:27,164 INFO [train.py:996] (0/4) Epoch 2, batch 19200, loss[loss=0.2985, simple_loss=0.3754, pruned_loss=0.1108, over 21436.00 frames. 
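
The lr values in the train.py lines decay slowly with training progress (1.60e-02 down to 1.55e-02 across this section). A sketch of an Eden-style schedule producing that kind of decay, with the learning rate shrinking as a power law in both the global batch index and the epoch; the constants and the example batch/epoch are illustrative, not the run's actual counters:

def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    """Learning rate that decays as a power law in both batch index and epoch."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

print(f"lr: {eden_lr(0.045, batch=36000, epoch=2.0):.2e}")  # ~1.57e-02, in the logged range
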
], tot_loss[loss=0.3074, simple_loss=0.3659, pruned_loss=0.1245, over 4251682.71 frames. ], batch size: 211, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:36:29,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=298170.0, ans=0.04949747468305833 2023-06-18 20:36:57,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=298230.0, ans=0.0 2023-06-18 20:37:10,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=298290.0, ans=0.09899494936611666 2023-06-18 20:37:47,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.54 vs. limit=15.0 2023-06-18 20:37:58,670 INFO [train.py:996] (0/4) Epoch 2, batch 19250, loss[loss=0.2549, simple_loss=0.3296, pruned_loss=0.09008, over 21785.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.3632, pruned_loss=0.117, over 4244061.35 frames. ], batch size: 247, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:38:09,019 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.933e+02 3.534e+02 4.386e+02 8.060e+02, threshold=7.069e+02, percent-clipped=0.0 2023-06-18 20:38:17,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-18 20:38:33,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=298530.0, ans=0.0 2023-06-18 20:39:28,104 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.41 vs. limit=15.0 2023-06-18 20:39:34,518 INFO [train.py:996] (0/4) Epoch 2, batch 19300, loss[loss=0.2554, simple_loss=0.3141, pruned_loss=0.09833, over 21738.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3599, pruned_loss=0.1169, over 4256191.35 frames. ], batch size: 247, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:40:04,430 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.87 vs. limit=12.0 2023-06-18 20:40:11,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=298830.0, ans=0.125 2023-06-18 20:41:17,387 INFO [train.py:996] (0/4) Epoch 2, batch 19350, loss[loss=0.3266, simple_loss=0.39, pruned_loss=0.1316, over 21612.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.3518, pruned_loss=0.1105, over 4256914.00 frames. ], batch size: 473, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:41:28,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 3.011e+02 3.707e+02 4.151e+02 9.500e+02, threshold=7.414e+02, percent-clipped=4.0 2023-06-18 20:41:45,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=299130.0, ans=0.1 2023-06-18 20:42:54,496 INFO [train.py:996] (0/4) Epoch 2, batch 19400, loss[loss=0.2188, simple_loss=0.2976, pruned_loss=0.06997, over 21569.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3498, pruned_loss=0.1093, over 4262314.69 frames. 
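
Each train.py line reports two loss blocks: loss[...] for the current batch and tot_loss[...] for a running, frame-weighted average whose frame count (around 4.2M-4.3M frames here) decays slowly instead of growing without bound. A sketch of that bookkeeping; the decay constant is an assumption chosen so the steady-state frame count lands near the logged values:

class RunningLoss:
    """Frame-weighted running average of the training loss."""

    def __init__(self, decay: float = 0.995):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        # old statistics decay geometrically, so the frame count stays bounded
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tracker = RunningLoss()
for _ in range(5000):
    tracker.update(batch_loss=0.29, batch_frames=21000.0)
print(f"tot_loss[loss={tracker.value:.4f}, over {tracker.frames:.2f} frames. ]")  # ~4.2e6 frames
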
], batch size: 230, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:43:08,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=299430.0, ans=0.125 2023-06-18 20:43:27,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=299490.0, ans=6.0 2023-06-18 20:43:43,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=299550.0, ans=0.0 2023-06-18 20:44:19,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=299670.0, ans=0.0 2023-06-18 20:44:25,376 INFO [train.py:996] (0/4) Epoch 2, batch 19450, loss[loss=0.2654, simple_loss=0.3205, pruned_loss=0.1052, over 21613.00 frames. ], tot_loss[loss=0.2866, simple_loss=0.3481, pruned_loss=0.1125, over 4268802.25 frames. ], batch size: 263, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:44:36,333 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 3.184e+02 3.692e+02 4.694e+02 9.525e+02, threshold=7.383e+02, percent-clipped=2.0 2023-06-18 20:44:41,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=299670.0, ans=0.0 2023-06-18 20:46:07,387 INFO [train.py:996] (0/4) Epoch 2, batch 19500, loss[loss=0.2564, simple_loss=0.3169, pruned_loss=0.09792, over 21672.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3441, pruned_loss=0.1144, over 4264978.77 frames. ], batch size: 247, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:46:30,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=300030.0, ans=0.2 2023-06-18 20:47:08,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=15.0 2023-06-18 20:47:45,720 INFO [train.py:996] (0/4) Epoch 2, batch 19550, loss[loss=0.2908, simple_loss=0.3688, pruned_loss=0.1064, over 21739.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.3396, pruned_loss=0.1119, over 4262718.70 frames. ], batch size: 414, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:47:51,804 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 3.032e+02 3.762e+02 4.813e+02 9.306e+02, threshold=7.523e+02, percent-clipped=3.0 2023-06-18 20:48:12,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=300330.0, ans=0.2 2023-06-18 20:49:17,048 INFO [train.py:996] (0/4) Epoch 2, batch 19600, loss[loss=0.297, simple_loss=0.3422, pruned_loss=0.1259, over 21434.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3425, pruned_loss=0.1137, over 4271747.82 frames. ], batch size: 211, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:49:39,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=300630.0, ans=0.2 2023-06-18 20:50:06,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=300750.0, ans=0.2 2023-06-18 20:50:18,226 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. 
limit=15.0 2023-06-18 20:50:29,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=300810.0, ans=0.1 2023-06-18 20:50:41,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=300810.0, ans=0.125 2023-06-18 20:50:44,625 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-18 20:50:49,882 INFO [train.py:996] (0/4) Epoch 2, batch 19650, loss[loss=0.307, simple_loss=0.3572, pruned_loss=0.1284, over 21827.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3502, pruned_loss=0.1197, over 4274736.90 frames. ], batch size: 298, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:50:54,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=300870.0, ans=0.125 2023-06-18 20:50:55,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 3.349e+02 4.086e+02 5.431e+02 7.953e+02, threshold=8.171e+02, percent-clipped=1.0 2023-06-18 20:50:58,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=300870.0, ans=0.125 2023-06-18 20:50:59,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=300870.0, ans=0.07 2023-06-18 20:51:13,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=300930.0, ans=0.05 2023-06-18 20:52:24,576 INFO [train.py:996] (0/4) Epoch 2, batch 19700, loss[loss=0.2336, simple_loss=0.2911, pruned_loss=0.08805, over 21387.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3533, pruned_loss=0.1193, over 4276708.70 frames. ], batch size: 131, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:52:32,323 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.80 vs. limit=15.0 2023-06-18 20:52:33,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=301170.0, ans=0.125 2023-06-18 20:53:24,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=301290.0, ans=0.125 2023-06-18 20:53:24,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=301290.0, ans=15.0 2023-06-18 20:53:33,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=301350.0, ans=0.125 2023-06-18 20:54:03,050 INFO [train.py:996] (0/4) Epoch 2, batch 19750, loss[loss=0.3218, simple_loss=0.3795, pruned_loss=0.132, over 21311.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.3646, pruned_loss=0.1218, over 4274191.43 frames. 
], batch size: 159, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:54:09,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.167e+02 3.934e+02 5.557e+02 1.096e+03, threshold=7.868e+02, percent-clipped=5.0 2023-06-18 20:54:46,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=301590.0, ans=0.125 2023-06-18 20:55:13,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=301650.0, ans=0.125 2023-06-18 20:55:14,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=301650.0, ans=0.1 2023-06-18 20:55:17,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=301650.0, ans=0.125 2023-06-18 20:55:40,459 INFO [train.py:996] (0/4) Epoch 2, batch 19800, loss[loss=0.3409, simple_loss=0.3867, pruned_loss=0.1475, over 21857.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.3641, pruned_loss=0.1236, over 4280536.40 frames. ], batch size: 414, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:55:42,621 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:56:12,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=301830.0, ans=0.125 2023-06-18 20:56:40,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=301890.0, ans=0.04949747468305833 2023-06-18 20:56:42,329 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:56:48,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=301950.0, ans=0.125 2023-06-18 20:57:22,428 INFO [train.py:996] (0/4) Epoch 2, batch 19850, loss[loss=0.2568, simple_loss=0.3436, pruned_loss=0.08495, over 21242.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3556, pruned_loss=0.1164, over 4275588.03 frames. ], batch size: 548, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:57:28,512 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.803e+02 3.716e+02 4.795e+02 8.783e+02, threshold=7.432e+02, percent-clipped=5.0 2023-06-18 20:57:55,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=302130.0, ans=0.0 2023-06-18 20:58:30,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=302250.0, ans=0.125 2023-06-18 20:58:39,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0 2023-06-18 20:58:54,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=302310.0, ans=0.125 2023-06-18 20:58:58,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=22.5 2023-06-18 20:59:00,046 INFO [train.py:996] (0/4) Epoch 2, batch 19900, loss[loss=0.2724, simple_loss=0.3482, pruned_loss=0.09835, over 21610.00 frames. 
], tot_loss[loss=0.2914, simple_loss=0.3559, pruned_loss=0.1134, over 4268936.58 frames. ], batch size: 263, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 20:59:56,001 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:00:21,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=302610.0, ans=0.1 2023-06-18 21:00:23,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=302610.0, ans=0.125 2023-06-18 21:00:35,157 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-18 21:00:35,686 INFO [train.py:996] (0/4) Epoch 2, batch 19950, loss[loss=0.2514, simple_loss=0.3059, pruned_loss=0.0984, over 21392.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.348, pruned_loss=0.1132, over 4269638.66 frames. ], batch size: 144, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:00:42,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=302670.0, ans=0.0 2023-06-18 21:00:46,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.768e+02 3.439e+02 5.262e+02 1.066e+03, threshold=6.877e+02, percent-clipped=5.0 2023-06-18 21:01:07,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=302730.0, ans=0.0 2023-06-18 21:01:36,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=302850.0, ans=0.0 2023-06-18 21:01:48,312 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.19 vs. limit=22.5 2023-06-18 21:02:16,330 INFO [train.py:996] (0/4) Epoch 2, batch 20000, loss[loss=0.2826, simple_loss=0.3382, pruned_loss=0.1135, over 21849.00 frames. ], tot_loss[loss=0.2867, simple_loss=0.3479, pruned_loss=0.1127, over 4274689.32 frames. ], batch size: 124, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:02:35,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=303030.0, ans=0.2 2023-06-18 21:03:40,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=303210.0, ans=0.1 2023-06-18 21:03:46,349 INFO [train.py:996] (0/4) Epoch 2, batch 20050, loss[loss=0.2983, simple_loss=0.353, pruned_loss=0.1218, over 21925.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3504, pruned_loss=0.1155, over 4285952.91 frames. 
], batch size: 333, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:03:46,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=303270.0, ans=0.0 2023-06-18 21:03:56,942 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.190e+02 3.752e+02 4.909e+02 8.771e+02, threshold=7.503e+02, percent-clipped=6.0 2023-06-18 21:04:33,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=303390.0, ans=0.125 2023-06-18 21:04:39,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=303390.0, ans=0.1 2023-06-18 21:04:48,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-18 21:05:33,704 INFO [train.py:996] (0/4) Epoch 2, batch 20100, loss[loss=0.2979, simple_loss=0.3618, pruned_loss=0.117, over 19779.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.353, pruned_loss=0.1188, over 4292426.52 frames. ], batch size: 704, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:05:43,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=303570.0, ans=0.1 2023-06-18 21:06:07,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=303630.0, ans=0.1 2023-06-18 21:06:12,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=303690.0, ans=0.04949747468305833 2023-06-18 21:06:32,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=303750.0, ans=0.125 2023-06-18 21:07:05,848 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=15.0 2023-06-18 21:07:17,570 INFO [train.py:996] (0/4) Epoch 2, batch 20150, loss[loss=0.3198, simple_loss=0.3785, pruned_loss=0.1305, over 21754.00 frames. ], tot_loss[loss=0.3054, simple_loss=0.3636, pruned_loss=0.1236, over 4291033.04 frames. ], batch size: 298, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:07:24,244 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.342e+02 4.169e+02 5.156e+02 8.825e+02, threshold=8.338e+02, percent-clipped=3.0 2023-06-18 21:07:52,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=303930.0, ans=0.2 2023-06-18 21:08:02,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=303990.0, ans=0.125 2023-06-18 21:08:20,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=304050.0, ans=0.2 2023-06-18 21:08:51,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=304110.0, ans=0.125 2023-06-18 21:08:58,629 INFO [train.py:996] (0/4) Epoch 2, batch 20200, loss[loss=0.31, simple_loss=0.3718, pruned_loss=0.1241, over 21628.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3705, pruned_loss=0.127, over 4280867.85 frames. 
], batch size: 263, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:09:40,234 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.96 vs. limit=15.0 2023-06-18 21:10:01,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=304350.0, ans=0.0 2023-06-18 21:10:35,652 INFO [train.py:996] (0/4) Epoch 2, batch 20250, loss[loss=0.3856, simple_loss=0.4099, pruned_loss=0.1806, over 21623.00 frames. ], tot_loss[loss=0.3092, simple_loss=0.3694, pruned_loss=0.1245, over 4279618.78 frames. ], batch size: 507, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:10:41,729 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 3.323e+02 4.182e+02 5.137e+02 1.003e+03, threshold=8.365e+02, percent-clipped=2.0 2023-06-18 21:11:11,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=304590.0, ans=0.125 2023-06-18 21:11:53,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2023-06-18 21:12:06,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=304710.0, ans=0.0 2023-06-18 21:12:13,292 INFO [train.py:996] (0/4) Epoch 2, batch 20300, loss[loss=0.2998, simple_loss=0.3833, pruned_loss=0.1082, over 21183.00 frames. ], tot_loss[loss=0.3067, simple_loss=0.3691, pruned_loss=0.1222, over 4274013.19 frames. ], batch size: 548, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:12:39,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=304830.0, ans=0.0 2023-06-18 21:13:05,116 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-18 21:13:23,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=304950.0, ans=0.2 2023-06-18 21:13:48,917 INFO [train.py:996] (0/4) Epoch 2, batch 20350, loss[loss=0.3378, simple_loss=0.3862, pruned_loss=0.1447, over 21800.00 frames. ], tot_loss[loss=0.3058, simple_loss=0.3678, pruned_loss=0.1219, over 4275627.71 frames. ], batch size: 441, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:13:54,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=305070.0, ans=0.125 2023-06-18 21:13:55,476 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 3.188e+02 3.897e+02 4.973e+02 9.485e+02, threshold=7.794e+02, percent-clipped=2.0 2023-06-18 21:14:53,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=305250.0, ans=0.125 2023-06-18 21:15:15,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-18 21:15:26,545 INFO [train.py:996] (0/4) Epoch 2, batch 20400, loss[loss=0.3505, simple_loss=0.4015, pruned_loss=0.1497, over 21946.00 frames. ], tot_loss[loss=0.311, simple_loss=0.3711, pruned_loss=0.1255, over 4266641.31 frames. 
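
Within each loss block, the reported loss is a weighted combination of simple_loss (from the trivial joiner) and pruned_loss (from the pruned lattice). At this stage of training the logged numbers are consistent with loss = 0.5 * simple_loss + 1.0 * pruned_loss (e.g. 0.5 * 0.3694 + 0.1245 ~= 0.3092 in the batch 20250 line above); the warm-up ramp in the sketch below is an assumption, only the steady-state weights are checked against the log:

def combine_losses(simple_loss: float, pruned_loss: float, batch_idx: int,
                   warm_step: int = 2000, simple_loss_scale: float = 0.5) -> float:
    """Weighted sum of the two transducer losses, ramped over warm-up."""
    warmed = min(batch_idx / warm_step, 1.0)
    # early on rely mostly on simple_loss; after warm-up use the steady weights
    s_scale = simple_loss_scale + (1.0 - warmed) * (1.0 - simple_loss_scale)
    p_scale = warmed
    return s_scale * simple_loss + p_scale * pruned_loss

print(combine_losses(0.3694, 0.1245, batch_idx=300000))  # ~0.3092 after warm-up
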
], batch size: 316, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:16:04,779 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-18 21:16:08,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=305490.0, ans=0.125 2023-06-18 21:16:37,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=305550.0, ans=0.1 2023-06-18 21:17:02,156 INFO [train.py:996] (0/4) Epoch 2, batch 20450, loss[loss=0.3213, simple_loss=0.3685, pruned_loss=0.137, over 21791.00 frames. ], tot_loss[loss=0.3137, simple_loss=0.3716, pruned_loss=0.1279, over 4267690.99 frames. ], batch size: 247, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:17:07,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.641e+02 3.573e+02 4.608e+02 6.565e+02 1.538e+03, threshold=9.216e+02, percent-clipped=19.0 2023-06-18 21:17:14,645 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.12 vs. limit=22.5 2023-06-18 21:17:18,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=305670.0, ans=0.0 2023-06-18 21:17:20,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=305730.0, ans=0.0 2023-06-18 21:17:21,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=305730.0, ans=0.125 2023-06-18 21:17:22,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-06-18 21:18:14,872 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:18:17,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-06-18 21:18:18,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=305850.0, ans=0.125 2023-06-18 21:18:26,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=305910.0, ans=0.0 2023-06-18 21:18:37,635 INFO [train.py:996] (0/4) Epoch 2, batch 20500, loss[loss=0.3477, simple_loss=0.3862, pruned_loss=0.1546, over 21402.00 frames. ], tot_loss[loss=0.314, simple_loss=0.3694, pruned_loss=0.1293, over 4259707.01 frames. ], batch size: 131, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:18:58,570 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-18 21:20:02,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.16 vs. 
limit=15.0 2023-06-18 21:20:08,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=306210.0, ans=0.0 2023-06-18 21:20:12,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0 2023-06-18 21:20:19,078 INFO [train.py:996] (0/4) Epoch 2, batch 20550, loss[loss=0.2522, simple_loss=0.326, pruned_loss=0.08922, over 21442.00 frames. ], tot_loss[loss=0.3066, simple_loss=0.3611, pruned_loss=0.126, over 4258089.90 frames. ], batch size: 194, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:20:25,446 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 3.461e+02 4.145e+02 5.402e+02 8.194e+02, threshold=8.291e+02, percent-clipped=0.0 2023-06-18 21:21:56,398 INFO [train.py:996] (0/4) Epoch 2, batch 20600, loss[loss=0.2652, simple_loss=0.3298, pruned_loss=0.1003, over 16475.00 frames. ], tot_loss[loss=0.3049, simple_loss=0.3621, pruned_loss=0.1239, over 4256046.55 frames. ], batch size: 61, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:22:06,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=306570.0, ans=0.1 2023-06-18 21:22:07,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=306570.0, ans=0.125 2023-06-18 21:22:09,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=306570.0, ans=0.0 2023-06-18 21:22:28,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=306630.0, ans=0.0 2023-06-18 21:22:56,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=306750.0, ans=0.0 2023-06-18 21:23:22,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=306810.0, ans=0.125 2023-06-18 21:23:32,693 INFO [train.py:996] (0/4) Epoch 2, batch 20650, loss[loss=0.2624, simple_loss=0.3118, pruned_loss=0.1064, over 21503.00 frames. ], tot_loss[loss=0.3025, simple_loss=0.3572, pruned_loss=0.1239, over 4269185.48 frames. ], batch size: 195, lr: 1.56e-02, grad_scale: 64.0 2023-06-18 21:23:38,861 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 3.072e+02 3.756e+02 5.105e+02 7.352e+02, threshold=7.512e+02, percent-clipped=0.0 2023-06-18 21:23:41,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=306870.0, ans=0.2 2023-06-18 21:23:45,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=306870.0, ans=0.125 2023-06-18 21:24:32,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=306990.0, ans=0.125 2023-06-18 21:25:12,203 INFO [train.py:996] (0/4) Epoch 2, batch 20700, loss[loss=0.3864, simple_loss=0.4403, pruned_loss=0.1663, over 21490.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.3509, pruned_loss=0.1199, over 4269062.89 frames. 
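
The grad_scale field (16.0, 32.0, 64.0 in these lines) is the dynamic loss scale used for fp16 training: it is halved when scaled gradients overflow and doubled after a stretch of stable steps, which is why it drifts up and down through the log. A sketch using torch.cuda.amp.GradScaler; the growth interval and the batch/loss plumbing are illustrative, not the recipe's exact code:

import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=1.0, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000
)

def train_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(batch["inputs"]), batch["targets"])
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales grads; skips the step on overflow
    scaler.update()                # halves or doubles the scale as needed
    return loss.detach(), scaler.get_scale()  # the second value is what gets logged as "grad_scale"
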
], batch size: 471, lr: 1.56e-02, grad_scale: 64.0 2023-06-18 21:25:25,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=307170.0, ans=0.2 2023-06-18 21:25:42,882 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5 2023-06-18 21:25:56,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=307290.0, ans=0.0 2023-06-18 21:26:08,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=307290.0, ans=0.0 2023-06-18 21:26:22,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=307350.0, ans=0.0 2023-06-18 21:26:49,461 INFO [train.py:996] (0/4) Epoch 2, batch 20750, loss[loss=0.3487, simple_loss=0.4272, pruned_loss=0.1351, over 21759.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3529, pruned_loss=0.119, over 4244943.38 frames. ], batch size: 332, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:26:57,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.977e+02 3.559e+02 4.590e+02 7.850e+02, threshold=7.118e+02, percent-clipped=2.0 2023-06-18 21:27:08,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=307470.0, ans=0.07 2023-06-18 21:27:29,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-18 21:27:44,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=307590.0, ans=0.2 2023-06-18 21:28:26,174 INFO [train.py:996] (0/4) Epoch 2, batch 20800, loss[loss=0.3414, simple_loss=0.3633, pruned_loss=0.1598, over 21323.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3547, pruned_loss=0.1198, over 4244948.93 frames. ], batch size: 507, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:29:04,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-18 21:29:25,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=307890.0, ans=0.125 2023-06-18 21:30:02,863 INFO [train.py:996] (0/4) Epoch 2, batch 20850, loss[loss=0.2552, simple_loss=0.3087, pruned_loss=0.1009, over 16673.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3472, pruned_loss=0.1171, over 4232489.37 frames. ], batch size: 61, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:30:16,772 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.417e+02 4.186e+02 5.469e+02 9.109e+02, threshold=8.373e+02, percent-clipped=11.0 2023-06-18 21:30:45,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=308130.0, ans=0.0 2023-06-18 21:31:22,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308310.0, ans=0.1 2023-06-18 21:31:37,924 INFO [train.py:996] (0/4) Epoch 2, batch 20900, loss[loss=0.2659, simple_loss=0.3288, pruned_loss=0.1015, over 21165.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3478, pruned_loss=0.1185, over 4246671.97 frames. 
], batch size: 159, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:32:28,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=308490.0, ans=0.0 2023-06-18 21:32:28,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=308490.0, ans=0.125 2023-06-18 21:33:12,548 INFO [train.py:996] (0/4) Epoch 2, batch 20950, loss[loss=0.2353, simple_loss=0.2977, pruned_loss=0.08641, over 21693.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3423, pruned_loss=0.1137, over 4250648.21 frames. ], batch size: 298, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:33:14,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=308670.0, ans=0.125 2023-06-18 21:33:21,942 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 3.051e+02 3.719e+02 4.723e+02 9.435e+02, threshold=7.438e+02, percent-clipped=1.0 2023-06-18 21:33:27,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=308730.0, ans=0.125 2023-06-18 21:34:47,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=308970.0, ans=0.125 2023-06-18 21:34:48,320 INFO [train.py:996] (0/4) Epoch 2, batch 21000, loss[loss=0.2676, simple_loss=0.3228, pruned_loss=0.1062, over 21817.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3407, pruned_loss=0.1137, over 4255211.36 frames. ], batch size: 282, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:34:48,321 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 21:34:58,354 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.1419, 1.3760, 2.8138, 1.9824], device='cuda:0') 2023-06-18 21:35:04,494 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2933, simple_loss=0.3899, pruned_loss=0.09838, over 1796401.00 frames. 2023-06-18 21:35:04,495 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 21:35:16,570 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.25 vs. 
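
The "Computing validation loss" block above runs the dev dataloader without gradients, reports a frame-weighted validation loss over the whole dev set, and then prints the peak CUDA memory seen so far. A sketch of that pass; the function and field names are illustrative rather than icefall's exact train.py code:

import torch

@torch.no_grad()
def compute_validation_loss(model, dev_loader, device):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in dev_loader:
        loss, num_frames = model(batch)  # assumed to return (per-frame loss, frame count)
        tot_loss += loss.item() * num_frames
        tot_frames += num_frames
    model.train()
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames.")
    print(f"Maximum memory allocated so far is {max_mb}MB")
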
limit=12.0 2023-06-18 21:35:17,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=308970.0, ans=0.125 2023-06-18 21:35:28,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=309030.0, ans=0.125 2023-06-18 21:35:31,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=309030.0, ans=0.125 2023-06-18 21:35:42,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=309030.0, ans=0.0 2023-06-18 21:35:48,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=309090.0, ans=0.125 2023-06-18 21:35:48,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=309090.0, ans=0.125 2023-06-18 21:36:01,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309090.0, ans=0.1 2023-06-18 21:36:23,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=309150.0, ans=0.0 2023-06-18 21:36:23,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=309150.0, ans=0.1 2023-06-18 21:36:40,628 INFO [train.py:996] (0/4) Epoch 2, batch 21050, loss[loss=0.2723, simple_loss=0.3247, pruned_loss=0.11, over 21890.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.34, pruned_loss=0.1148, over 4260993.19 frames. ], batch size: 107, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:36:40,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=309270.0, ans=0.0 2023-06-18 21:36:55,150 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.843e+02 3.389e+02 4.157e+02 8.301e+02, threshold=6.779e+02, percent-clipped=3.0 2023-06-18 21:37:14,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=309330.0, ans=0.025 2023-06-18 21:37:33,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=309390.0, ans=0.05 2023-06-18 21:37:58,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.79 vs. limit=15.0 2023-06-18 21:38:05,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=309510.0, ans=0.1 2023-06-18 21:38:11,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=309510.0, ans=0.125 2023-06-18 21:38:16,165 INFO [train.py:996] (0/4) Epoch 2, batch 21100, loss[loss=0.2832, simple_loss=0.3153, pruned_loss=0.1256, over 21580.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3372, pruned_loss=0.1154, over 4257865.81 frames. ], batch size: 231, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:38:26,862 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.97 vs. 
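
The attn_weights_entropy tensor printed during the validation block a few lines above is a per-module attention diagnostic. One plausible reading is the average entropy of each head's attention distribution (four values for a four-head module); the sketch below computes that quantity, but the exact reduction used in zipformer.py is an assumption:

import torch

def attn_weights_entropy(attn_weights: torch.Tensor) -> torch.Tensor:
    """attn_weights: (num_heads, tgt_len, src_len), each row summing to 1."""
    p = attn_weights.clamp(min=1e-20)
    entropy = -(p * p.log()).sum(dim=-1)  # (num_heads, tgt_len)
    return entropy.mean(dim=-1)           # one averaged entropy per head

weights = torch.softmax(torch.randn(4, 50, 50), dim=-1)
print(attn_weights_entropy(weights))      # tensor with four per-head values
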
limit=15.0 2023-06-18 21:38:51,464 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. limit=10.0 2023-06-18 21:39:15,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=309750.0, ans=0.07 2023-06-18 21:39:51,669 INFO [train.py:996] (0/4) Epoch 2, batch 21150, loss[loss=0.3074, simple_loss=0.3409, pruned_loss=0.1369, over 21692.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3318, pruned_loss=0.1143, over 4257956.71 frames. ], batch size: 417, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:40:05,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.856e+02 3.188e+02 4.098e+02 8.101e+02, threshold=6.375e+02, percent-clipped=2.0 2023-06-18 21:40:06,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=309870.0, ans=0.0 2023-06-18 21:41:00,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=310050.0, ans=0.125 2023-06-18 21:41:27,077 INFO [train.py:996] (0/4) Epoch 2, batch 21200, loss[loss=0.324, simple_loss=0.354, pruned_loss=0.147, over 21423.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3281, pruned_loss=0.1134, over 4246223.81 frames. ], batch size: 508, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:42:37,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.15 vs. limit=6.0 2023-06-18 21:42:40,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=310350.0, ans=0.0 2023-06-18 21:42:40,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=310350.0, ans=0.125 2023-06-18 21:43:02,681 INFO [train.py:996] (0/4) Epoch 2, batch 21250, loss[loss=0.2532, simple_loss=0.3054, pruned_loss=0.1005, over 21811.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3269, pruned_loss=0.1137, over 4252373.38 frames. ], batch size: 112, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:43:11,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.405e+02 2.991e+02 3.536e+02 4.577e+02 9.525e+02, threshold=7.072e+02, percent-clipped=11.0 2023-06-18 21:43:28,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=310530.0, ans=0.0 2023-06-18 21:44:05,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=310650.0, ans=0.0 2023-06-18 21:44:15,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=310650.0, ans=0.125 2023-06-18 21:44:19,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.16 vs. limit=15.0 2023-06-18 21:44:20,593 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-18 21:44:35,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.70 vs. 
limit=15.0 2023-06-18 21:44:38,908 INFO [train.py:996] (0/4) Epoch 2, batch 21300, loss[loss=0.2931, simple_loss=0.3597, pruned_loss=0.1133, over 21788.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3361, pruned_loss=0.1175, over 4246363.78 frames. ], batch size: 351, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:44:42,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=310770.0, ans=0.125 2023-06-18 21:45:44,863 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=15.0 2023-06-18 21:45:47,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=310950.0, ans=0.125 2023-06-18 21:46:10,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=311010.0, ans=0.2 2023-06-18 21:46:16,032 INFO [train.py:996] (0/4) Epoch 2, batch 21350, loss[loss=0.2771, simple_loss=0.3598, pruned_loss=0.0972, over 21817.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.342, pruned_loss=0.1188, over 4251674.25 frames. ], batch size: 351, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:46:30,283 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.675e+02 5.052e+02 5.900e+02 9.607e+02, threshold=1.010e+03, percent-clipped=12.0 2023-06-18 21:47:03,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=311190.0, ans=12.0 2023-06-18 21:47:30,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=311250.0, ans=0.125 2023-06-18 21:47:34,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=311250.0, ans=0.0 2023-06-18 21:47:38,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.05 vs. limit=22.5 2023-06-18 21:47:53,580 INFO [train.py:996] (0/4) Epoch 2, batch 21400, loss[loss=0.3261, simple_loss=0.3798, pruned_loss=0.1362, over 21385.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3468, pruned_loss=0.1189, over 4258949.31 frames. ], batch size: 159, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:48:27,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=311430.0, ans=0.1 2023-06-18 21:48:45,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=311490.0, ans=0.125 2023-06-18 21:48:52,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=311490.0, ans=15.0 2023-06-18 21:49:01,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.57 vs. 
limit=10.0 2023-06-18 21:49:02,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=311550.0, ans=0.125 2023-06-18 21:49:10,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=311550.0, ans=0.0 2023-06-18 21:49:34,474 INFO [train.py:996] (0/4) Epoch 2, batch 21450, loss[loss=0.3186, simple_loss=0.3653, pruned_loss=0.1359, over 21814.00 frames. ], tot_loss[loss=0.298, simple_loss=0.352, pruned_loss=0.122, over 4265891.96 frames. ], batch size: 124, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:49:34,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=311670.0, ans=0.1 2023-06-18 21:49:48,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.034e+02 3.817e+02 5.220e+02 1.129e+03, threshold=7.634e+02, percent-clipped=2.0 2023-06-18 21:50:47,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=311850.0, ans=0.015 2023-06-18 21:51:09,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=311970.0, ans=0.0 2023-06-18 21:51:15,123 INFO [train.py:996] (0/4) Epoch 2, batch 21500, loss[loss=0.268, simple_loss=0.3187, pruned_loss=0.1086, over 21243.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3494, pruned_loss=0.1222, over 4262597.68 frames. ], batch size: 608, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:51:21,608 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-52000.pt 2023-06-18 21:51:43,896 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0 2023-06-18 21:52:13,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=312150.0, ans=0.5 2023-06-18 21:52:28,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=312150.0, ans=0.0 2023-06-18 21:52:51,502 INFO [train.py:996] (0/4) Epoch 2, batch 21550, loss[loss=0.2255, simple_loss=0.2937, pruned_loss=0.07867, over 21661.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3405, pruned_loss=0.1179, over 4260361.24 frames. ], batch size: 415, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:53:05,532 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.167e+02 4.080e+02 5.090e+02 8.174e+02, threshold=8.161e+02, percent-clipped=3.0 2023-06-18 21:53:06,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=312270.0, ans=0.125 2023-06-18 21:53:09,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=312270.0, ans=0.015 2023-06-18 21:54:10,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=312510.0, ans=0.0 2023-06-18 21:54:28,567 INFO [train.py:996] (0/4) Epoch 2, batch 21600, loss[loss=0.2583, simple_loss=0.333, pruned_loss=0.09184, over 21551.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3339, pruned_loss=0.1147, over 4265544.50 frames. 
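
The [checkpoint.py:75] line above writes zipformer/exp_L_small_causal/checkpoint-52000.pt, i.e. a checkpoint keyed by the global batch index rather than the epoch. A sketch of periodic batch-based checkpointing with pruning of old files; the save interval, retention count and saved fields are assumptions:

from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, scheduler, batch_idx: int,
                          exp_dir: Path, save_every_n: int = 4000,
                          keep_last_k: int = 30) -> None:
    if batch_idx == 0 or batch_idx % save_every_n != 0:
        return
    path = exp_dir / f"checkpoint-{batch_idx}.pt"
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "batch_idx_train": batch_idx,
        },
        path,
    )
    # keep only the newest keep_last_k batch checkpoints
    ckpts = sorted(exp_dir.glob("checkpoint-*.pt"),
                   key=lambda p: int(p.stem.split("-")[1]))
    for old in ckpts[:-keep_last_k]:
        old.unlink()
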
], batch size: 263, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:54:44,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=22.5 2023-06-18 21:54:51,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=312630.0, ans=0.125 2023-06-18 21:54:51,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=312630.0, ans=0.125 2023-06-18 21:55:22,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=312690.0, ans=0.0 2023-06-18 21:55:38,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=312750.0, ans=0.125 2023-06-18 21:56:04,260 INFO [train.py:996] (0/4) Epoch 2, batch 21650, loss[loss=0.2723, simple_loss=0.3822, pruned_loss=0.08122, over 20834.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.338, pruned_loss=0.1124, over 4264975.51 frames. ], batch size: 607, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:56:16,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=22.5 2023-06-18 21:56:18,345 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 2.957e+02 3.372e+02 4.177e+02 8.367e+02, threshold=6.745e+02, percent-clipped=1.0 2023-06-18 21:57:28,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=313110.0, ans=0.0 2023-06-18 21:57:34,168 INFO [train.py:996] (0/4) Epoch 2, batch 21700, loss[loss=0.2617, simple_loss=0.3137, pruned_loss=0.1049, over 21392.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3381, pruned_loss=0.109, over 4269218.71 frames. ], batch size: 211, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:57:53,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=15.0 2023-06-18 21:58:19,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=313290.0, ans=0.125 2023-06-18 21:58:22,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=313290.0, ans=0.125 2023-06-18 21:58:29,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=313290.0, ans=0.125 2023-06-18 21:58:30,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-18 21:59:06,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=313410.0, ans=0.1 2023-06-18 21:59:09,017 INFO [train.py:996] (0/4) Epoch 2, batch 21750, loss[loss=0.2825, simple_loss=0.3143, pruned_loss=0.1254, over 21249.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3342, pruned_loss=0.1098, over 4264598.66 frames. 
], batch size: 176, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:59:16,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=313470.0, ans=0.95 2023-06-18 21:59:17,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=313470.0, ans=0.2 2023-06-18 21:59:28,452 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.954e+02 3.507e+02 4.548e+02 1.201e+03, threshold=7.014e+02, percent-clipped=5.0 2023-06-18 22:00:07,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=313590.0, ans=0.05 2023-06-18 22:00:08,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=313650.0, ans=0.125 2023-06-18 22:00:10,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=313650.0, ans=0.015 2023-06-18 22:00:35,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.38 vs. limit=6.0 2023-06-18 22:00:36,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=313710.0, ans=0.125 2023-06-18 22:00:50,254 INFO [train.py:996] (0/4) Epoch 2, batch 21800, loss[loss=0.3182, simple_loss=0.3772, pruned_loss=0.1297, over 21629.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3326, pruned_loss=0.1117, over 4273531.29 frames. ], batch size: 391, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 22:00:55,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313770.0, ans=0.1 2023-06-18 22:01:08,666 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-18 22:01:12,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=313830.0, ans=0.125 2023-06-18 22:01:19,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=313830.0, ans=0.125 2023-06-18 22:01:45,389 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=15.0 2023-06-18 22:01:46,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=313950.0, ans=10.0 2023-06-18 22:01:54,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313950.0, ans=0.1 2023-06-18 22:02:26,279 INFO [train.py:996] (0/4) Epoch 2, batch 21850, loss[loss=0.2767, simple_loss=0.3602, pruned_loss=0.09664, over 21378.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3399, pruned_loss=0.1129, over 4270981.07 frames. 
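The many ScheduledFloat lines (scaling.py:182) report the current value ("ans") of a regularisation knob, such as a dropout probability, skip rate or balancer bound, as a function of batch_count. The sketch below shows the general idea of a value scheduled piecewise-linearly in batch count; the class name and the breakpoints are illustrative assumptions, not the recipe's actual schedule.

```python
import bisect


class PiecewiseLinearSchedule:
    """Toy stand-in for a ScheduledFloat-style value: piecewise-linear in batch_count.

    The breakpoints below (0.3 decaying to 0.1 by batch 20000) are made up
    for illustration.
    """

    def __init__(self, points):
        # points: list of (batch_count, value), sorted by batch_count
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)


dropout_p = PiecewiseLinearSchedule([(0.0, 0.3), (20000.0, 0.1)])
for bc in (0.0, 311550.0):
    # mirrors the log format: name=..., batch_count=..., ans=...
    print(f"ScheduledFloat: name=feed_forward1.out_proj.dropout_p, "
          f"batch_count={bc}, ans={dropout_p(bc):.3f}")
```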
], batch size: 211, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 22:02:40,361 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.114e+02 3.894e+02 4.746e+02 8.265e+02, threshold=7.787e+02, percent-clipped=3.0 2023-06-18 22:03:03,181 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-18 22:03:38,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=314250.0, ans=0.0 2023-06-18 22:03:42,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=314310.0, ans=0.2 2023-06-18 22:03:44,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=314310.0, ans=0.2 2023-06-18 22:03:47,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=314310.0, ans=0.2 2023-06-18 22:03:48,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=314310.0, ans=0.0 2023-06-18 22:04:06,583 INFO [train.py:996] (0/4) Epoch 2, batch 21900, loss[loss=0.2792, simple_loss=0.3225, pruned_loss=0.118, over 21714.00 frames. ], tot_loss[loss=0.2856, simple_loss=0.3412, pruned_loss=0.115, over 4276509.35 frames. ], batch size: 333, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 22:04:11,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=314370.0, ans=0.125 2023-06-18 22:04:15,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-18 22:04:31,696 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=15.0 2023-06-18 22:04:48,876 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0 2023-06-18 22:04:54,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=314490.0, ans=0.125 2023-06-18 22:05:36,742 INFO [train.py:996] (0/4) Epoch 2, batch 21950, loss[loss=0.2822, simple_loss=0.3195, pruned_loss=0.1225, over 20993.00 frames. ], tot_loss[loss=0.2811, simple_loss=0.3358, pruned_loss=0.1132, over 4276260.43 frames. 
], batch size: 607, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:05:43,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=314670.0, ans=0.125 2023-06-18 22:05:46,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=314670.0, ans=0.2 2023-06-18 22:05:50,673 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.916e+02 3.382e+02 4.376e+02 8.385e+02, threshold=6.764e+02, percent-clipped=2.0 2023-06-18 22:06:18,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=314790.0, ans=0.125 2023-06-18 22:06:31,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=314850.0, ans=0.125 2023-06-18 22:07:13,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=314970.0, ans=0.125 2023-06-18 22:07:14,898 INFO [train.py:996] (0/4) Epoch 2, batch 22000, loss[loss=0.2466, simple_loss=0.3051, pruned_loss=0.09407, over 21436.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3276, pruned_loss=0.1083, over 4258395.43 frames. ], batch size: 131, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:07:22,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=314970.0, ans=0.0 2023-06-18 22:07:54,877 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=2.659e-03 2023-06-18 22:07:58,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=22.5 2023-06-18 22:08:10,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=315090.0, ans=0.125 2023-06-18 22:08:32,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=315150.0, ans=0.2 2023-06-18 22:08:41,110 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0 2023-06-18 22:09:01,073 INFO [train.py:996] (0/4) Epoch 2, batch 22050, loss[loss=0.2154, simple_loss=0.2991, pruned_loss=0.06581, over 21660.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3346, pruned_loss=0.1109, over 4263470.43 frames. ], batch size: 298, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:09:02,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.89 vs. limit=15.0 2023-06-18 22:09:10,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 3.309e+02 5.007e+02 6.581e+02 1.076e+03, threshold=1.001e+03, percent-clipped=24.0 2023-06-18 22:09:14,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=315270.0, ans=0.125 2023-06-18 22:09:16,839 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. 
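The optim.py:471 lines summarise recent gradient norms. The five numbers read as min/25%/50%/75%/max quantiles, and in this log the printed threshold equals Clipping_scale times the median (for the entry above, 2.0 * 5.007e+02 is roughly the reported 1.001e+03); percent-clipped is presumably the share of recent batches whose gradient exceeded it. A sketch of that bookkeeping, with the window handling simplified relative to icefall's ScaledAdam optimizer:

```python
import torch


def grad_norm_report(recent_norms: torch.Tensor, clipping_scale: float = 2.0) -> str:
    """Summarise a window of recent gradient norms in the style of the optim.py lines.

    Simplified sketch: the real optimizer keeps a rolling buffer of norms and
    counts actual clip events rather than recomputing them from the window.
    """
    q = torch.quantile(recent_norms,
                       torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2].item()          # scale * median
    pct = (recent_norms > threshold).float().mean().item() * 100.0
    quartiles = " ".join(f"{v:.3e}" for v in q.tolist())
    return (f"Clipping_scale={clipping_scale}, grad-norm quartiles {quartiles}, "
            f"threshold={threshold:.3e}, percent-clipped={pct:.1f}")


print(grad_norm_report(torch.tensor([193.6, 330.9, 500.7, 658.1, 1076.0])))
```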
limit=12.0 2023-06-18 22:09:18,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=315330.0, ans=0.125 2023-06-18 22:09:57,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=315450.0, ans=0.125 2023-06-18 22:10:39,911 INFO [train.py:996] (0/4) Epoch 2, batch 22100, loss[loss=0.3189, simple_loss=0.374, pruned_loss=0.1319, over 21272.00 frames. ], tot_loss[loss=0.2909, simple_loss=0.347, pruned_loss=0.1174, over 4260808.93 frames. ], batch size: 143, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:11:45,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=315750.0, ans=0.125 2023-06-18 22:12:17,298 INFO [train.py:996] (0/4) Epoch 2, batch 22150, loss[loss=0.3156, simple_loss=0.3655, pruned_loss=0.1329, over 21448.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3497, pruned_loss=0.1196, over 4269030.24 frames. ], batch size: 211, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:12:19,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=315870.0, ans=0.0 2023-06-18 22:12:26,476 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.397e+02 4.124e+02 4.864e+02 1.101e+03, threshold=8.247e+02, percent-clipped=1.0 2023-06-18 22:12:26,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=315870.0, ans=0.0 2023-06-18 22:12:49,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=315930.0, ans=10.0 2023-06-18 22:12:55,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=315990.0, ans=0.1 2023-06-18 22:13:15,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=316050.0, ans=0.0 2023-06-18 22:13:23,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-18 22:13:23,709 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-18 22:13:52,426 INFO [train.py:996] (0/4) Epoch 2, batch 22200, loss[loss=0.2815, simple_loss=0.3367, pruned_loss=0.1132, over 21423.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3506, pruned_loss=0.1209, over 4277850.22 frames. 
], batch size: 177, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:14:16,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=316230.0, ans=0.0 2023-06-18 22:14:16,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=316230.0, ans=0.125 2023-06-18 22:14:19,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=316230.0, ans=0.2 2023-06-18 22:14:43,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=316290.0, ans=0.0 2023-06-18 22:14:43,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.23 vs. limit=22.5 2023-06-18 22:14:47,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=316350.0, ans=0.125 2023-06-18 22:14:49,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=316350.0, ans=0.125 2023-06-18 22:14:53,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=316350.0, ans=0.2 2023-06-18 22:14:58,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=316350.0, ans=0.125 2023-06-18 22:14:59,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=316350.0, ans=0.125 2023-06-18 22:15:12,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=316410.0, ans=0.125 2023-06-18 22:15:33,132 INFO [train.py:996] (0/4) Epoch 2, batch 22250, loss[loss=0.3727, simple_loss=0.4197, pruned_loss=0.1628, over 21885.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.3606, pruned_loss=0.1238, over 4271110.82 frames. ], batch size: 124, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:15:43,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 2.999e+02 3.846e+02 4.976e+02 1.173e+03, threshold=7.692e+02, percent-clipped=5.0 2023-06-18 22:15:47,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=316530.0, ans=0.0 2023-06-18 22:16:18,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=316590.0, ans=0.0 2023-06-18 22:16:38,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=316710.0, ans=0.1 2023-06-18 22:16:49,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=316710.0, ans=0.2 2023-06-18 22:17:08,374 INFO [train.py:996] (0/4) Epoch 2, batch 22300, loss[loss=0.3145, simple_loss=0.3606, pruned_loss=0.1342, over 21813.00 frames. ], tot_loss[loss=0.308, simple_loss=0.3631, pruned_loss=0.1265, over 4279693.51 frames. 
], batch size: 441, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:17:21,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=316830.0, ans=0.125 2023-06-18 22:17:33,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=316830.0, ans=0.125 2023-06-18 22:17:58,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=316950.0, ans=0.125 2023-06-18 22:18:42,791 INFO [train.py:996] (0/4) Epoch 2, batch 22350, loss[loss=0.2679, simple_loss=0.3282, pruned_loss=0.1038, over 21859.00 frames. ], tot_loss[loss=0.307, simple_loss=0.3608, pruned_loss=0.1266, over 4282355.22 frames. ], batch size: 298, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:18:53,861 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.080e+02 3.447e+02 4.334e+02 7.080e+02, threshold=6.895e+02, percent-clipped=0.0 2023-06-18 22:19:02,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.77 vs. limit=6.0 2023-06-18 22:19:12,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=317130.0, ans=0.125 2023-06-18 22:20:19,400 INFO [train.py:996] (0/4) Epoch 2, batch 22400, loss[loss=0.2796, simple_loss=0.3348, pruned_loss=0.1122, over 21569.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3561, pruned_loss=0.1214, over 4285733.38 frames. ], batch size: 263, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:20:44,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.92 vs. limit=6.0 2023-06-18 22:20:48,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=317430.0, ans=0.125 2023-06-18 22:21:27,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=317550.0, ans=0.125 2023-06-18 22:21:45,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=317610.0, ans=0.125 2023-06-18 22:21:49,908 INFO [train.py:996] (0/4) Epoch 2, batch 22450, loss[loss=0.2401, simple_loss=0.2923, pruned_loss=0.09393, over 21547.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3493, pruned_loss=0.1202, over 4276619.42 frames. 
], batch size: 231, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:22:00,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 3.085e+02 3.597e+02 4.516e+02 1.181e+03, threshold=7.194e+02, percent-clipped=2.0 2023-06-18 22:22:02,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=317670.0, ans=0.2 2023-06-18 22:22:02,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=317670.0, ans=0.04949747468305833 2023-06-18 22:22:32,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=317790.0, ans=0.125 2023-06-18 22:22:47,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=317850.0, ans=0.0 2023-06-18 22:22:47,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-06-18 22:23:28,621 INFO [train.py:996] (0/4) Epoch 2, batch 22500, loss[loss=0.2913, simple_loss=0.3518, pruned_loss=0.1155, over 21243.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3435, pruned_loss=0.1194, over 4272253.77 frames. ], batch size: 176, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:23:56,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=318030.0, ans=0.125 2023-06-18 22:23:56,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=318030.0, ans=0.125 2023-06-18 22:24:46,568 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=15.0 2023-06-18 22:24:54,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=318210.0, ans=0.0 2023-06-18 22:25:10,130 INFO [train.py:996] (0/4) Epoch 2, batch 22550, loss[loss=0.4111, simple_loss=0.4454, pruned_loss=0.1884, over 21543.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3477, pruned_loss=0.1206, over 4271926.45 frames. ], batch size: 471, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:25:18,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=318270.0, ans=0.2 2023-06-18 22:25:26,412 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 3.552e+02 4.294e+02 6.006e+02 1.237e+03, threshold=8.588e+02, percent-clipped=14.0 2023-06-18 22:25:37,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-18 22:25:40,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=318330.0, ans=0.025 2023-06-18 22:25:59,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.85 vs. 
limit=15.0 2023-06-18 22:25:59,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=318390.0, ans=15.0 2023-06-18 22:26:06,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=318390.0, ans=0.125 2023-06-18 22:26:25,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318450.0, ans=0.1 2023-06-18 22:26:51,938 INFO [train.py:996] (0/4) Epoch 2, batch 22600, loss[loss=0.2163, simple_loss=0.2701, pruned_loss=0.08123, over 21172.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.3496, pruned_loss=0.12, over 4274590.98 frames. ], batch size: 143, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:27:00,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0 2023-06-18 22:27:22,767 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.16 vs. limit=10.0 2023-06-18 22:27:44,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=318690.0, ans=0.125 2023-06-18 22:27:46,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=318690.0, ans=0.125 2023-06-18 22:27:57,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=318750.0, ans=0.1 2023-06-18 22:28:04,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318750.0, ans=0.1 2023-06-18 22:28:29,441 INFO [train.py:996] (0/4) Epoch 2, batch 22650, loss[loss=0.2303, simple_loss=0.2839, pruned_loss=0.08837, over 21425.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3463, pruned_loss=0.1183, over 4270815.68 frames. ], batch size: 131, lr: 1.53e-02, grad_scale: 16.0 2023-06-18 22:28:41,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 3.125e+02 3.820e+02 4.472e+02 8.562e+02, threshold=7.640e+02, percent-clipped=0.0 2023-06-18 22:28:48,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=318930.0, ans=0.125 2023-06-18 22:29:12,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=318930.0, ans=0.2 2023-06-18 22:29:25,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=318990.0, ans=0.0 2023-06-18 22:30:07,829 INFO [train.py:996] (0/4) Epoch 2, batch 22700, loss[loss=0.2674, simple_loss=0.3127, pruned_loss=0.111, over 21557.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3408, pruned_loss=0.1178, over 4265940.03 frames. 
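The Whitening lines (scaling.py:962) compare a per-module statistic ("metric") of the activation covariance against a limit; when the metric stays above the limit, the Whiten module nudges the activations toward a more isotropic ("white") distribution. The exact metric is defined in icefall's scaling.py; the function below is only an illustrative proxy with the same flavour, equal to 1.0 for perfectly white features and growing when a few directions dominate.

```python
import torch


def whiteness_proxy(x: torch.Tensor) -> float:
    """Illustrative proxy for the 'metric' in the Whitening log lines.

    x: (num_frames, num_channels) activations.  Returns the ratio of the
    largest covariance eigenvalue to the mean eigenvalue.  This is NOT the
    exact formula used in scaling.py; it only shows the kind of statistic
    being compared against the limit.
    """
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)               # ascending, real for symmetric cov
    return (eigs[-1] / eigs.mean()).item()


white = torch.randn(1000, 256)                      # roughly isotropic features
skewed = white * torch.linspace(0.1, 3.0, 256)      # a few channels dominate
print(whiteness_proxy(white), whiteness_proxy(skewed))
```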
], batch size: 247, lr: 1.53e-02, grad_scale: 16.0 2023-06-18 22:30:24,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=319230.0, ans=0.05 2023-06-18 22:30:39,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=319230.0, ans=0.0 2023-06-18 22:31:00,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=319290.0, ans=0.0 2023-06-18 22:31:12,427 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=22.5 2023-06-18 22:31:25,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=319410.0, ans=0.125 2023-06-18 22:31:39,940 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. limit=6.0 2023-06-18 22:31:46,698 INFO [train.py:996] (0/4) Epoch 2, batch 22750, loss[loss=0.3771, simple_loss=0.4085, pruned_loss=0.1729, over 21141.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3413, pruned_loss=0.1199, over 4266358.32 frames. ], batch size: 143, lr: 1.53e-02, grad_scale: 16.0 2023-06-18 22:31:59,213 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.020e+02 3.645e+02 4.350e+02 9.693e+02, threshold=7.290e+02, percent-clipped=3.0 2023-06-18 22:32:34,851 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:32:36,967 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=12.0 2023-06-18 22:33:24,562 INFO [train.py:996] (0/4) Epoch 2, batch 22800, loss[loss=0.3426, simple_loss=0.3789, pruned_loss=0.1531, over 21564.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.3487, pruned_loss=0.124, over 4263175.88 frames. ], batch size: 548, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:33:30,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=12.0 2023-06-18 22:34:13,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=319890.0, ans=0.1 2023-06-18 22:34:25,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=319950.0, ans=0.0 2023-06-18 22:34:42,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=320010.0, ans=0.125 2023-06-18 22:34:43,006 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.70 vs. 
limit=15.0 2023-06-18 22:34:55,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=320010.0, ans=0.025 2023-06-18 22:34:56,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=320010.0, ans=0.0 2023-06-18 22:35:00,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=320070.0, ans=0.125 2023-06-18 22:35:01,242 INFO [train.py:996] (0/4) Epoch 2, batch 22850, loss[loss=0.2673, simple_loss=0.3094, pruned_loss=0.1126, over 21650.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3456, pruned_loss=0.1228, over 4262400.48 frames. ], batch size: 282, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:35:18,261 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.149e+02 3.783e+02 4.416e+02 9.029e+02, threshold=7.565e+02, percent-clipped=2.0 2023-06-18 22:35:20,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.69 vs. limit=15.0 2023-06-18 22:35:52,268 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-18 22:35:54,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=320190.0, ans=0.0 2023-06-18 22:36:38,196 INFO [train.py:996] (0/4) Epoch 2, batch 22900, loss[loss=0.2585, simple_loss=0.3156, pruned_loss=0.1007, over 21638.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3452, pruned_loss=0.1213, over 4268037.56 frames. ], batch size: 247, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:37:36,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=320490.0, ans=0.125 2023-06-18 22:38:03,738 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.26 vs. limit=22.5 2023-06-18 22:38:19,519 INFO [train.py:996] (0/4) Epoch 2, batch 22950, loss[loss=0.3383, simple_loss=0.4554, pruned_loss=0.1106, over 21320.00 frames. ], tot_loss[loss=0.298, simple_loss=0.3571, pruned_loss=0.1195, over 4274144.99 frames. 
], batch size: 548, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:38:28,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=320670.0, ans=0.0 2023-06-18 22:38:32,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 3.018e+02 3.691e+02 4.786e+02 9.826e+02, threshold=7.383e+02, percent-clipped=2.0 2023-06-18 22:38:32,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=320670.0, ans=0.125 2023-06-18 22:38:52,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=320730.0, ans=0.0 2023-06-18 22:38:55,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=320730.0, ans=0.125 2023-06-18 22:39:30,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=320850.0, ans=0.125 2023-06-18 22:39:48,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-18 22:39:54,962 INFO [train.py:996] (0/4) Epoch 2, batch 23000, loss[loss=0.3224, simple_loss=0.3692, pruned_loss=0.1378, over 21848.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3567, pruned_loss=0.1166, over 4284774.06 frames. ], batch size: 414, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:40:58,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=321150.0, ans=0.2 2023-06-18 22:41:01,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=321150.0, ans=0.125 2023-06-18 22:41:26,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=321210.0, ans=0.0 2023-06-18 22:41:32,470 INFO [train.py:996] (0/4) Epoch 2, batch 23050, loss[loss=0.3297, simple_loss=0.3788, pruned_loss=0.1404, over 21236.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3581, pruned_loss=0.1193, over 4288056.80 frames. ], batch size: 143, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:41:54,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 3.225e+02 4.039e+02 5.218e+02 8.181e+02, threshold=8.078e+02, percent-clipped=3.0 2023-06-18 22:42:50,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=321450.0, ans=0.125 2023-06-18 22:43:08,187 INFO [train.py:996] (0/4) Epoch 2, batch 23100, loss[loss=0.2864, simple_loss=0.3213, pruned_loss=0.1258, over 21595.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3531, pruned_loss=0.1194, over 4280936.29 frames. ], batch size: 231, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:43:28,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.14 vs. limit=22.5 2023-06-18 22:43:28,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=321570.0, ans=0.0 2023-06-18 22:43:52,504 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.38 vs. 
limit=15.0 2023-06-18 22:44:10,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=321750.0, ans=0.0 2023-06-18 22:44:42,563 INFO [train.py:996] (0/4) Epoch 2, batch 23150, loss[loss=0.3615, simple_loss=0.3844, pruned_loss=0.1693, over 21587.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3472, pruned_loss=0.1181, over 4269276.92 frames. ], batch size: 471, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:44:43,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=321870.0, ans=0.0 2023-06-18 22:44:50,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=321870.0, ans=0.0 2023-06-18 22:44:58,932 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.109e+02 3.658e+02 4.384e+02 7.114e+02, threshold=7.315e+02, percent-clipped=0.0 2023-06-18 22:44:59,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=321870.0, ans=0.1 2023-06-18 22:45:08,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=321930.0, ans=0.1 2023-06-18 22:45:26,254 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-18 22:45:34,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=321990.0, ans=0.0 2023-06-18 22:45:36,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=322050.0, ans=0.04949747468305833 2023-06-18 22:46:11,312 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:46:12,279 INFO [train.py:996] (0/4) Epoch 2, batch 23200, loss[loss=0.259, simple_loss=0.3133, pruned_loss=0.1024, over 21377.00 frames. ], tot_loss[loss=0.2943, simple_loss=0.3484, pruned_loss=0.1201, over 4279500.86 frames. ], batch size: 176, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:46:39,709 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.74 vs. limit=15.0 2023-06-18 22:47:23,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=322350.0, ans=0.125 2023-06-18 22:47:29,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=322350.0, ans=0.0 2023-06-18 22:47:39,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-18 22:47:47,998 INFO [train.py:996] (0/4) Epoch 2, batch 23250, loss[loss=0.3621, simple_loss=0.3949, pruned_loss=0.1646, over 21901.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3488, pruned_loss=0.122, over 4291398.30 frames. 
], batch size: 414, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:48:04,719 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.062e+02 3.496e+02 4.224e+02 8.959e+02, threshold=6.992e+02, percent-clipped=2.0 2023-06-18 22:48:19,821 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=22.5 2023-06-18 22:48:26,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=322530.0, ans=15.0 2023-06-18 22:49:09,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=322710.0, ans=0.125 2023-06-18 22:49:25,578 INFO [train.py:996] (0/4) Epoch 2, batch 23300, loss[loss=0.3837, simple_loss=0.4349, pruned_loss=0.1662, over 21719.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.358, pruned_loss=0.1244, over 4292877.65 frames. ], batch size: 441, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:49:54,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=322830.0, ans=0.125 2023-06-18 22:50:23,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=322950.0, ans=0.0 2023-06-18 22:51:02,222 INFO [train.py:996] (0/4) Epoch 2, batch 23350, loss[loss=0.2292, simple_loss=0.2887, pruned_loss=0.08489, over 21206.00 frames. ], tot_loss[loss=0.3036, simple_loss=0.3615, pruned_loss=0.1229, over 4287914.85 frames. ], batch size: 143, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:51:18,956 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 3.248e+02 3.923e+02 4.916e+02 7.049e+02, threshold=7.847e+02, percent-clipped=1.0 2023-06-18 22:51:40,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.63 vs. limit=22.5 2023-06-18 22:51:55,673 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.43 vs. limit=15.0 2023-06-18 22:52:22,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=323310.0, ans=0.1 2023-06-18 22:52:37,476 INFO [train.py:996] (0/4) Epoch 2, batch 23400, loss[loss=0.2685, simple_loss=0.3178, pruned_loss=0.1096, over 21316.00 frames. ], tot_loss[loss=0.2944, simple_loss=0.3537, pruned_loss=0.1175, over 4293707.64 frames. ], batch size: 159, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:52:49,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=323370.0, ans=0.125 2023-06-18 22:53:36,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=323550.0, ans=0.1 2023-06-18 22:53:55,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=323550.0, ans=0.0 2023-06-18 22:53:55,102 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.120e-02 2023-06-18 22:54:20,635 INFO [train.py:996] (0/4) Epoch 2, batch 23450, loss[loss=0.3944, simple_loss=0.4188, pruned_loss=0.185, over 21326.00 frames. 
], tot_loss[loss=0.2974, simple_loss=0.3539, pruned_loss=0.1205, over 4294442.05 frames. ], batch size: 176, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:54:33,419 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.119e+02 3.774e+02 4.736e+02 8.725e+02, threshold=7.548e+02, percent-clipped=2.0 2023-06-18 22:55:16,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=323850.0, ans=0.125 2023-06-18 22:55:59,329 INFO [train.py:996] (0/4) Epoch 2, batch 23500, loss[loss=0.2931, simple_loss=0.3985, pruned_loss=0.0938, over 19869.00 frames. ], tot_loss[loss=0.3006, simple_loss=0.3562, pruned_loss=0.1225, over 4286616.19 frames. ], batch size: 702, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:56:11,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-18 22:56:14,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=324030.0, ans=0.125 2023-06-18 22:56:16,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.63 vs. limit=6.0 2023-06-18 22:56:28,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=324030.0, ans=0.0 2023-06-18 22:56:28,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-18 22:56:30,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.84 vs. limit=6.0 2023-06-18 22:56:36,402 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=15.0 2023-06-18 22:56:39,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=324090.0, ans=0.0 2023-06-18 22:56:42,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=324090.0, ans=0.125 2023-06-18 22:56:59,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=324150.0, ans=0.125 2023-06-18 22:57:36,239 INFO [train.py:996] (0/4) Epoch 2, batch 23550, loss[loss=0.2831, simple_loss=0.3267, pruned_loss=0.1198, over 21872.00 frames. ], tot_loss[loss=0.299, simple_loss=0.3522, pruned_loss=0.1229, over 4279571.05 frames. 
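The learning rate printed with each batch summary drifts down slowly, from about 1.55e-02 at the start of this section to 1.51e-02 by around batch 24100 of epoch 2. The Zipformer recipes schedule it with the Eden scheduler, which decays the base LR as a power law in both the batch and epoch counts. A sketch of that shape follows; the exponents, the epoch/batch conventions and the example inputs are assumptions, so treat icefall's optim.py as authoritative.

```python
def eden_like_lr(base_lr: float, batch: float, epoch: float,
                 lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    """Eden-style decay (sketch): power-law in batch and epoch counts.

    Exponents and default constants are assumptions for illustration.
    """
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor


# With assumed inputs roughly matching this point in training, the value lands
# near the ~1.5e-02 printed in this section:
print(eden_like_lr(base_lr=0.045, batch=52000, epoch=1.0))  # ~0.0155
```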
], batch size: 107, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:57:45,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=324270.0, ans=0.125 2023-06-18 22:57:48,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.262e+02 3.808e+02 4.439e+02 7.936e+02, threshold=7.617e+02, percent-clipped=1.0 2023-06-18 22:58:34,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=324450.0, ans=0.1 2023-06-18 22:58:34,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=324450.0, ans=0.05 2023-06-18 22:58:59,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=324510.0, ans=0.1 2023-06-18 22:59:03,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=324510.0, ans=0.125 2023-06-18 22:59:14,007 INFO [train.py:996] (0/4) Epoch 2, batch 23600, loss[loss=0.2844, simple_loss=0.3441, pruned_loss=0.1124, over 21712.00 frames. ], tot_loss[loss=0.2997, simple_loss=0.3519, pruned_loss=0.1237, over 4280629.81 frames. ], batch size: 298, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:59:17,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=324570.0, ans=0.125 2023-06-18 22:59:36,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=324630.0, ans=0.0 2023-06-18 22:59:38,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=324630.0, ans=0.04949747468305833 2023-06-18 23:00:12,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=324690.0, ans=0.0 2023-06-18 23:00:39,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=324810.0, ans=0.0 2023-06-18 23:00:52,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=324870.0, ans=0.0 2023-06-18 23:00:57,787 INFO [train.py:996] (0/4) Epoch 2, batch 23650, loss[loss=0.2641, simple_loss=0.3271, pruned_loss=0.1006, over 21628.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3535, pruned_loss=0.1228, over 4282839.00 frames. ], batch size: 230, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:01:10,459 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.426e+02 3.434e+02 4.143e+02 5.445e+02 9.457e+02, threshold=8.285e+02, percent-clipped=4.0 2023-06-18 23:01:53,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=324990.0, ans=0.2 2023-06-18 23:02:05,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-18 23:02:13,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=325050.0, ans=0.125 2023-06-18 23:02:38,626 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.95 vs. 
limit=15.0 2023-06-18 23:02:39,092 INFO [train.py:996] (0/4) Epoch 2, batch 23700, loss[loss=0.3302, simple_loss=0.3753, pruned_loss=0.1426, over 21278.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.3581, pruned_loss=0.1231, over 4282997.68 frames. ], batch size: 143, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:02:47,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=325170.0, ans=0.2 2023-06-18 23:03:19,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=325230.0, ans=0.125 2023-06-18 23:03:34,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=325290.0, ans=0.125 2023-06-18 23:04:20,805 INFO [train.py:996] (0/4) Epoch 2, batch 23750, loss[loss=0.2441, simple_loss=0.3201, pruned_loss=0.08405, over 21302.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3591, pruned_loss=0.1229, over 4282815.49 frames. ], batch size: 176, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:04:38,215 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.156e+02 3.937e+02 4.892e+02 1.167e+03, threshold=7.875e+02, percent-clipped=3.0 2023-06-18 23:05:12,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=325590.0, ans=0.125 2023-06-18 23:06:06,063 INFO [train.py:996] (0/4) Epoch 2, batch 23800, loss[loss=0.3644, simple_loss=0.4354, pruned_loss=0.1467, over 21635.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3574, pruned_loss=0.12, over 4285232.64 frames. ], batch size: 414, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:06:45,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=325830.0, ans=0.0 2023-06-18 23:07:03,706 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-18 23:07:07,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=325950.0, ans=10.0 2023-06-18 23:07:07,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325950.0, ans=0.1 2023-06-18 23:07:07,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=325950.0, ans=0.0 2023-06-18 23:07:51,528 INFO [train.py:996] (0/4) Epoch 2, batch 23850, loss[loss=0.3442, simple_loss=0.388, pruned_loss=0.1502, over 21528.00 frames. ], tot_loss[loss=0.3078, simple_loss=0.3689, pruned_loss=0.1234, over 4278922.08 frames. 
], batch size: 194, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:08:09,325 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 3.353e+02 4.244e+02 5.255e+02 8.980e+02, threshold=8.488e+02, percent-clipped=3.0 2023-06-18 23:08:29,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=326190.0, ans=0.0 2023-06-18 23:08:35,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=326190.0, ans=0.0 2023-06-18 23:09:07,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=326250.0, ans=0.0 2023-06-18 23:09:12,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=326310.0, ans=0.0 2023-06-18 23:09:30,862 INFO [train.py:996] (0/4) Epoch 2, batch 23900, loss[loss=0.2839, simple_loss=0.3498, pruned_loss=0.109, over 21504.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3756, pruned_loss=0.1258, over 4280652.33 frames. ], batch size: 389, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:10:14,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=326490.0, ans=0.1 2023-06-18 23:11:08,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=326670.0, ans=0.125 2023-06-18 23:11:08,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=326670.0, ans=0.1 2023-06-18 23:11:09,864 INFO [train.py:996] (0/4) Epoch 2, batch 23950, loss[loss=0.3046, simple_loss=0.3519, pruned_loss=0.1286, over 21883.00 frames. ], tot_loss[loss=0.3106, simple_loss=0.3694, pruned_loss=0.1258, over 4280073.79 frames. ], batch size: 372, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:11:28,008 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.116e+02 3.830e+02 4.465e+02 7.558e+02, threshold=7.660e+02, percent-clipped=0.0 2023-06-18 23:12:01,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=326790.0, ans=0.0 2023-06-18 23:12:50,758 INFO [train.py:996] (0/4) Epoch 2, batch 24000, loss[loss=0.343, simple_loss=0.3992, pruned_loss=0.1434, over 21596.00 frames. ], tot_loss[loss=0.3142, simple_loss=0.3702, pruned_loss=0.1291, over 4282898.66 frames. ], batch size: 415, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:12:50,759 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-18 23:13:09,347 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2897, simple_loss=0.3899, pruned_loss=0.09475, over 1796401.00 frames. 2023-06-18 23:13:09,348 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-18 23:13:44,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=327030.0, ans=0.07 2023-06-18 23:14:09,363 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.62 vs. limit=22.5 2023-06-18 23:14:14,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.25 vs. 
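At batch 24000 the script pauses to compute validation loss on the dev set (train.py:1019/1028) and then reports the peak GPU memory seen so far. A simplified sketch of that pattern is below; the model and dataloader interfaces are placeholders, and the real code accumulates loss, simple_loss and pruned_loss separately, weighted by frame counts.

```python
import torch


@torch.no_grad()
def compute_validation_loss(model, valid_loader) -> float:
    """Average loss over the dev set at the periodic validation points.

    `model(batch)` returning a scalar loss and `batch["num_frames"]` are
    placeholder interfaces, not icefall's actual API.
    """
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in valid_loader:
        loss = model(batch)
        frames = float(batch["num_frames"])
        tot_loss += loss.item() * frames
        tot_frames += frames
    model.train()
    return tot_loss / max(tot_frames, 1.0)


# Peak GPU memory, reported like the "Maximum memory allocated so far" line:
if torch.cuda.is_available():
    peak_mb = torch.cuda.max_memory_allocated(device=0) // (1024 * 1024)
    print(f"Maximum memory allocated so far is {peak_mb}MB")
```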
limit=15.0 2023-06-18 23:14:32,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=327210.0, ans=0.125 2023-06-18 23:14:48,754 INFO [train.py:996] (0/4) Epoch 2, batch 24050, loss[loss=0.2937, simple_loss=0.3705, pruned_loss=0.1085, over 21650.00 frames. ], tot_loss[loss=0.3153, simple_loss=0.3718, pruned_loss=0.1294, over 4287281.43 frames. ], batch size: 389, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:15:06,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.470e+02 4.161e+02 4.943e+02 1.064e+03, threshold=8.323e+02, percent-clipped=4.0 2023-06-18 23:15:25,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-18 23:15:47,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=327390.0, ans=0.125 2023-06-18 23:15:48,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.26 vs. limit=22.5 2023-06-18 23:15:56,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=327450.0, ans=0.2 2023-06-18 23:16:01,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=327450.0, ans=0.125 2023-06-18 23:16:17,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=327510.0, ans=0.5 2023-06-18 23:16:34,970 INFO [train.py:996] (0/4) Epoch 2, batch 24100, loss[loss=0.3225, simple_loss=0.3889, pruned_loss=0.128, over 21672.00 frames. ], tot_loss[loss=0.3124, simple_loss=0.3706, pruned_loss=0.127, over 4282488.90 frames. ], batch size: 389, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:16:59,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327630.0, ans=0.1 2023-06-18 23:17:47,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327810.0, ans=0.1 2023-06-18 23:17:48,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=327810.0, ans=0.125 2023-06-18 23:17:58,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=327810.0, ans=0.0 2023-06-18 23:18:14,213 INFO [train.py:996] (0/4) Epoch 2, batch 24150, loss[loss=0.2888, simple_loss=0.34, pruned_loss=0.1188, over 21268.00 frames. ], tot_loss[loss=0.3129, simple_loss=0.3691, pruned_loss=0.1283, over 4287956.91 frames. 
], batch size: 176, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:18:24,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=327870.0, ans=0.125 2023-06-18 23:18:26,898 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.990e+02 3.404e+02 4.259e+02 8.342e+02, threshold=6.809e+02, percent-clipped=1.0 2023-06-18 23:18:37,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327930.0, ans=0.1 2023-06-18 23:18:37,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327930.0, ans=0.1 2023-06-18 23:19:23,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=328050.0, ans=0.125 2023-06-18 23:19:23,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=328050.0, ans=0.1 2023-06-18 23:19:55,225 INFO [train.py:996] (0/4) Epoch 2, batch 24200, loss[loss=0.283, simple_loss=0.3474, pruned_loss=0.1093, over 21663.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3735, pruned_loss=0.131, over 4290622.76 frames. ], batch size: 247, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:20:25,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=328230.0, ans=0.125 2023-06-18 23:20:49,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=22.5 2023-06-18 23:20:58,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=328350.0, ans=0.2 2023-06-18 23:21:29,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=328410.0, ans=0.125 2023-06-18 23:21:41,626 INFO [train.py:996] (0/4) Epoch 2, batch 24250, loss[loss=0.2539, simple_loss=0.3326, pruned_loss=0.08762, over 21641.00 frames. ], tot_loss[loss=0.3079, simple_loss=0.3699, pruned_loss=0.123, over 4282985.33 frames. ], batch size: 230, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:21:42,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=328470.0, ans=0.5 2023-06-18 23:21:47,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=328470.0, ans=0.0 2023-06-18 23:21:59,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 3.026e+02 3.601e+02 5.036e+02 9.709e+02, threshold=7.202e+02, percent-clipped=3.0 2023-06-18 23:22:00,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=328470.0, ans=0.2 2023-06-18 23:22:09,023 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.39 vs. 
limit=22.5 2023-06-18 23:22:19,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=328590.0, ans=0.1 2023-06-18 23:22:35,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=328590.0, ans=0.125 2023-06-18 23:23:21,445 INFO [train.py:996] (0/4) Epoch 2, batch 24300, loss[loss=0.2573, simple_loss=0.3245, pruned_loss=0.09503, over 21788.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3609, pruned_loss=0.115, over 4274949.05 frames. ], batch size: 332, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:23:43,178 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.56 vs. limit=15.0 2023-06-18 23:24:44,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=329010.0, ans=0.05 2023-06-18 23:25:04,620 INFO [train.py:996] (0/4) Epoch 2, batch 24350, loss[loss=0.3056, simple_loss=0.3568, pruned_loss=0.1272, over 21424.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.356, pruned_loss=0.1149, over 4282783.59 frames. ], batch size: 211, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:25:10,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=329070.0, ans=0.125 2023-06-18 23:25:16,950 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-06-18 23:25:17,425 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.896e+02 3.511e+02 4.657e+02 9.016e+02, threshold=7.022e+02, percent-clipped=4.0 2023-06-18 23:25:31,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=329130.0, ans=0.125 2023-06-18 23:25:31,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=329130.0, ans=0.125 2023-06-18 23:26:14,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=329250.0, ans=0.125 2023-06-18 23:26:15,753 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.889e-03 2023-06-18 23:26:38,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=329310.0, ans=0.125 2023-06-18 23:26:44,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=329370.0, ans=0.2 2023-06-18 23:26:45,796 INFO [train.py:996] (0/4) Epoch 2, batch 24400, loss[loss=0.3182, simple_loss=0.3629, pruned_loss=0.1368, over 21869.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3634, pruned_loss=0.1209, over 4282059.06 frames. ], batch size: 107, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:26:48,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.26 vs. 
limit=15.0 2023-06-18 23:27:01,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=329430.0, ans=0.05 2023-06-18 23:27:02,252 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=22.5 2023-06-18 23:27:03,388 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:27:06,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329430.0, ans=0.1 2023-06-18 23:28:00,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=329550.0, ans=0.0 2023-06-18 23:28:20,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=329610.0, ans=0.125 2023-06-18 23:28:25,851 INFO [train.py:996] (0/4) Epoch 2, batch 24450, loss[loss=0.381, simple_loss=0.4595, pruned_loss=0.1513, over 21699.00 frames. ], tot_loss[loss=0.3049, simple_loss=0.3653, pruned_loss=0.1222, over 4277049.28 frames. ], batch size: 414, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:28:38,615 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.395e+02 4.151e+02 4.993e+02 8.571e+02, threshold=8.301e+02, percent-clipped=4.0 2023-06-18 23:28:44,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=329730.0, ans=0.025 2023-06-18 23:28:59,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329730.0, ans=0.1 2023-06-18 23:29:07,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329790.0, ans=0.1 2023-06-18 23:30:04,438 INFO [train.py:996] (0/4) Epoch 2, batch 24500, loss[loss=0.2494, simple_loss=0.3068, pruned_loss=0.09599, over 21180.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.3665, pruned_loss=0.1215, over 4285051.30 frames. ], batch size: 607, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:30:49,571 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.54 vs. limit=10.0 2023-06-18 23:31:25,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=330150.0, ans=0.2 2023-06-18 23:31:37,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=330210.0, ans=0.0 2023-06-18 23:31:44,658 INFO [train.py:996] (0/4) Epoch 2, batch 24550, loss[loss=0.2703, simple_loss=0.3348, pruned_loss=0.1029, over 21551.00 frames. ], tot_loss[loss=0.3096, simple_loss=0.3694, pruned_loss=0.1249, over 4288134.51 frames. 
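The optim.py entries ("Clipping_scale=2.0, grad-norm quartiles ..., threshold=..., percent-clipped=...") describe how the gradient-clipping threshold is chosen on the fly: the five numbers read like the min/25%/50%/75%/max of recently observed gradient norms, and the printed threshold is the clipping scale times the median (in the 23:28:38 entry above, 2.0 × 4.151e+02 ≈ 8.301e+02). Below is a hedged stand-in for that behaviour, not the actual optim.py implementation; the class and attribute names are invented.

```python
# Hedged sketch: clip gradients at clipping_scale * median of recent grad
# norms, and track the "percent-clipped" statistic seen in the log.
from collections import deque
import torch

class MedianGradClipper:
    def __init__(self, clipping_scale: float = 2.0, history: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)      # recent total grad norms
        self.clipped = deque(maxlen=history)    # 1.0 where clipping fired

    def step(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        norm = torch.cat([p.grad.detach().flatten() for p in params]).norm().item()
        if self.norms:
            median = float(torch.tensor(list(self.norms)).median())
            threshold = self.clipping_scale * median
        else:
            threshold = float("inf")             # no history yet: do not clip
        if norm > threshold:
            for p in params:
                p.grad.mul_(threshold / norm)
        self.clipped.append(float(norm > threshold))
        self.norms.append(norm)
        return 100.0 * sum(self.clipped) / len(self.clipped)  # percent-clipped
```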
], batch size: 112, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:32:01,959 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.061e+02 3.714e+02 4.494e+02 1.254e+03, threshold=7.429e+02, percent-clipped=1.0 2023-06-18 23:32:09,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=330330.0, ans=0.2 2023-06-18 23:32:59,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=330450.0, ans=0.2 2023-06-18 23:33:22,629 INFO [train.py:996] (0/4) Epoch 2, batch 24600, loss[loss=0.2629, simple_loss=0.3076, pruned_loss=0.1092, over 21899.00 frames. ], tot_loss[loss=0.3075, simple_loss=0.3635, pruned_loss=0.1258, over 4289818.46 frames. ], batch size: 113, lr: 1.51e-02, grad_scale: 64.0 2023-06-18 23:33:30,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=330570.0, ans=0.0 2023-06-18 23:33:57,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=330630.0, ans=0.125 2023-06-18 23:34:16,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=330690.0, ans=0.1 2023-06-18 23:35:00,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=330870.0, ans=0.125 2023-06-18 23:35:01,700 INFO [train.py:996] (0/4) Epoch 2, batch 24650, loss[loss=0.3014, simple_loss=0.3592, pruned_loss=0.1218, over 21278.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3524, pruned_loss=0.1225, over 4289696.19 frames. ], batch size: 549, lr: 1.51e-02, grad_scale: 64.0 2023-06-18 23:35:05,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=330870.0, ans=0.125 2023-06-18 23:35:08,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=330870.0, ans=0.0 2023-06-18 23:35:19,500 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.192e+02 3.864e+02 5.203e+02 1.017e+03, threshold=7.727e+02, percent-clipped=5.0 2023-06-18 23:35:37,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=330990.0, ans=0.0 2023-06-18 23:36:00,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=330990.0, ans=0.125 2023-06-18 23:36:36,269 INFO [train.py:996] (0/4) Epoch 2, batch 24700, loss[loss=0.2714, simple_loss=0.3254, pruned_loss=0.1087, over 21598.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3513, pruned_loss=0.1211, over 4279246.33 frames. ], batch size: 247, lr: 1.51e-02, grad_scale: 64.0 2023-06-18 23:37:00,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=331230.0, ans=0.125 2023-06-18 23:37:05,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=331230.0, ans=0.2 2023-06-18 23:37:17,468 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.76 vs. 
limit=15.0 2023-06-18 23:38:13,940 INFO [train.py:996] (0/4) Epoch 2, batch 24750, loss[loss=0.2628, simple_loss=0.3024, pruned_loss=0.1116, over 21272.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3431, pruned_loss=0.1166, over 4280772.63 frames. ], batch size: 551, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:38:33,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.320e+02 3.050e+02 3.889e+02 4.934e+02 8.372e+02, threshold=7.777e+02, percent-clipped=3.0 2023-06-18 23:39:14,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=331650.0, ans=0.125 2023-06-18 23:39:17,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=331650.0, ans=0.125 2023-06-18 23:39:33,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=12.0 2023-06-18 23:39:47,040 INFO [train.py:996] (0/4) Epoch 2, batch 24800, loss[loss=0.3073, simple_loss=0.3464, pruned_loss=0.1341, over 21802.00 frames. ], tot_loss[loss=0.2861, simple_loss=0.3394, pruned_loss=0.1164, over 4284853.55 frames. ], batch size: 282, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:40:30,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=331830.0, ans=0.125 2023-06-18 23:40:30,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=331830.0, ans=0.125 2023-06-18 23:40:46,865 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.47 vs. limit=12.0 2023-06-18 23:40:57,519 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:41:26,063 INFO [train.py:996] (0/4) Epoch 2, batch 24850, loss[loss=0.3206, simple_loss=0.3765, pruned_loss=0.1323, over 21734.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3416, pruned_loss=0.119, over 4294464.44 frames. ], batch size: 389, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:41:28,046 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:41:29,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=332070.0, ans=0.125 2023-06-18 23:41:50,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 3.323e+02 4.351e+02 5.576e+02 8.938e+02, threshold=8.701e+02, percent-clipped=5.0 2023-06-18 23:41:50,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=332130.0, ans=0.125 2023-06-18 23:41:52,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=332130.0, ans=0.125 2023-06-18 23:42:09,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.76 vs. 
limit=22.5 2023-06-18 23:42:16,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=332190.0, ans=0.2 2023-06-18 23:42:52,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=332310.0, ans=0.125 2023-06-18 23:42:57,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=332310.0, ans=0.125 2023-06-18 23:43:09,989 INFO [train.py:996] (0/4) Epoch 2, batch 24900, loss[loss=0.368, simple_loss=0.4078, pruned_loss=0.1641, over 21764.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3481, pruned_loss=0.1222, over 4292483.56 frames. ], batch size: 441, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:43:12,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=332370.0, ans=0.125 2023-06-18 23:44:05,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=332490.0, ans=0.125 2023-06-18 23:44:23,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2023-06-18 23:44:55,508 INFO [train.py:996] (0/4) Epoch 2, batch 24950, loss[loss=0.2626, simple_loss=0.302, pruned_loss=0.1116, over 20171.00 frames. ], tot_loss[loss=0.3043, simple_loss=0.3555, pruned_loss=0.1266, over 4285685.22 frames. ], batch size: 703, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:45:09,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=332670.0, ans=0.125 2023-06-18 23:45:14,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=332670.0, ans=0.125 2023-06-18 23:45:15,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 3.427e+02 4.669e+02 5.544e+02 9.304e+02, threshold=9.338e+02, percent-clipped=1.0 2023-06-18 23:45:16,464 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-18 23:45:34,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332730.0, ans=0.1 2023-06-18 23:45:42,292 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-18 23:46:40,049 INFO [train.py:996] (0/4) Epoch 2, batch 25000, loss[loss=0.3249, simple_loss=0.3707, pruned_loss=0.1396, over 21338.00 frames. ], tot_loss[loss=0.3083, simple_loss=0.3607, pruned_loss=0.1279, over 4284888.54 frames. ], batch size: 471, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:46:41,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.58 vs. 
limit=12.0 2023-06-18 23:46:48,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=332970.0, ans=0.2 2023-06-18 23:47:57,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=333210.0, ans=0.125 2023-06-18 23:48:12,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=333210.0, ans=0.125 2023-06-18 23:48:17,015 INFO [train.py:996] (0/4) Epoch 2, batch 25050, loss[loss=0.3242, simple_loss=0.352, pruned_loss=0.1482, over 21597.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3541, pruned_loss=0.1259, over 4283661.94 frames. ], batch size: 415, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:48:36,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 3.031e+02 3.621e+02 4.496e+02 7.145e+02, threshold=7.242e+02, percent-clipped=0.0 2023-06-18 23:48:55,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=333390.0, ans=0.1 2023-06-18 23:49:07,505 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.16 vs. limit=15.0 2023-06-18 23:49:13,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=333450.0, ans=0.1 2023-06-18 23:49:18,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.95 vs. limit=10.0 2023-06-18 23:49:47,784 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=22.5 2023-06-18 23:49:55,965 INFO [train.py:996] (0/4) Epoch 2, batch 25100, loss[loss=0.2671, simple_loss=0.3125, pruned_loss=0.1108, over 21333.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3482, pruned_loss=0.1231, over 4275647.59 frames. ], batch size: 144, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:49:56,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=333570.0, ans=0.0 2023-06-18 23:49:57,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=333570.0, ans=0.125 2023-06-18 23:49:59,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=333570.0, ans=0.125 2023-06-18 23:50:02,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=333570.0, ans=0.05 2023-06-18 23:51:32,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=12.0 2023-06-18 23:51:33,448 INFO [train.py:996] (0/4) Epoch 2, batch 25150, loss[loss=0.2917, simple_loss=0.3751, pruned_loss=0.1041, over 21361.00 frames. ], tot_loss[loss=0.2943, simple_loss=0.3507, pruned_loss=0.1189, over 4281839.45 frames. 
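The recurring "ScheduledFloat: name=..., batch_count=..., ans=..." lines show that many regularisation hyper-parameters (dropout probabilities, balancer probs, skip rates, bypass scale minima) are functions of the global batch count rather than constants, interpolated between scheduled breakpoints. The sketch below is only an assumed stand-in for that idea; the breakpoints and class name are invented, not taken from scaling.py.

```python
# Assumed illustration: a piecewise-linear schedule keyed on batch_count,
# in the spirit of the ScheduledFloat values printed above.
import bisect

class PiecewiseLinearSchedule:
    def __init__(self, *points):              # e.g. (0, 0.3), (20000, 0.1)
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

dropout_p = PiecewiseLinearSchedule((0, 0.3), (20000, 0.1))
print(dropout_p(0.0), dropout_p(327630.0))    # 0.3 at start, 0.1 after 20k batches
```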
], batch size: 548, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:51:48,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 3.003e+02 3.483e+02 4.487e+02 9.549e+02, threshold=6.965e+02, percent-clipped=3.0 2023-06-18 23:52:03,342 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=22.5 2023-06-18 23:52:11,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-18 23:52:14,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.57 vs. limit=22.5 2023-06-18 23:52:21,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=22.5 2023-06-18 23:53:10,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=334170.0, ans=0.0 2023-06-18 23:53:11,755 INFO [train.py:996] (0/4) Epoch 2, batch 25200, loss[loss=0.2574, simple_loss=0.3418, pruned_loss=0.08649, over 21785.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3478, pruned_loss=0.115, over 4282455.98 frames. ], batch size: 282, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:53:51,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=334290.0, ans=0.125 2023-06-18 23:54:09,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=334350.0, ans=0.07 2023-06-18 23:54:27,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=334410.0, ans=0.125 2023-06-18 23:54:39,797 INFO [train.py:996] (0/4) Epoch 2, batch 25250, loss[loss=0.3111, simple_loss=0.3496, pruned_loss=0.1363, over 21282.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3433, pruned_loss=0.1122, over 4251444.23 frames. ], batch size: 471, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:55:04,423 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.756e+02 3.618e+02 4.524e+02 8.260e+02, threshold=7.237e+02, percent-clipped=4.0 2023-06-18 23:55:08,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2023-06-18 23:55:54,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=334650.0, ans=0.2 2023-06-18 23:56:24,663 INFO [train.py:996] (0/4) Epoch 2, batch 25300, loss[loss=0.2933, simple_loss=0.3354, pruned_loss=0.1256, over 20128.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3422, pruned_loss=0.1124, over 4261460.76 frames. 
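The "Whitening: name=..., metric=X vs. limit=Y" lines compare a statistic measuring how far a module's channel covariance is from a multiple of the identity against a scheduled limit; large metrics such as the ones flagged above indicate strongly non-white activations for those modules. The exact statistic is defined in scaling.py; the sketch below uses one plausible choice, d · trace(C²) / trace(C)², which equals 1.0 for perfectly white features, purely as an assumed illustration.

```python
# Assumed illustration of a "whitening metric": 1.0 when the channel
# covariance is proportional to the identity, larger as eigenvalues spread.
# Not the actual statistic in scaling.py.
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations of one module."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]               # (C, C) covariance
    d = cov.shape[0]
    return float(d * (cov @ cov).diagonal().sum() / cov.diagonal().sum() ** 2)

white = torch.randn(1000, 256)                   # metric close to 1
skewed = white * torch.linspace(0.1, 3.0, 256)   # metric well above 1
print(whitening_metric(white), whitening_metric(skewed))
```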
], batch size: 703, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:56:55,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=334830.0, ans=0.025 2023-06-18 23:56:57,272 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:57:18,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=334890.0, ans=0.125 2023-06-18 23:57:25,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=334950.0, ans=0.1 2023-06-18 23:57:41,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=334950.0, ans=0.125 2023-06-18 23:58:09,829 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-18 23:58:10,293 INFO [train.py:996] (0/4) Epoch 2, batch 25350, loss[loss=0.2587, simple_loss=0.3253, pruned_loss=0.09602, over 21789.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3453, pruned_loss=0.112, over 4264381.67 frames. ], batch size: 317, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:58:29,564 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.861e+02 3.471e+02 4.257e+02 9.448e+02, threshold=6.941e+02, percent-clipped=2.0 2023-06-18 23:59:37,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=335310.0, ans=0.0 2023-06-18 23:59:38,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=335310.0, ans=0.125 2023-06-18 23:59:43,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=335370.0, ans=0.125 2023-06-18 23:59:44,362 INFO [train.py:996] (0/4) Epoch 2, batch 25400, loss[loss=0.304, simple_loss=0.3463, pruned_loss=0.1309, over 21774.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3406, pruned_loss=0.111, over 4257185.63 frames. ], batch size: 371, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 00:00:32,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=335490.0, ans=0.125 2023-06-19 00:01:22,831 INFO [train.py:996] (0/4) Epoch 2, batch 25450, loss[loss=0.3225, simple_loss=0.3853, pruned_loss=0.1299, over 21818.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3417, pruned_loss=0.1133, over 4264355.13 frames. ], batch size: 414, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 00:01:47,349 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.942e+02 3.491e+02 4.451e+02 7.396e+02, threshold=6.982e+02, percent-clipped=1.0 2023-06-19 00:02:06,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335790.0, ans=0.1 2023-06-19 00:02:54,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=335910.0, ans=0.2 2023-06-19 00:03:08,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.80 vs. 
limit=5.0 2023-06-19 00:03:09,115 INFO [train.py:996] (0/4) Epoch 2, batch 25500, loss[loss=0.3025, simple_loss=0.3631, pruned_loss=0.1209, over 21350.00 frames. ], tot_loss[loss=0.2799, simple_loss=0.3411, pruned_loss=0.1093, over 4260645.20 frames. ], batch size: 159, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 00:03:19,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=335970.0, ans=0.2 2023-06-19 00:03:20,772 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-56000.pt 2023-06-19 00:03:32,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-19 00:03:56,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=336090.0, ans=0.125 2023-06-19 00:04:56,959 INFO [train.py:996] (0/4) Epoch 2, batch 25550, loss[loss=0.2443, simple_loss=0.3249, pruned_loss=0.08181, over 21452.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3475, pruned_loss=0.1095, over 4265406.73 frames. ], batch size: 194, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:04:59,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=336270.0, ans=0.0 2023-06-19 00:05:06,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=336270.0, ans=0.125 2023-06-19 00:05:12,170 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.591e+02 3.118e+02 3.638e+02 5.445e+02, threshold=6.236e+02, percent-clipped=0.0 2023-06-19 00:05:19,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=336330.0, ans=0.125 2023-06-19 00:05:26,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.28 vs. limit=10.0 2023-06-19 00:05:35,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=336390.0, ans=0.1 2023-06-19 00:05:35,737 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-06-19 00:05:36,928 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:05:48,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=336390.0, ans=0.125 2023-06-19 00:05:51,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=336450.0, ans=0.125 2023-06-19 00:06:24,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=336510.0, ans=0.125 2023-06-19 00:06:38,379 INFO [train.py:996] (0/4) Epoch 2, batch 25600, loss[loss=0.2875, simple_loss=0.3861, pruned_loss=0.09445, over 19849.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3534, pruned_loss=0.1112, over 4266540.03 frames. 
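The checkpoint.py entry above records one of the periodic batch-count checkpoints written under the experiment directory (checkpoint-56000.pt), in addition to the per-epoch saves. A minimal, hedged sketch of that kind of every-N-batches save follows; the helper name and the exact contents of the saved dict are illustrative, not the recipe's checkpoint.py.

```python
# Hedged sketch of batch-count checkpointing, as suggested by the
# "Saving checkpoint to .../checkpoint-56000.pt" line. Names are illustrative.
from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                          exp_dir: Path, save_every_n: int) -> None:
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        path,
    )
```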
], batch size: 702, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:06:48,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=336570.0, ans=0.125 2023-06-19 00:07:07,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=336630.0, ans=0.125 2023-06-19 00:08:08,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=336810.0, ans=0.0 2023-06-19 00:08:10,948 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-19 00:08:17,783 INFO [train.py:996] (0/4) Epoch 2, batch 25650, loss[loss=0.2983, simple_loss=0.3403, pruned_loss=0.1282, over 21874.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3538, pruned_loss=0.1138, over 4262462.53 frames. ], batch size: 98, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:08:31,778 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.010e+02 3.647e+02 4.694e+02 1.135e+03, threshold=7.294e+02, percent-clipped=6.0 2023-06-19 00:08:35,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=336930.0, ans=0.1 2023-06-19 00:08:36,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=336930.0, ans=0.125 2023-06-19 00:09:27,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-19 00:09:35,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=337110.0, ans=0.125 2023-06-19 00:09:57,167 INFO [train.py:996] (0/4) Epoch 2, batch 25700, loss[loss=0.2526, simple_loss=0.3262, pruned_loss=0.08953, over 21616.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3513, pruned_loss=0.1152, over 4269559.63 frames. ], batch size: 230, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:10:11,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=337230.0, ans=0.2 2023-06-19 00:11:38,913 INFO [train.py:996] (0/4) Epoch 2, batch 25750, loss[loss=0.3609, simple_loss=0.4011, pruned_loss=0.1604, over 21611.00 frames. ], tot_loss[loss=0.2997, simple_loss=0.358, pruned_loss=0.1206, over 4268224.86 frames. ], batch size: 230, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:11:43,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337470.0, ans=0.1 2023-06-19 00:11:45,692 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. limit=10.0 2023-06-19 00:11:54,361 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 3.020e+02 3.881e+02 5.422e+02 1.342e+03, threshold=7.762e+02, percent-clipped=9.0 2023-06-19 00:12:09,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=337530.0, ans=0.05 2023-06-19 00:12:47,871 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. 
limit=15.0 2023-06-19 00:13:06,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.71 vs. limit=10.0 2023-06-19 00:13:26,830 INFO [train.py:996] (0/4) Epoch 2, batch 25800, loss[loss=0.3211, simple_loss=0.3816, pruned_loss=0.1303, over 21304.00 frames. ], tot_loss[loss=0.3121, simple_loss=0.3716, pruned_loss=0.1263, over 4269113.53 frames. ], batch size: 548, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:13:48,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=337830.0, ans=0.2 2023-06-19 00:15:02,353 INFO [train.py:996] (0/4) Epoch 2, batch 25850, loss[loss=0.2706, simple_loss=0.3243, pruned_loss=0.1085, over 21508.00 frames. ], tot_loss[loss=0.3121, simple_loss=0.3724, pruned_loss=0.1259, over 4268661.31 frames. ], batch size: 131, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:15:11,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=338070.0, ans=0.1 2023-06-19 00:15:24,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=338070.0, ans=0.125 2023-06-19 00:15:26,952 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 3.253e+02 3.802e+02 4.832e+02 7.273e+02, threshold=7.603e+02, percent-clipped=0.0 2023-06-19 00:16:10,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=338250.0, ans=0.04949747468305833 2023-06-19 00:16:13,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=338250.0, ans=0.1 2023-06-19 00:16:30,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=338310.0, ans=0.0 2023-06-19 00:16:53,764 INFO [train.py:996] (0/4) Epoch 2, batch 25900, loss[loss=0.3631, simple_loss=0.4387, pruned_loss=0.1438, over 21717.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3732, pruned_loss=0.1269, over 4268852.82 frames. ], batch size: 389, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:16:55,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=338370.0, ans=0.05 2023-06-19 00:17:36,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=338490.0, ans=0.125 2023-06-19 00:17:40,428 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.31 vs. 
limit=15.0 2023-06-19 00:17:44,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=338490.0, ans=0.2 2023-06-19 00:17:44,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=338490.0, ans=0.125 2023-06-19 00:17:48,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=338550.0, ans=0.0 2023-06-19 00:17:52,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=338550.0, ans=0.035 2023-06-19 00:18:13,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=338610.0, ans=0.125 2023-06-19 00:18:27,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=338610.0, ans=0.1 2023-06-19 00:18:31,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=22.5 2023-06-19 00:18:39,815 INFO [train.py:996] (0/4) Epoch 2, batch 25950, loss[loss=0.2956, simple_loss=0.3629, pruned_loss=0.1142, over 21622.00 frames. ], tot_loss[loss=0.3199, simple_loss=0.3799, pruned_loss=0.13, over 4268868.65 frames. ], batch size: 263, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:18:52,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=338670.0, ans=0.125 2023-06-19 00:18:54,228 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 3.185e+02 3.771e+02 4.566e+02 7.877e+02, threshold=7.541e+02, percent-clipped=2.0 2023-06-19 00:19:28,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=338850.0, ans=0.125 2023-06-19 00:20:20,785 INFO [train.py:996] (0/4) Epoch 2, batch 26000, loss[loss=0.377, simple_loss=0.4285, pruned_loss=0.1627, over 21681.00 frames. ], tot_loss[loss=0.3174, simple_loss=0.3788, pruned_loss=0.1281, over 4271205.21 frames. ], batch size: 351, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:20:44,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=339030.0, ans=0.2 2023-06-19 00:20:47,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=339030.0, ans=0.1 2023-06-19 00:20:58,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=339090.0, ans=0.125 2023-06-19 00:21:18,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=339150.0, ans=0.1 2023-06-19 00:22:00,599 INFO [train.py:996] (0/4) Epoch 2, batch 26050, loss[loss=0.3183, simple_loss=0.363, pruned_loss=0.1368, over 21907.00 frames. ], tot_loss[loss=0.3207, simple_loss=0.3802, pruned_loss=0.1306, over 4272615.75 frames. 
], batch size: 113, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:22:14,507 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.427e+02 3.293e+02 3.871e+02 4.573e+02 8.054e+02, threshold=7.741e+02, percent-clipped=1.0 2023-06-19 00:22:16,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=339330.0, ans=0.125 2023-06-19 00:22:29,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=339330.0, ans=0.125 2023-06-19 00:23:38,911 INFO [train.py:996] (0/4) Epoch 2, batch 26100, loss[loss=0.296, simple_loss=0.3393, pruned_loss=0.1263, over 21726.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3736, pruned_loss=0.129, over 4278317.28 frames. ], batch size: 473, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:24:11,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=339690.0, ans=0.125 2023-06-19 00:25:02,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=339810.0, ans=0.1 2023-06-19 00:25:20,029 INFO [train.py:996] (0/4) Epoch 2, batch 26150, loss[loss=0.3257, simple_loss=0.3629, pruned_loss=0.1443, over 21794.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3704, pruned_loss=0.1283, over 4282626.67 frames. ], batch size: 282, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:25:33,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=339870.0, ans=0.0 2023-06-19 00:25:34,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.250e+02 3.966e+02 5.249e+02 8.349e+02, threshold=7.932e+02, percent-clipped=3.0 2023-06-19 00:25:56,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=339990.0, ans=0.1 2023-06-19 00:26:33,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=340050.0, ans=0.125 2023-06-19 00:26:41,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=340050.0, ans=0.07 2023-06-19 00:26:54,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=340110.0, ans=0.125 2023-06-19 00:27:00,539 INFO [train.py:996] (0/4) Epoch 2, batch 26200, loss[loss=0.2675, simple_loss=0.3517, pruned_loss=0.09163, over 21227.00 frames. ], tot_loss[loss=0.3091, simple_loss=0.3692, pruned_loss=0.1245, over 4281102.27 frames. 
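Each tot_loss report above carries an "over N frames" count hovering around 4.27-4.29 million frames, which suggests the figure is a frame-weighted aggregate over a window of recent batches rather than a single-batch value. The sketch below shows one way such a tracker could work; the window size and class name are assumptions, not the recipe's actual bookkeeping.

```python
# Assumed illustration: frame-weighted aggregate of per-frame losses over a
# sliding window of recent batches, producing (tot_loss, "over N frames").
from collections import deque

class RunningLoss:
    def __init__(self, window: int = 200):
        self.items = deque(maxlen=window)     # (loss_sum, num_frames) per batch

    def update(self, loss_per_frame: float, num_frames: float) -> None:
        self.items.append((loss_per_frame * num_frames, num_frames))

    def average(self):
        loss_sum = sum(s for s, _ in self.items)
        frames = sum(f for _, f in self.items)
        return loss_sum / frames, frames      # -> (tot_loss, frames in window)
```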
], batch size: 176, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:27:01,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=340170.0, ans=0.0 2023-06-19 00:27:33,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=340230.0, ans=0.125 2023-06-19 00:27:41,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=340290.0, ans=10.0 2023-06-19 00:28:02,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=340290.0, ans=0.1 2023-06-19 00:28:07,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=340350.0, ans=0.0 2023-06-19 00:28:26,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.09 vs. limit=15.0 2023-06-19 00:28:39,151 INFO [train.py:996] (0/4) Epoch 2, batch 26250, loss[loss=0.3226, simple_loss=0.3785, pruned_loss=0.1334, over 21882.00 frames. ], tot_loss[loss=0.3103, simple_loss=0.3732, pruned_loss=0.1237, over 4276391.73 frames. ], batch size: 414, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:28:54,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.946e+02 3.629e+02 4.371e+02 7.049e+02, threshold=7.257e+02, percent-clipped=0.0 2023-06-19 00:29:19,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=340530.0, ans=0.0 2023-06-19 00:29:58,524 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:30:08,516 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=22.5 2023-06-19 00:30:09,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=340710.0, ans=0.125 2023-06-19 00:30:16,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340710.0, ans=0.1 2023-06-19 00:30:20,195 INFO [train.py:996] (0/4) Epoch 2, batch 26300, loss[loss=0.2469, simple_loss=0.3002, pruned_loss=0.09677, over 21204.00 frames. ], tot_loss[loss=0.3096, simple_loss=0.3706, pruned_loss=0.1242, over 4281596.01 frames. ], batch size: 608, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:30:45,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=340830.0, ans=0.04949747468305833 2023-06-19 00:30:56,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=340830.0, ans=0.125 2023-06-19 00:30:58,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=340830.0, ans=0.125 2023-06-19 00:31:12,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=340890.0, ans=0.125 2023-06-19 00:31:14,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. 
limit=6.0 2023-06-19 00:31:33,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340950.0, ans=0.1 2023-06-19 00:31:53,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=341010.0, ans=0.125 2023-06-19 00:31:59,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=341070.0, ans=0.0 2023-06-19 00:31:59,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=341070.0, ans=0.0 2023-06-19 00:32:00,730 INFO [train.py:996] (0/4) Epoch 2, batch 26350, loss[loss=0.3174, simple_loss=0.3757, pruned_loss=0.1295, over 21682.00 frames. ], tot_loss[loss=0.3098, simple_loss=0.3692, pruned_loss=0.1252, over 4283840.39 frames. ], batch size: 351, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:32:15,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=341070.0, ans=0.0 2023-06-19 00:32:29,477 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.280e+02 3.077e+02 3.703e+02 4.775e+02 7.605e+02, threshold=7.406e+02, percent-clipped=1.0 2023-06-19 00:33:00,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=341190.0, ans=0.0 2023-06-19 00:33:38,868 INFO [train.py:996] (0/4) Epoch 2, batch 26400, loss[loss=0.2517, simple_loss=0.3036, pruned_loss=0.0999, over 21827.00 frames. ], tot_loss[loss=0.3053, simple_loss=0.3618, pruned_loss=0.1245, over 4284028.99 frames. ], batch size: 102, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:33:54,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=341370.0, ans=0.0 2023-06-19 00:34:10,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=341430.0, ans=0.125 2023-06-19 00:34:36,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=341490.0, ans=0.1 2023-06-19 00:35:04,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=341550.0, ans=0.125 2023-06-19 00:35:10,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=341610.0, ans=0.125 2023-06-19 00:35:24,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.76 vs. limit=15.0 2023-06-19 00:35:33,198 INFO [train.py:996] (0/4) Epoch 2, batch 26450, loss[loss=0.3361, simple_loss=0.4206, pruned_loss=0.1258, over 21725.00 frames. ], tot_loss[loss=0.3064, simple_loss=0.364, pruned_loss=0.1244, over 4284425.48 frames. 
], batch size: 351, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:35:57,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=341670.0, ans=0.1 2023-06-19 00:35:58,245 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 3.117e+02 3.740e+02 5.003e+02 1.177e+03, threshold=7.481e+02, percent-clipped=6.0 2023-06-19 00:36:22,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=341790.0, ans=0.125 2023-06-19 00:36:33,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=341850.0, ans=0.2 2023-06-19 00:37:03,610 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=22.5 2023-06-19 00:37:11,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=341910.0, ans=0.125 2023-06-19 00:37:20,209 INFO [train.py:996] (0/4) Epoch 2, batch 26500, loss[loss=0.2647, simple_loss=0.3332, pruned_loss=0.0981, over 21568.00 frames. ], tot_loss[loss=0.3045, simple_loss=0.3647, pruned_loss=0.1222, over 4273058.45 frames. ], batch size: 263, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:38:31,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=342150.0, ans=0.0 2023-06-19 00:38:32,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=342150.0, ans=0.0 2023-06-19 00:39:03,699 INFO [train.py:996] (0/4) Epoch 2, batch 26550, loss[loss=0.3085, simple_loss=0.3885, pruned_loss=0.1143, over 21477.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3594, pruned_loss=0.1182, over 4275780.38 frames. ], batch size: 471, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:39:19,869 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.448e+02 4.337e+02 5.433e+02 9.319e+02, threshold=8.673e+02, percent-clipped=7.0 2023-06-19 00:39:22,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=342330.0, ans=0.2 2023-06-19 00:39:22,679 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-19 00:39:45,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=342390.0, ans=0.125 2023-06-19 00:40:42,959 INFO [train.py:996] (0/4) Epoch 2, batch 26600, loss[loss=0.2828, simple_loss=0.3444, pruned_loss=0.1106, over 21755.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3581, pruned_loss=0.1148, over 4267963.99 frames. ], batch size: 351, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:41:16,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.52 vs. 
limit=22.5 2023-06-19 00:41:56,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=342750.0, ans=0.2 2023-06-19 00:42:07,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=342810.0, ans=0.2 2023-06-19 00:42:07,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=342810.0, ans=0.125 2023-06-19 00:42:22,735 INFO [train.py:996] (0/4) Epoch 2, batch 26650, loss[loss=0.2084, simple_loss=0.2949, pruned_loss=0.06094, over 21853.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3508, pruned_loss=0.1134, over 4263108.61 frames. ], batch size: 352, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:42:24,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=342870.0, ans=0.2 2023-06-19 00:42:38,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 3.189e+02 3.874e+02 5.287e+02 9.951e+02, threshold=7.747e+02, percent-clipped=1.0 2023-06-19 00:42:48,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=342930.0, ans=0.125 2023-06-19 00:42:52,981 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-19 00:43:50,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=343110.0, ans=0.125 2023-06-19 00:43:56,115 INFO [train.py:996] (0/4) Epoch 2, batch 26700, loss[loss=0.3359, simple_loss=0.3784, pruned_loss=0.1467, over 21831.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.3421, pruned_loss=0.1088, over 4272393.37 frames. ], batch size: 441, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:44:13,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=343230.0, ans=0.0 2023-06-19 00:45:05,031 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-19 00:45:05,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=343350.0, ans=0.1 2023-06-19 00:45:18,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=343350.0, ans=0.1 2023-06-19 00:45:37,250 INFO [train.py:996] (0/4) Epoch 2, batch 26750, loss[loss=0.2481, simple_loss=0.3149, pruned_loss=0.0907, over 21633.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3423, pruned_loss=0.1083, over 4275766.57 frames. 
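The grad_scale field in the loss entries drifts over this span (32.0 and briefly 64.0 earlier, 16.0 in the entries above), which is the usual signature of dynamic fp16 loss scaling: the scale is cut when gradients overflow and grown back after a run of clean steps. The sketch below illustrates that policy only; the constants and class name are invented, and the real training loop presumably relies on its own scaler.

```python
# Hedged sketch of dynamic loss scaling consistent with the changing
# "grad_scale" values in the log (e.g. 32.0 -> 64.0 -> 16.0).
class DynamicGradScale:
    def __init__(self, init_scale: float = 32.0,
                 growth_interval: int = 1000, factor: float = 2.0):
        self.scale = init_scale
        self.factor = factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_inf: bool) -> float:
        if found_inf:
            self.scale /= self.factor        # overflow: back off
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.growth_interval == 0:
                self.scale *= self.factor    # long clean run: grow again
        return self.scale
```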
], batch size: 230, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:45:37,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=343470.0, ans=0.125 2023-06-19 00:45:39,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=343470.0, ans=0.125 2023-06-19 00:45:41,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=343470.0, ans=0.125 2023-06-19 00:45:58,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.751e+02 3.226e+02 3.870e+02 9.468e+02, threshold=6.452e+02, percent-clipped=0.0 2023-06-19 00:46:31,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=343590.0, ans=0.125 2023-06-19 00:47:18,780 INFO [train.py:996] (0/4) Epoch 2, batch 26800, loss[loss=0.371, simple_loss=0.436, pruned_loss=0.153, over 21433.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3536, pruned_loss=0.1159, over 4276617.09 frames. ], batch size: 131, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:47:19,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=343770.0, ans=0.2 2023-06-19 00:47:35,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.10 vs. limit=5.0 2023-06-19 00:47:55,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=343830.0, ans=0.125 2023-06-19 00:47:58,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=343830.0, ans=0.125 2023-06-19 00:48:10,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=343890.0, ans=0.125 2023-06-19 00:48:58,644 INFO [train.py:996] (0/4) Epoch 2, batch 26850, loss[loss=0.3403, simple_loss=0.369, pruned_loss=0.1558, over 21558.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3558, pruned_loss=0.1199, over 4279781.92 frames. ], batch size: 441, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:49:29,090 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.578e+02 3.525e+02 4.180e+02 5.123e+02 1.126e+03, threshold=8.361e+02, percent-clipped=11.0 2023-06-19 00:49:40,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=344130.0, ans=0.0 2023-06-19 00:50:12,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=344250.0, ans=0.125 2023-06-19 00:50:37,548 INFO [train.py:996] (0/4) Epoch 2, batch 26900, loss[loss=0.254, simple_loss=0.3024, pruned_loss=0.1028, over 21579.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.347, pruned_loss=0.1181, over 4272343.42 frames. 
], batch size: 298, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:51:07,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=344430.0, ans=0.0 2023-06-19 00:51:18,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=344490.0, ans=0.0 2023-06-19 00:52:12,729 INFO [train.py:996] (0/4) Epoch 2, batch 26950, loss[loss=0.3158, simple_loss=0.3782, pruned_loss=0.1267, over 21731.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3473, pruned_loss=0.1186, over 4269535.80 frames. ], batch size: 351, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:52:36,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=344670.0, ans=0.0 2023-06-19 00:52:43,530 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.137e+02 3.675e+02 4.908e+02 1.022e+03, threshold=7.351e+02, percent-clipped=1.0 2023-06-19 00:53:02,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=344790.0, ans=0.0 2023-06-19 00:53:12,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=344790.0, ans=0.2 2023-06-19 00:53:54,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-19 00:54:08,190 INFO [train.py:996] (0/4) Epoch 2, batch 27000, loss[loss=0.306, simple_loss=0.3759, pruned_loss=0.1181, over 21539.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3488, pruned_loss=0.1164, over 4273836.88 frames. ], batch size: 441, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:54:08,191 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 00:54:25,473 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2623, simple_loss=0.361, pruned_loss=0.08186, over 1796401.00 frames. 2023-06-19 00:54:25,474 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-19 00:54:34,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=344970.0, ans=0.125 2023-06-19 00:56:01,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=345210.0, ans=0.125 2023-06-19 00:56:06,869 INFO [train.py:996] (0/4) Epoch 2, batch 27050, loss[loss=0.2685, simple_loss=0.3504, pruned_loss=0.09328, over 21463.00 frames. ], tot_loss[loss=0.2856, simple_loss=0.3488, pruned_loss=0.1112, over 4273335.76 frames. ], batch size: 211, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:56:23,035 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.868e+02 3.454e+02 4.544e+02 1.088e+03, threshold=6.909e+02, percent-clipped=2.0 2023-06-19 00:56:29,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.71 vs. 
limit=15.0 2023-06-19 00:56:30,381 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:56:33,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=345330.0, ans=0.125 2023-06-19 00:56:36,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=345330.0, ans=0.2 2023-06-19 00:56:38,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=345390.0, ans=0.125 2023-06-19 00:57:14,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=345450.0, ans=0.04949747468305833 2023-06-19 00:57:32,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=345510.0, ans=0.125 2023-06-19 00:57:37,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=345510.0, ans=0.125 2023-06-19 00:57:43,865 INFO [train.py:996] (0/4) Epoch 2, batch 27100, loss[loss=0.3061, simple_loss=0.376, pruned_loss=0.1182, over 21204.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3517, pruned_loss=0.1131, over 4270937.06 frames. ], batch size: 143, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:57:58,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=345570.0, ans=0.2 2023-06-19 00:58:24,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=345690.0, ans=0.125 2023-06-19 00:58:32,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=345690.0, ans=0.125 2023-06-19 00:59:20,188 INFO [train.py:996] (0/4) Epoch 2, batch 27150, loss[loss=0.361, simple_loss=0.438, pruned_loss=0.142, over 21675.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.3635, pruned_loss=0.1167, over 4280055.86 frames. ], batch size: 414, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 00:59:20,712 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:59:22,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=345870.0, ans=0.125 2023-06-19 00:59:36,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.470e+02 4.217e+02 5.291e+02 1.062e+03, threshold=8.433e+02, percent-clipped=9.0 2023-06-19 00:59:56,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=345990.0, ans=0.0 2023-06-19 01:00:38,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=346110.0, ans=0.0 2023-06-19 01:00:55,729 INFO [train.py:996] (0/4) Epoch 2, batch 27200, loss[loss=0.3992, simple_loss=0.4406, pruned_loss=0.1789, over 21443.00 frames. ], tot_loss[loss=0.3076, simple_loss=0.3731, pruned_loss=0.121, over 4279825.11 frames. ], batch size: 471, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:01:50,604 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.19 vs. 
limit=6.0 2023-06-19 01:02:26,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=346410.0, ans=0.04949747468305833 2023-06-19 01:02:39,325 INFO [train.py:996] (0/4) Epoch 2, batch 27250, loss[loss=0.3187, simple_loss=0.3765, pruned_loss=0.1304, over 21586.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3762, pruned_loss=0.1263, over 4275439.97 frames. ], batch size: 389, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:02:59,706 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.364e+02 3.070e+02 3.624e+02 4.371e+02 7.633e+02, threshold=7.247e+02, percent-clipped=0.0 2023-06-19 01:03:24,227 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:03:25,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=346590.0, ans=0.125 2023-06-19 01:03:43,010 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:03:43,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=346650.0, ans=0.2 2023-06-19 01:03:43,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=346650.0, ans=0.125 2023-06-19 01:03:59,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=346710.0, ans=0.125 2023-06-19 01:04:15,861 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:04:21,662 INFO [train.py:996] (0/4) Epoch 2, batch 27300, loss[loss=0.3371, simple_loss=0.3995, pruned_loss=0.1373, over 21689.00 frames. ], tot_loss[loss=0.3172, simple_loss=0.3781, pruned_loss=0.1281, over 4273898.39 frames. ], batch size: 351, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:04:54,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=346830.0, ans=0.125 2023-06-19 01:06:09,255 INFO [train.py:996] (0/4) Epoch 2, batch 27350, loss[loss=0.2942, simple_loss=0.3592, pruned_loss=0.1146, over 21890.00 frames. ], tot_loss[loss=0.319, simple_loss=0.3801, pruned_loss=0.1289, over 4277965.96 frames. ], batch size: 371, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:06:35,166 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.251e+02 3.372e+02 3.927e+02 4.716e+02 8.245e+02, threshold=7.854e+02, percent-clipped=1.0 2023-06-19 01:06:38,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=347130.0, ans=0.0 2023-06-19 01:06:39,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0 2023-06-19 01:06:45,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=347130.0, ans=0.125 2023-06-19 01:06:53,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.46 vs. 
limit=6.0 2023-06-19 01:07:17,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5 2023-06-19 01:07:33,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=347310.0, ans=0.125 2023-06-19 01:07:33,908 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=22.5 2023-06-19 01:07:46,361 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-06-19 01:07:47,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=347370.0, ans=0.1 2023-06-19 01:07:48,546 INFO [train.py:996] (0/4) Epoch 2, batch 27400, loss[loss=0.3094, simple_loss=0.3538, pruned_loss=0.1325, over 21532.00 frames. ], tot_loss[loss=0.3168, simple_loss=0.3763, pruned_loss=0.1286, over 4275862.10 frames. ], batch size: 389, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:07:57,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=347370.0, ans=0.125 2023-06-19 01:08:07,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=347370.0, ans=0.1 2023-06-19 01:08:35,681 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:08:57,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=347550.0, ans=0.125 2023-06-19 01:09:29,492 INFO [train.py:996] (0/4) Epoch 2, batch 27450, loss[loss=0.3033, simple_loss=0.3711, pruned_loss=0.1177, over 21426.00 frames. ], tot_loss[loss=0.3086, simple_loss=0.3673, pruned_loss=0.1249, over 4278206.60 frames. ], batch size: 471, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:09:37,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=347670.0, ans=0.0 2023-06-19 01:09:45,224 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.415e+02 3.960e+02 5.232e+02 1.053e+03, threshold=7.919e+02, percent-clipped=3.0 2023-06-19 01:10:52,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=347910.0, ans=0.125 2023-06-19 01:11:08,113 INFO [train.py:996] (0/4) Epoch 2, batch 27500, loss[loss=0.3035, simple_loss=0.355, pruned_loss=0.126, over 21857.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.3658, pruned_loss=0.1256, over 4280481.86 frames. ], batch size: 414, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:11:43,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=348090.0, ans=0.0 2023-06-19 01:11:47,377 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.85 vs. 
limit=15.0 2023-06-19 01:11:54,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=348150.0, ans=0.2 2023-06-19 01:12:47,547 INFO [train.py:996] (0/4) Epoch 2, batch 27550, loss[loss=0.2415, simple_loss=0.3118, pruned_loss=0.08556, over 21560.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3582, pruned_loss=0.1203, over 4284118.68 frames. ], batch size: 230, lr: 1.47e-02, grad_scale: 16.0 2023-06-19 01:12:47,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=348270.0, ans=0.2 2023-06-19 01:13:05,021 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.210e+02 3.897e+02 4.749e+02 7.014e+02, threshold=7.795e+02, percent-clipped=0.0 2023-06-19 01:13:10,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=348330.0, ans=0.125 2023-06-19 01:13:49,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=348450.0, ans=0.125 2023-06-19 01:13:54,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=348450.0, ans=0.125 2023-06-19 01:14:01,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=348510.0, ans=0.07 2023-06-19 01:14:29,344 INFO [train.py:996] (0/4) Epoch 2, batch 27600, loss[loss=0.2632, simple_loss=0.3269, pruned_loss=0.09973, over 21337.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3524, pruned_loss=0.1197, over 4279014.66 frames. ], batch size: 194, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:14:59,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=348690.0, ans=0.0 2023-06-19 01:15:06,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=348690.0, ans=0.125 2023-06-19 01:15:25,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=348750.0, ans=0.125 2023-06-19 01:15:36,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=348810.0, ans=0.0 2023-06-19 01:16:03,371 INFO [train.py:996] (0/4) Epoch 2, batch 27650, loss[loss=0.3215, simple_loss=0.3836, pruned_loss=0.1297, over 21611.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3462, pruned_loss=0.1184, over 4276700.19 frames. ], batch size: 389, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:16:25,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.734e+02 3.535e+02 4.526e+02 5.756e+02 1.207e+03, threshold=9.051e+02, percent-clipped=8.0 2023-06-19 01:16:37,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=348930.0, ans=0.125 2023-06-19 01:16:42,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=348990.0, ans=0.125 2023-06-19 01:17:00,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=349050.0, ans=12.0 2023-06-19 01:17:11,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.05 vs. 
limit=15.0 2023-06-19 01:17:48,133 INFO [train.py:996] (0/4) Epoch 2, batch 27700, loss[loss=0.2227, simple_loss=0.2911, pruned_loss=0.07715, over 16500.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3451, pruned_loss=0.1153, over 4264077.72 frames. ], batch size: 61, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:18:08,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=349230.0, ans=0.125 2023-06-19 01:18:40,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=349350.0, ans=0.0 2023-06-19 01:18:46,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=349350.0, ans=10.0 2023-06-19 01:19:13,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=349410.0, ans=0.1 2023-06-19 01:19:27,731 INFO [train.py:996] (0/4) Epoch 2, batch 27750, loss[loss=0.2549, simple_loss=0.3413, pruned_loss=0.08426, over 21831.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.348, pruned_loss=0.1154, over 4267863.70 frames. ], batch size: 371, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:19:39,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=349470.0, ans=0.0 2023-06-19 01:19:45,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.361e+02 3.964e+02 5.094e+02 9.268e+02, threshold=7.928e+02, percent-clipped=1.0 2023-06-19 01:19:58,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=349590.0, ans=0.125 2023-06-19 01:21:06,431 INFO [train.py:996] (0/4) Epoch 2, batch 27800, loss[loss=0.3043, simple_loss=0.329, pruned_loss=0.1398, over 20257.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.346, pruned_loss=0.1152, over 4272190.39 frames. ], batch size: 703, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:21:31,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.85 vs. limit=10.0 2023-06-19 01:21:33,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=349830.0, ans=0.125 2023-06-19 01:22:47,548 INFO [train.py:996] (0/4) Epoch 2, batch 27850, loss[loss=0.2729, simple_loss=0.3491, pruned_loss=0.09833, over 21770.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.348, pruned_loss=0.1186, over 4280089.52 frames. 
], batch size: 247, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:22:49,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=350070.0, ans=0.125 2023-06-19 01:23:06,199 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.389e+02 4.361e+02 6.049e+02 9.596e+02, threshold=8.723e+02, percent-clipped=7.0 2023-06-19 01:23:40,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=350190.0, ans=0.2 2023-06-19 01:23:42,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350250.0, ans=0.1 2023-06-19 01:24:23,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=350310.0, ans=0.2 2023-06-19 01:24:30,954 INFO [train.py:996] (0/4) Epoch 2, batch 27900, loss[loss=0.3018, simple_loss=0.3779, pruned_loss=0.1129, over 21608.00 frames. ], tot_loss[loss=0.298, simple_loss=0.357, pruned_loss=0.1195, over 4280315.51 frames. ], batch size: 230, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:24:39,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=350370.0, ans=0.0 2023-06-19 01:24:45,246 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-19 01:25:32,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=350490.0, ans=0.0 2023-06-19 01:25:55,607 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=15.0 2023-06-19 01:26:04,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=350610.0, ans=0.125 2023-06-19 01:26:13,849 INFO [train.py:996] (0/4) Epoch 2, batch 27950, loss[loss=0.2368, simple_loss=0.3162, pruned_loss=0.07876, over 21575.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3583, pruned_loss=0.1156, over 4280750.22 frames. ], batch size: 230, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:26:14,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=350670.0, ans=0.1 2023-06-19 01:26:17,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=350670.0, ans=0.0 2023-06-19 01:26:32,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 3.236e+02 3.896e+02 4.908e+02 8.483e+02, threshold=7.791e+02, percent-clipped=0.0 2023-06-19 01:26:35,937 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:26:55,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=350790.0, ans=0.125 2023-06-19 01:27:07,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.07 vs. 
limit=15.0 2023-06-19 01:27:22,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=350850.0, ans=0.125 2023-06-19 01:27:36,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=350910.0, ans=0.0 2023-06-19 01:27:53,667 INFO [train.py:996] (0/4) Epoch 2, batch 28000, loss[loss=0.3189, simple_loss=0.3698, pruned_loss=0.1341, over 21844.00 frames. ], tot_loss[loss=0.2889, simple_loss=0.3542, pruned_loss=0.1118, over 4287134.66 frames. ], batch size: 414, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:28:14,146 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-19 01:28:47,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=12.0 2023-06-19 01:29:09,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=351150.0, ans=0.0 2023-06-19 01:29:24,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=351210.0, ans=0.2 2023-06-19 01:29:35,451 INFO [train.py:996] (0/4) Epoch 2, batch 28050, loss[loss=0.2618, simple_loss=0.2981, pruned_loss=0.1128, over 20263.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3508, pruned_loss=0.1131, over 4284774.58 frames. ], batch size: 703, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:29:55,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=22.5 2023-06-19 01:29:57,853 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.799e+02 3.165e+02 3.817e+02 7.021e+02, threshold=6.330e+02, percent-clipped=0.0 2023-06-19 01:30:19,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2023-06-19 01:30:34,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=351390.0, ans=0.0 2023-06-19 01:30:48,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=351450.0, ans=0.0 2023-06-19 01:31:15,449 INFO [train.py:996] (0/4) Epoch 2, batch 28100, loss[loss=0.3092, simple_loss=0.3436, pruned_loss=0.1374, over 21779.00 frames. ], tot_loss[loss=0.2856, simple_loss=0.3461, pruned_loss=0.1125, over 4282101.95 frames. 
], batch size: 118, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:31:46,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=351630.0, ans=0.125 2023-06-19 01:32:02,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=351690.0, ans=0.125 2023-06-19 01:32:08,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=351690.0, ans=0.125 2023-06-19 01:32:16,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=351750.0, ans=0.125 2023-06-19 01:32:21,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=351750.0, ans=0.0 2023-06-19 01:32:54,278 INFO [train.py:996] (0/4) Epoch 2, batch 28150, loss[loss=0.2641, simple_loss=0.3079, pruned_loss=0.1101, over 21615.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3392, pruned_loss=0.1125, over 4282918.70 frames. ], batch size: 298, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:33:11,891 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.407e+02 3.356e+02 3.949e+02 5.361e+02 1.113e+03, threshold=7.898e+02, percent-clipped=11.0 2023-06-19 01:33:42,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=351990.0, ans=0.125 2023-06-19 01:34:29,685 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.11 vs. limit=15.0 2023-06-19 01:34:29,911 INFO [train.py:996] (0/4) Epoch 2, batch 28200, loss[loss=0.2667, simple_loss=0.3092, pruned_loss=0.1121, over 20684.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3392, pruned_loss=0.1153, over 4277766.99 frames. ], batch size: 607, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:34:32,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=352170.0, ans=0.1 2023-06-19 01:34:35,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=352170.0, ans=0.125 2023-06-19 01:34:35,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=352170.0, ans=0.1 2023-06-19 01:34:58,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=352230.0, ans=0.0 2023-06-19 01:35:17,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=352290.0, ans=0.2 2023-06-19 01:35:23,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.32 vs. limit=10.0 2023-06-19 01:36:10,825 INFO [train.py:996] (0/4) Epoch 2, batch 28250, loss[loss=0.2841, simple_loss=0.3345, pruned_loss=0.1168, over 21196.00 frames. ], tot_loss[loss=0.2935, simple_loss=0.347, pruned_loss=0.12, over 4277956.44 frames. 
], batch size: 176, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:36:38,456 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.575e+02 3.660e+02 4.283e+02 5.277e+02 9.711e+02, threshold=8.566e+02, percent-clipped=2.0 2023-06-19 01:36:57,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.11 vs. limit=6.0 2023-06-19 01:37:12,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-06-19 01:37:40,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=352710.0, ans=0.125 2023-06-19 01:37:51,597 INFO [train.py:996] (0/4) Epoch 2, batch 28300, loss[loss=0.2298, simple_loss=0.3172, pruned_loss=0.07123, over 21626.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3434, pruned_loss=0.1164, over 4273644.47 frames. ], batch size: 389, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:39:17,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=353010.0, ans=0.04949747468305833 2023-06-19 01:39:31,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=353010.0, ans=0.0 2023-06-19 01:39:44,107 INFO [train.py:996] (0/4) Epoch 2, batch 28350, loss[loss=0.2298, simple_loss=0.2905, pruned_loss=0.08449, over 21572.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3387, pruned_loss=0.1096, over 4262474.48 frames. ], batch size: 263, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:40:07,608 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.865e+02 3.652e+02 5.364e+02 1.153e+03, threshold=7.304e+02, percent-clipped=2.0 2023-06-19 01:41:30,029 INFO [train.py:996] (0/4) Epoch 2, batch 28400, loss[loss=0.3086, simple_loss=0.354, pruned_loss=0.1316, over 21223.00 frames. ], tot_loss[loss=0.2786, simple_loss=0.3359, pruned_loss=0.1107, over 4257599.48 frames. ], batch size: 159, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:41:37,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=353370.0, ans=0.07 2023-06-19 01:41:51,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=353430.0, ans=0.0 2023-06-19 01:42:20,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=353490.0, ans=0.2 2023-06-19 01:43:09,886 INFO [train.py:996] (0/4) Epoch 2, batch 28450, loss[loss=0.3166, simple_loss=0.3618, pruned_loss=0.1357, over 21251.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3447, pruned_loss=0.1164, over 4263766.50 frames. ], batch size: 143, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:43:27,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.368e+02 4.115e+02 5.202e+02 1.060e+03, threshold=8.231e+02, percent-clipped=7.0 2023-06-19 01:44:45,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=353910.0, ans=0.125 2023-06-19 01:44:50,985 INFO [train.py:996] (0/4) Epoch 2, batch 28500, loss[loss=0.3781, simple_loss=0.4182, pruned_loss=0.169, over 21678.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.346, pruned_loss=0.1182, over 4271044.73 frames. 
], batch size: 415, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:45:05,619 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.60 vs. limit=6.0 2023-06-19 01:45:27,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=354030.0, ans=0.025 2023-06-19 01:46:13,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=354150.0, ans=0.0 2023-06-19 01:46:15,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=354150.0, ans=0.0 2023-06-19 01:46:16,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=354210.0, ans=0.1 2023-06-19 01:46:34,632 INFO [train.py:996] (0/4) Epoch 2, batch 28550, loss[loss=0.3016, simple_loss=0.3862, pruned_loss=0.1085, over 21620.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3559, pruned_loss=0.1216, over 4281463.77 frames. ], batch size: 230, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:46:52,898 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.021e+02 3.809e+02 4.877e+02 1.502e+03, threshold=7.617e+02, percent-clipped=6.0 2023-06-19 01:47:08,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=354330.0, ans=0.0 2023-06-19 01:47:16,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=354390.0, ans=0.2 2023-06-19 01:48:09,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=12.0 2023-06-19 01:48:11,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=354510.0, ans=0.125 2023-06-19 01:48:17,830 INFO [train.py:996] (0/4) Epoch 2, batch 28600, loss[loss=0.3688, simple_loss=0.4095, pruned_loss=0.164, over 21789.00 frames. ], tot_loss[loss=0.305, simple_loss=0.3622, pruned_loss=0.1239, over 4276862.96 frames. ], batch size: 441, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:48:21,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=354570.0, ans=0.125 2023-06-19 01:49:45,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=354810.0, ans=0.125 2023-06-19 01:49:51,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.07 vs. limit=15.0 2023-06-19 01:49:53,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=354810.0, ans=0.0 2023-06-19 01:49:58,554 INFO [train.py:996] (0/4) Epoch 2, batch 28650, loss[loss=0.3028, simple_loss=0.3461, pruned_loss=0.1297, over 21691.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.356, pruned_loss=0.1228, over 4280099.84 frames. 
], batch size: 333, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:50:18,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=354930.0, ans=0.1 2023-06-19 01:50:21,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.376e+02 3.990e+02 4.916e+02 8.510e+02, threshold=7.981e+02, percent-clipped=3.0 2023-06-19 01:50:23,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=354930.0, ans=0.1 2023-06-19 01:50:34,134 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0 2023-06-19 01:51:02,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=354990.0, ans=0.0 2023-06-19 01:51:02,602 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-19 01:51:35,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=355110.0, ans=0.125 2023-06-19 01:51:38,429 INFO [train.py:996] (0/4) Epoch 2, batch 28700, loss[loss=0.3041, simple_loss=0.3558, pruned_loss=0.1263, over 21606.00 frames. ], tot_loss[loss=0.3019, simple_loss=0.3557, pruned_loss=0.124, over 4282394.94 frames. ], batch size: 230, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:52:34,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=355290.0, ans=0.125 2023-06-19 01:52:43,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=355290.0, ans=0.125 2023-06-19 01:52:57,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=355350.0, ans=0.0 2023-06-19 01:53:18,233 INFO [train.py:996] (0/4) Epoch 2, batch 28750, loss[loss=0.3284, simple_loss=0.3651, pruned_loss=0.1459, over 21744.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.3553, pruned_loss=0.1245, over 4281224.22 frames. ], batch size: 507, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 01:53:46,509 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 3.065e+02 3.653e+02 4.286e+02 6.736e+02, threshold=7.306e+02, percent-clipped=0.0 2023-06-19 01:53:47,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-19 01:54:18,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.97 vs. limit=5.0 2023-06-19 01:54:43,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355710.0, ans=0.1 2023-06-19 01:54:58,865 INFO [train.py:996] (0/4) Epoch 2, batch 28800, loss[loss=0.3457, simple_loss=0.3973, pruned_loss=0.1471, over 21785.00 frames. ], tot_loss[loss=0.3055, simple_loss=0.3603, pruned_loss=0.1254, over 4285326.41 frames. 
], batch size: 441, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 01:55:01,417 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:56:06,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=355950.0, ans=0.125 2023-06-19 01:56:08,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=355950.0, ans=0.125 2023-06-19 01:56:21,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-19 01:56:32,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=356010.0, ans=0.125 2023-06-19 01:56:40,239 INFO [train.py:996] (0/4) Epoch 2, batch 28850, loss[loss=0.2916, simple_loss=0.3386, pruned_loss=0.1223, over 21122.00 frames. ], tot_loss[loss=0.3095, simple_loss=0.3626, pruned_loss=0.1282, over 4294103.13 frames. ], batch size: 607, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 01:57:12,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.103e+02 3.762e+02 4.499e+02 8.286e+02, threshold=7.524e+02, percent-clipped=2.0 2023-06-19 01:58:26,604 INFO [train.py:996] (0/4) Epoch 2, batch 28900, loss[loss=0.4115, simple_loss=0.4399, pruned_loss=0.1915, over 21664.00 frames. ], tot_loss[loss=0.313, simple_loss=0.3657, pruned_loss=0.1302, over 4297295.26 frames. ], batch size: 414, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 01:58:37,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=356370.0, ans=0.0 2023-06-19 01:58:39,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=356370.0, ans=0.1 2023-06-19 01:58:55,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=356430.0, ans=0.125 2023-06-19 01:59:13,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=356490.0, ans=0.0 2023-06-19 01:59:17,720 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.66 vs. limit=22.5 2023-06-19 01:59:26,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=356550.0, ans=0.125 2023-06-19 02:00:19,035 INFO [train.py:996] (0/4) Epoch 2, batch 28950, loss[loss=0.3114, simple_loss=0.4024, pruned_loss=0.1102, over 20717.00 frames. ], tot_loss[loss=0.312, simple_loss=0.3667, pruned_loss=0.1286, over 4282793.53 frames. ], batch size: 607, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:00:37,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 3.281e+02 4.028e+02 5.318e+02 1.006e+03, threshold=8.055e+02, percent-clipped=4.0 2023-06-19 02:00:44,898 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-19 02:01:13,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=356850.0, ans=0.125 2023-06-19 02:02:01,668 INFO [train.py:996] (0/4) Epoch 2, batch 29000, loss[loss=0.328, simple_loss=0.3876, pruned_loss=0.1342, over 21777.00 frames. 
], tot_loss[loss=0.3129, simple_loss=0.3711, pruned_loss=0.1274, over 4280017.12 frames. ], batch size: 332, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:02:36,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=357090.0, ans=0.0 2023-06-19 02:02:44,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=357090.0, ans=0.0 2023-06-19 02:03:16,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=357150.0, ans=0.2 2023-06-19 02:03:24,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=357150.0, ans=0.125 2023-06-19 02:03:44,470 INFO [train.py:996] (0/4) Epoch 2, batch 29050, loss[loss=0.2885, simple_loss=0.344, pruned_loss=0.1165, over 21341.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3697, pruned_loss=0.1286, over 4284731.86 frames. ], batch size: 143, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:03:46,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=357270.0, ans=0.125 2023-06-19 02:03:52,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=357270.0, ans=0.2 2023-06-19 02:03:57,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=357270.0, ans=0.1 2023-06-19 02:04:02,382 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 3.182e+02 3.653e+02 4.375e+02 6.472e+02, threshold=7.306e+02, percent-clipped=0.0 2023-06-19 02:04:16,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-19 02:05:22,863 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:05:25,367 INFO [train.py:996] (0/4) Epoch 2, batch 29100, loss[loss=0.2356, simple_loss=0.2941, pruned_loss=0.08852, over 21749.00 frames. ], tot_loss[loss=0.3047, simple_loss=0.3591, pruned_loss=0.1251, over 4275743.57 frames. ], batch size: 351, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:05:31,461 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-19 02:05:38,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=357570.0, ans=0.2 2023-06-19 02:06:30,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=357750.0, ans=0.1 2023-06-19 02:06:44,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=357750.0, ans=0.2 2023-06-19 02:07:04,300 INFO [train.py:996] (0/4) Epoch 2, batch 29150, loss[loss=0.3461, simple_loss=0.3908, pruned_loss=0.1507, over 21386.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3551, pruned_loss=0.1217, over 4267850.52 frames. 
], batch size: 471, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:07:06,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=357870.0, ans=0.0 2023-06-19 02:07:11,947 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-06-19 02:07:17,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-19 02:07:21,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 3.617e+02 4.258e+02 5.180e+02 9.047e+02, threshold=8.516e+02, percent-clipped=9.0 2023-06-19 02:07:27,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=357930.0, ans=0.0 2023-06-19 02:07:27,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=357930.0, ans=0.0 2023-06-19 02:08:09,872 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. limit=12.0 2023-06-19 02:08:39,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=358110.0, ans=0.0 2023-06-19 02:08:44,218 INFO [train.py:996] (0/4) Epoch 2, batch 29200, loss[loss=0.231, simple_loss=0.2913, pruned_loss=0.08535, over 21398.00 frames. ], tot_loss[loss=0.2947, simple_loss=0.3499, pruned_loss=0.1197, over 4261539.56 frames. ], batch size: 131, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:09:29,317 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.03 vs. limit=5.0 2023-06-19 02:10:00,652 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-19 02:10:24,009 INFO [train.py:996] (0/4) Epoch 2, batch 29250, loss[loss=0.2981, simple_loss=0.3701, pruned_loss=0.113, over 21752.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3488, pruned_loss=0.1168, over 4266477.49 frames. ], batch size: 282, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:10:36,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358470.0, ans=0.1 2023-06-19 02:10:46,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.694e+02 3.473e+02 5.021e+02 8.866e+02, threshold=6.946e+02, percent-clipped=1.0 2023-06-19 02:11:24,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=358590.0, ans=0.125 2023-06-19 02:11:37,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=358650.0, ans=0.125 2023-06-19 02:12:04,059 INFO [train.py:996] (0/4) Epoch 2, batch 29300, loss[loss=0.3422, simple_loss=0.3791, pruned_loss=0.1526, over 21542.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3499, pruned_loss=0.1154, over 4266579.74 frames. 
], batch size: 441, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:12:46,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=358830.0, ans=0.07 2023-06-19 02:12:56,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=358890.0, ans=0.05 2023-06-19 02:13:44,857 INFO [train.py:996] (0/4) Epoch 2, batch 29350, loss[loss=0.2499, simple_loss=0.2949, pruned_loss=0.1024, over 21147.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3439, pruned_loss=0.1146, over 4265087.02 frames. ], batch size: 159, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:14:13,783 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 2.964e+02 3.404e+02 4.114e+02 7.296e+02, threshold=6.809e+02, percent-clipped=1.0 2023-06-19 02:14:18,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=359130.0, ans=0.125 2023-06-19 02:14:40,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=359190.0, ans=0.125 2023-06-19 02:15:00,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=359250.0, ans=0.125 2023-06-19 02:15:22,042 INFO [train.py:996] (0/4) Epoch 2, batch 29400, loss[loss=0.3348, simple_loss=0.3951, pruned_loss=0.1372, over 21468.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3444, pruned_loss=0.1122, over 4262656.46 frames. ], batch size: 509, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:15:40,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=359370.0, ans=0.5 2023-06-19 02:15:56,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=359430.0, ans=0.125 2023-06-19 02:16:22,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=359490.0, ans=0.0 2023-06-19 02:16:48,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=359610.0, ans=0.015 2023-06-19 02:17:03,085 INFO [train.py:996] (0/4) Epoch 2, batch 29450, loss[loss=0.3404, simple_loss=0.3895, pruned_loss=0.1456, over 21757.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3411, pruned_loss=0.1102, over 4270149.17 frames. ], batch size: 332, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:17:25,756 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 3.078e+02 3.690e+02 4.615e+02 7.103e+02, threshold=7.380e+02, percent-clipped=1.0 2023-06-19 02:17:39,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=15.0 2023-06-19 02:18:37,446 INFO [train.py:996] (0/4) Epoch 2, batch 29500, loss[loss=0.2549, simple_loss=0.3086, pruned_loss=0.1006, over 21176.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3478, pruned_loss=0.1152, over 4269024.38 frames. 
], batch size: 608, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:18:45,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=359970.0, ans=0.0 2023-06-19 02:18:53,664 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-60000.pt 2023-06-19 02:18:57,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=359970.0, ans=0.1 2023-06-19 02:19:06,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=360030.0, ans=0.125 2023-06-19 02:19:25,848 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=15.0 2023-06-19 02:19:31,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.33 vs. limit=10.0 2023-06-19 02:19:51,976 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:19:55,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=360210.0, ans=0.04949747468305833 2023-06-19 02:20:17,243 INFO [train.py:996] (0/4) Epoch 2, batch 29550, loss[loss=0.2999, simple_loss=0.3506, pruned_loss=0.1246, over 21887.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3463, pruned_loss=0.1165, over 4273648.62 frames. ], batch size: 316, lr: 1.45e-02, grad_scale: 64.0 2023-06-19 02:20:49,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 3.148e+02 3.536e+02 4.853e+02 9.360e+02, threshold=7.072e+02, percent-clipped=2.0 2023-06-19 02:21:02,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.88 vs. limit=15.0 2023-06-19 02:21:11,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=360390.0, ans=0.0 2023-06-19 02:21:24,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=360450.0, ans=0.125 2023-06-19 02:21:37,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=360510.0, ans=0.05 2023-06-19 02:22:10,035 INFO [train.py:996] (0/4) Epoch 2, batch 29600, loss[loss=0.2716, simple_loss=0.3139, pruned_loss=0.1146, over 20310.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3537, pruned_loss=0.1202, over 4280128.75 frames. ], batch size: 703, lr: 1.44e-02, grad_scale: 64.0 2023-06-19 02:22:10,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=360570.0, ans=0.2 2023-06-19 02:22:46,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=360690.0, ans=0.0 2023-06-19 02:22:56,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=360690.0, ans=0.2 2023-06-19 02:23:43,640 INFO [train.py:996] (0/4) Epoch 2, batch 29650, loss[loss=0.2355, simple_loss=0.2964, pruned_loss=0.0873, over 21833.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3503, pruned_loss=0.1164, over 4282266.20 frames. 
], batch size: 282, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:23:50,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=360870.0, ans=0.125 2023-06-19 02:23:59,098 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-06-19 02:24:07,298 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 3.035e+02 3.587e+02 4.924e+02 8.544e+02, threshold=7.175e+02, percent-clipped=8.0 2023-06-19 02:24:07,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360930.0, ans=0.1 2023-06-19 02:24:45,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=361050.0, ans=0.0 2023-06-19 02:25:23,752 INFO [train.py:996] (0/4) Epoch 2, batch 29700, loss[loss=0.3062, simple_loss=0.4041, pruned_loss=0.1041, over 21759.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3531, pruned_loss=0.1175, over 4283396.50 frames. ], batch size: 298, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:25:24,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=361170.0, ans=0.125 2023-06-19 02:25:37,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=361170.0, ans=0.125 2023-06-19 02:25:42,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=361170.0, ans=0.1 2023-06-19 02:25:57,692 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.18 vs. limit=15.0 2023-06-19 02:26:21,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=361350.0, ans=0.125 2023-06-19 02:27:02,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=361470.0, ans=0.125 2023-06-19 02:27:03,544 INFO [train.py:996] (0/4) Epoch 2, batch 29750, loss[loss=0.2707, simple_loss=0.3611, pruned_loss=0.09014, over 21834.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3592, pruned_loss=0.1176, over 4288274.25 frames. 
], batch size: 371, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:27:07,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=361470.0, ans=0.05 2023-06-19 02:27:27,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 3.188e+02 3.972e+02 5.342e+02 1.059e+03, threshold=7.944e+02, percent-clipped=5.0 2023-06-19 02:28:02,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=361650.0, ans=0.0 2023-06-19 02:28:17,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=361710.0, ans=0.0 2023-06-19 02:28:28,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=361710.0, ans=0.5 2023-06-19 02:28:35,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=361710.0, ans=0.09899494936611666 2023-06-19 02:28:42,378 INFO [train.py:996] (0/4) Epoch 2, batch 29800, loss[loss=0.2974, simple_loss=0.348, pruned_loss=0.1234, over 21893.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3598, pruned_loss=0.1175, over 4286306.59 frames. ], batch size: 414, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:30:22,326 INFO [train.py:996] (0/4) Epoch 2, batch 29850, loss[loss=0.3077, simple_loss=0.3599, pruned_loss=0.1278, over 21919.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3558, pruned_loss=0.1154, over 4282323.59 frames. ], batch size: 107, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:30:41,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=362130.0, ans=0.04949747468305833 2023-06-19 02:30:45,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=362130.0, ans=0.0 2023-06-19 02:30:46,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 2.912e+02 3.664e+02 4.469e+02 7.842e+02, threshold=7.327e+02, percent-clipped=0.0 2023-06-19 02:31:00,114 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-06-19 02:31:10,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=362190.0, ans=0.125 2023-06-19 02:32:05,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=362370.0, ans=0.0 2023-06-19 02:32:06,368 INFO [train.py:996] (0/4) Epoch 2, batch 29900, loss[loss=0.3242, simple_loss=0.3759, pruned_loss=0.1363, over 21302.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3558, pruned_loss=0.1177, over 4282314.89 frames. ], batch size: 159, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:33:24,479 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.76 vs. 
limit=15.0 2023-06-19 02:33:28,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=362610.0, ans=0.1 2023-06-19 02:33:33,362 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:33:49,230 INFO [train.py:996] (0/4) Epoch 2, batch 29950, loss[loss=0.2976, simple_loss=0.3507, pruned_loss=0.1222, over 20628.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.36, pruned_loss=0.1222, over 4277940.06 frames. ], batch size: 607, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:33:51,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=362670.0, ans=0.125 2023-06-19 02:33:58,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=362670.0, ans=0.2 2023-06-19 02:34:09,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 3.188e+02 4.013e+02 5.057e+02 1.029e+03, threshold=8.025e+02, percent-clipped=2.0 2023-06-19 02:34:10,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-19 02:34:37,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=362790.0, ans=0.0 2023-06-19 02:35:02,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=362850.0, ans=0.09899494936611666 2023-06-19 02:35:29,999 INFO [train.py:996] (0/4) Epoch 2, batch 30000, loss[loss=0.2958, simple_loss=0.36, pruned_loss=0.1157, over 21431.00 frames. ], tot_loss[loss=0.3036, simple_loss=0.3624, pruned_loss=0.1224, over 4281642.68 frames. ], batch size: 131, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:35:30,000 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 02:35:38,656 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.4219, 2.9321, 2.8822, 1.6547], device='cuda:0') 2023-06-19 02:35:47,465 INFO [train.py:1028] (0/4) Epoch 2, validation: loss=0.2693, simple_loss=0.3684, pruned_loss=0.08513, over 1796401.00 frames. 2023-06-19 02:35:47,466 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-19 02:35:51,903 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.57 vs. limit=22.5 2023-06-19 02:36:33,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=363030.0, ans=0.125 2023-06-19 02:36:48,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=363090.0, ans=0.04949747468305833 2023-06-19 02:37:36,935 INFO [train.py:996] (0/4) Epoch 2, batch 30050, loss[loss=0.3432, simple_loss=0.4314, pruned_loss=0.1275, over 21805.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3658, pruned_loss=0.1185, over 4284941.97 frames. 
], batch size: 371, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:38:06,069 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.812e+02 3.422e+02 4.683e+02 8.613e+02, threshold=6.845e+02, percent-clipped=2.0 2023-06-19 02:38:19,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=363330.0, ans=0.07 2023-06-19 02:38:19,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=363330.0, ans=0.1 2023-06-19 02:38:29,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=363390.0, ans=0.125 2023-06-19 02:38:32,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=363390.0, ans=0.0 2023-06-19 02:38:41,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=363450.0, ans=0.04949747468305833 2023-06-19 02:38:49,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=363450.0, ans=0.0 2023-06-19 02:38:50,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=363450.0, ans=0.1 2023-06-19 02:38:54,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=363510.0, ans=0.1 2023-06-19 02:39:14,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=363570.0, ans=0.0 2023-06-19 02:39:15,589 INFO [train.py:996] (0/4) Epoch 2, batch 30100, loss[loss=0.3167, simple_loss=0.3399, pruned_loss=0.1468, over 21306.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3633, pruned_loss=0.1182, over 4281396.17 frames. ], batch size: 507, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:39:46,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=363630.0, ans=0.125 2023-06-19 02:40:01,862 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.67 vs. limit=6.0 2023-06-19 02:40:09,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=363690.0, ans=0.0 2023-06-19 02:40:13,146 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-19 02:40:17,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=363750.0, ans=0.025 2023-06-19 02:40:33,030 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-19 02:41:02,516 INFO [train.py:996] (0/4) Epoch 2, batch 30150, loss[loss=0.3222, simple_loss=0.3743, pruned_loss=0.135, over 21563.00 frames. ], tot_loss[loss=0.2975, simple_loss=0.3572, pruned_loss=0.1189, over 4279117.07 frames. ], batch size: 389, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:41:15,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.49 vs. 
limit=15.0 2023-06-19 02:41:26,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=363870.0, ans=0.125 2023-06-19 02:41:32,558 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.181e+02 3.774e+02 4.610e+02 8.129e+02, threshold=7.548e+02, percent-clipped=2.0 2023-06-19 02:42:45,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=364110.0, ans=0.0 2023-06-19 02:42:56,644 INFO [train.py:996] (0/4) Epoch 2, batch 30200, loss[loss=0.3655, simple_loss=0.4226, pruned_loss=0.1542, over 21444.00 frames. ], tot_loss[loss=0.2991, simple_loss=0.3614, pruned_loss=0.1184, over 4282002.51 frames. ], batch size: 471, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:43:07,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=364170.0, ans=0.125 2023-06-19 02:43:22,244 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:44:00,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=364350.0, ans=0.125 2023-06-19 02:44:23,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=364410.0, ans=0.0 2023-06-19 02:44:26,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=364410.0, ans=0.125 2023-06-19 02:44:39,240 INFO [train.py:996] (0/4) Epoch 2, batch 30250, loss[loss=0.3188, simple_loss=0.389, pruned_loss=0.1243, over 21334.00 frames. ], tot_loss[loss=0.308, simple_loss=0.3709, pruned_loss=0.1225, over 4282043.33 frames. ], batch size: 159, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:44:58,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.179e+02 3.107e+02 3.710e+02 5.079e+02 9.516e+02, threshold=7.420e+02, percent-clipped=5.0 2023-06-19 02:45:05,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=364530.0, ans=0.1 2023-06-19 02:45:28,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=364590.0, ans=10.0 2023-06-19 02:46:03,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=364710.0, ans=0.125 2023-06-19 02:46:19,606 INFO [train.py:996] (0/4) Epoch 2, batch 30300, loss[loss=0.2562, simple_loss=0.3029, pruned_loss=0.1048, over 21228.00 frames. ], tot_loss[loss=0.3045, simple_loss=0.3661, pruned_loss=0.1214, over 4284028.45 frames. 
], batch size: 176, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:46:50,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=364830.0, ans=0.125 2023-06-19 02:47:21,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=364890.0, ans=0.125 2023-06-19 02:47:32,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=364950.0, ans=0.125 2023-06-19 02:48:03,273 INFO [train.py:996] (0/4) Epoch 2, batch 30350, loss[loss=0.2904, simple_loss=0.3497, pruned_loss=0.1156, over 21726.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.3647, pruned_loss=0.1218, over 4286909.13 frames. ], batch size: 298, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:48:20,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=365070.0, ans=0.125 2023-06-19 02:48:26,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 3.339e+02 3.934e+02 4.976e+02 9.196e+02, threshold=7.868e+02, percent-clipped=1.0 2023-06-19 02:48:30,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=365130.0, ans=0.0 2023-06-19 02:48:47,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=365190.0, ans=0.125 2023-06-19 02:48:49,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=365190.0, ans=0.0 2023-06-19 02:49:05,428 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=12.0 2023-06-19 02:49:14,098 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.11 vs. limit=6.0 2023-06-19 02:49:20,597 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.39 vs. limit=10.0 2023-06-19 02:49:31,191 INFO [train.py:996] (0/4) Epoch 2, batch 30400, loss[loss=0.2948, simple_loss=0.3226, pruned_loss=0.1335, over 20315.00 frames. ], tot_loss[loss=0.296, simple_loss=0.3551, pruned_loss=0.1184, over 4274673.63 frames. ], batch size: 703, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:49:31,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=365370.0, ans=0.0 2023-06-19 02:50:03,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=365490.0, ans=0.07 2023-06-19 02:50:04,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2023-06-19 02:50:22,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.22 vs. limit=10.0 2023-06-19 02:50:56,383 INFO [train.py:996] (0/4) Epoch 2, batch 30450, loss[loss=0.4133, simple_loss=0.5071, pruned_loss=0.1597, over 19934.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3593, pruned_loss=0.1198, over 4212203.23 frames. 
], batch size: 702, lr: 1.43e-02, grad_scale: 32.0 2023-06-19 02:51:11,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=365730.0, ans=0.125 2023-06-19 02:51:15,886 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.508e+02 4.343e+02 5.750e+02 8.532e+02 2.294e+03, threshold=1.150e+03, percent-clipped=29.0 2023-06-19 02:52:05,609 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/epoch-2.pt 2023-06-19 02:53:41,468 INFO [train.py:996] (0/4) Epoch 3, batch 0, loss[loss=0.3095, simple_loss=0.3593, pruned_loss=0.1298, over 21772.00 frames. ], tot_loss[loss=0.3095, simple_loss=0.3593, pruned_loss=0.1298, over 21772.00 frames. ], batch size: 102, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 02:53:41,469 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 02:53:57,716 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2735, simple_loss=0.3782, pruned_loss=0.08435, over 1796401.00 frames. 2023-06-19 02:53:57,717 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24320MB 2023-06-19 02:54:02,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=365934.0, ans=0.0 2023-06-19 02:54:06,088 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.22 vs. limit=6.0 2023-06-19 02:54:35,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=366054.0, ans=0.0 2023-06-19 02:54:41,751 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:55:21,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=366174.0, ans=0.125 2023-06-19 02:55:30,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=366174.0, ans=0.0 2023-06-19 02:55:32,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.64 vs. limit=10.0 2023-06-19 02:55:36,580 INFO [train.py:996] (0/4) Epoch 3, batch 50, loss[loss=0.3653, simple_loss=0.4334, pruned_loss=0.1486, over 20691.00 frames. ], tot_loss[loss=0.3108, simple_loss=0.3729, pruned_loss=0.1243, over 961366.82 frames. ], batch size: 607, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 02:55:40,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=366234.0, ans=0.125 2023-06-19 02:56:10,887 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 3.611e+02 4.559e+02 6.599e+02 1.492e+03, threshold=9.117e+02, percent-clipped=9.0 2023-06-19 02:56:44,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=366414.0, ans=0.125 2023-06-19 02:56:54,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=366474.0, ans=0.0 2023-06-19 02:57:15,640 INFO [train.py:996] (0/4) Epoch 3, batch 100, loss[loss=0.3872, simple_loss=0.4356, pruned_loss=0.1694, over 21481.00 frames. ], tot_loss[loss=0.3132, simple_loss=0.3794, pruned_loss=0.1235, over 1694312.29 frames. 
], batch size: 471, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 02:57:18,349 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-19 02:57:57,616 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:58:28,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.71 vs. limit=6.0 2023-06-19 02:58:51,810 INFO [train.py:996] (0/4) Epoch 3, batch 150, loss[loss=0.2968, simple_loss=0.3728, pruned_loss=0.1104, over 21787.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.3759, pruned_loss=0.1221, over 2269200.96 frames. ], batch size: 371, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 02:59:25,648 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.063e+02 3.532e+02 4.732e+02 9.517e+02, threshold=7.065e+02, percent-clipped=1.0 2023-06-19 03:00:03,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=367074.0, ans=0.1 2023-06-19 03:00:30,817 INFO [train.py:996] (0/4) Epoch 3, batch 200, loss[loss=0.3001, simple_loss=0.3843, pruned_loss=0.1079, over 21770.00 frames. ], tot_loss[loss=0.3063, simple_loss=0.3736, pruned_loss=0.1195, over 2720744.33 frames. ], batch size: 332, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 03:00:48,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=367194.0, ans=0.1 2023-06-19 03:01:24,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=367254.0, ans=0.1 2023-06-19 03:02:09,314 INFO [train.py:996] (0/4) Epoch 3, batch 250, loss[loss=0.333, simple_loss=0.3996, pruned_loss=0.1332, over 21635.00 frames. ], tot_loss[loss=0.3035, simple_loss=0.3702, pruned_loss=0.1184, over 3068506.57 frames. ], batch size: 414, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 03:02:26,105 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=12.0 2023-06-19 03:02:28,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=367494.0, ans=0.125 2023-06-19 03:02:38,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=367494.0, ans=0.0 2023-06-19 03:02:42,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.832e+02 3.615e+02 5.126e+02 8.493e+02, threshold=7.230e+02, percent-clipped=8.0 2023-06-19 03:03:49,711 INFO [train.py:996] (0/4) Epoch 3, batch 300, loss[loss=0.2333, simple_loss=0.2857, pruned_loss=0.09045, over 21321.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3645, pruned_loss=0.1172, over 3336308.73 frames. ], batch size: 131, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:04:05,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.34 vs. 
limit=12.0 2023-06-19 03:04:28,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=367854.0, ans=0.0 2023-06-19 03:04:28,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=367854.0, ans=0.025 2023-06-19 03:05:08,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=367974.0, ans=0.125 2023-06-19 03:05:31,317 INFO [train.py:996] (0/4) Epoch 3, batch 350, loss[loss=0.263, simple_loss=0.3179, pruned_loss=0.104, over 21871.00 frames. ], tot_loss[loss=0.2944, simple_loss=0.3583, pruned_loss=0.1153, over 3538763.66 frames. ], batch size: 373, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:05:39,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=368034.0, ans=0.07 2023-06-19 03:05:53,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=368094.0, ans=0.1 2023-06-19 03:06:01,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=368094.0, ans=0.125 2023-06-19 03:06:05,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.954e+02 3.445e+02 4.197e+02 6.448e+02, threshold=6.891e+02, percent-clipped=0.0 2023-06-19 03:06:08,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=368154.0, ans=0.0 2023-06-19 03:06:12,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=368154.0, ans=0.0 2023-06-19 03:06:37,271 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-06-19 03:06:38,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=368214.0, ans=0.125 2023-06-19 03:06:41,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=368214.0, ans=0.0 2023-06-19 03:06:49,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=368274.0, ans=0.125 2023-06-19 03:07:12,370 INFO [train.py:996] (0/4) Epoch 3, batch 400, loss[loss=0.2856, simple_loss=0.336, pruned_loss=0.1177, over 21891.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3495, pruned_loss=0.1123, over 3699169.66 frames. ], batch size: 107, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:07:20,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=368334.0, ans=0.125 2023-06-19 03:07:44,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=368394.0, ans=0.2 2023-06-19 03:07:44,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=368394.0, ans=0.0 2023-06-19 03:08:24,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=368514.0, ans=0.125 2023-06-19 03:08:53,090 INFO [train.py:996] (0/4) Epoch 3, batch 450, loss[loss=0.2453, simple_loss=0.3318, pruned_loss=0.07938, over 21575.00 frames. 
], tot_loss[loss=0.2841, simple_loss=0.3465, pruned_loss=0.1108, over 3830260.29 frames. ], batch size: 389, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:08:59,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=368634.0, ans=0.0 2023-06-19 03:09:27,056 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.891e+02 3.614e+02 4.402e+02 7.378e+02, threshold=7.228e+02, percent-clipped=3.0 2023-06-19 03:09:56,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=368814.0, ans=0.0 2023-06-19 03:10:05,846 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.82 vs. limit=15.0 2023-06-19 03:10:28,827 INFO [train.py:996] (0/4) Epoch 3, batch 500, loss[loss=0.2815, simple_loss=0.3526, pruned_loss=0.1052, over 21243.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3488, pruned_loss=0.1087, over 3932750.89 frames. ], batch size: 159, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:10:49,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=368994.0, ans=0.125 2023-06-19 03:11:27,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=369114.0, ans=0.2 2023-06-19 03:11:35,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=369114.0, ans=0.125 2023-06-19 03:12:08,140 INFO [train.py:996] (0/4) Epoch 3, batch 550, loss[loss=0.3681, simple_loss=0.4539, pruned_loss=0.1411, over 21632.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3555, pruned_loss=0.1093, over 4012320.07 frames. ], batch size: 441, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:12:10,772 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-19 03:12:46,960 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.117e+02 3.637e+02 4.984e+02 1.103e+03, threshold=7.274e+02, percent-clipped=1.0 2023-06-19 03:12:47,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=369294.0, ans=0.0 2023-06-19 03:13:47,849 INFO [train.py:996] (0/4) Epoch 3, batch 600, loss[loss=0.3269, simple_loss=0.4184, pruned_loss=0.1177, over 21784.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3587, pruned_loss=0.111, over 4080855.23 frames. ], batch size: 332, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:14:01,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=369534.0, ans=0.0 2023-06-19 03:14:11,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=369594.0, ans=0.125 2023-06-19 03:14:11,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=369594.0, ans=0.0 2023-06-19 03:14:17,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=369594.0, ans=0.0 2023-06-19 03:14:19,720 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.78 vs. 
limit=22.5 2023-06-19 03:15:28,818 INFO [train.py:996] (0/4) Epoch 3, batch 650, loss[loss=0.3002, simple_loss=0.4202, pruned_loss=0.09012, over 19863.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3575, pruned_loss=0.111, over 4112130.58 frames. ], batch size: 702, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:15:33,468 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.13 vs. limit=15.0 2023-06-19 03:16:02,417 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.260e+02 4.172e+02 5.495e+02 8.347e+02, threshold=8.343e+02, percent-clipped=4.0 2023-06-19 03:17:09,761 INFO [train.py:996] (0/4) Epoch 3, batch 700, loss[loss=0.2503, simple_loss=0.3151, pruned_loss=0.09272, over 21867.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3592, pruned_loss=0.1125, over 4150607.39 frames. ], batch size: 98, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:17:33,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=370194.0, ans=0.125 2023-06-19 03:18:49,216 INFO [train.py:996] (0/4) Epoch 3, batch 750, loss[loss=0.2573, simple_loss=0.3238, pruned_loss=0.09542, over 21473.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3579, pruned_loss=0.1131, over 4185067.83 frames. ], batch size: 211, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:19:14,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=370494.0, ans=0.2 2023-06-19 03:19:28,428 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 3.022e+02 3.507e+02 4.070e+02 7.167e+02, threshold=7.014e+02, percent-clipped=0.0 2023-06-19 03:19:38,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=370554.0, ans=0.125 2023-06-19 03:19:42,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=370554.0, ans=0.0 2023-06-19 03:19:48,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=370614.0, ans=0.2 2023-06-19 03:19:53,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=370614.0, ans=0.125 2023-06-19 03:20:23,734 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.79 vs. limit=6.0 2023-06-19 03:20:31,120 INFO [train.py:996] (0/4) Epoch 3, batch 800, loss[loss=0.299, simple_loss=0.3422, pruned_loss=0.1279, over 21260.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3542, pruned_loss=0.1131, over 4200069.16 frames. ], batch size: 159, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:20:57,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=370794.0, ans=0.1 2023-06-19 03:21:22,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=370854.0, ans=10.0 2023-06-19 03:22:06,201 INFO [train.py:996] (0/4) Epoch 3, batch 850, loss[loss=0.2672, simple_loss=0.3252, pruned_loss=0.1046, over 21538.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3509, pruned_loss=0.1135, over 4228226.35 frames. 
], batch size: 212, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:22:11,466 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:22:46,296 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 3.110e+02 3.682e+02 5.059e+02 8.553e+02, threshold=7.364e+02, percent-clipped=4.0 2023-06-19 03:22:48,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=371154.0, ans=0.1 2023-06-19 03:22:53,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=371154.0, ans=0.0 2023-06-19 03:22:59,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=371154.0, ans=0.2 2023-06-19 03:23:25,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=371274.0, ans=0.125 2023-06-19 03:23:33,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=371274.0, ans=0.125 2023-06-19 03:23:43,055 INFO [train.py:996] (0/4) Epoch 3, batch 900, loss[loss=0.2653, simple_loss=0.3233, pruned_loss=0.1037, over 21891.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3463, pruned_loss=0.1126, over 4247966.27 frames. ], batch size: 351, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:23:53,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=371334.0, ans=0.2 2023-06-19 03:24:05,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=371394.0, ans=0.125 2023-06-19 03:24:07,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-06-19 03:24:23,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=371394.0, ans=0.125 2023-06-19 03:24:27,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-19 03:25:05,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=371574.0, ans=0.0 2023-06-19 03:25:10,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=371574.0, ans=0.0 2023-06-19 03:25:12,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=371574.0, ans=0.125 2023-06-19 03:25:15,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=371574.0, ans=0.125 2023-06-19 03:25:24,167 INFO [train.py:996] (0/4) Epoch 3, batch 950, loss[loss=0.2577, simple_loss=0.3106, pruned_loss=0.1024, over 21525.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3441, pruned_loss=0.1114, over 4252035.62 frames. ], batch size: 548, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:25:30,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. 
limit=15.0 2023-06-19 03:25:59,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.854e+02 3.566e+02 4.630e+02 9.213e+02, threshold=7.133e+02, percent-clipped=4.0 2023-06-19 03:26:09,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=371754.0, ans=0.125 2023-06-19 03:26:09,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=371754.0, ans=0.125 2023-06-19 03:27:03,897 INFO [train.py:996] (0/4) Epoch 3, batch 1000, loss[loss=0.3045, simple_loss=0.3558, pruned_loss=0.1267, over 21800.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3446, pruned_loss=0.1114, over 4261445.96 frames. ], batch size: 298, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:27:23,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=371994.0, ans=0.2 2023-06-19 03:27:27,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=371994.0, ans=0.1 2023-06-19 03:27:42,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=372054.0, ans=0.125 2023-06-19 03:27:54,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=372054.0, ans=0.0 2023-06-19 03:28:22,720 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-19 03:28:48,348 INFO [train.py:996] (0/4) Epoch 3, batch 1050, loss[loss=0.2946, simple_loss=0.3471, pruned_loss=0.121, over 21453.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3442, pruned_loss=0.1107, over 4275967.34 frames. ], batch size: 194, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:28:56,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=372234.0, ans=0.2 2023-06-19 03:29:05,914 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:29:24,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.077e+02 3.761e+02 4.435e+02 8.515e+02, threshold=7.523e+02, percent-clipped=2.0 2023-06-19 03:30:26,793 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.94 vs. limit=10.0 2023-06-19 03:30:31,893 INFO [train.py:996] (0/4) Epoch 3, batch 1100, loss[loss=0.2, simple_loss=0.2704, pruned_loss=0.06481, over 16467.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3434, pruned_loss=0.1092, over 4275636.99 frames. ], batch size: 61, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:30:37,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=372534.0, ans=0.2 2023-06-19 03:30:39,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=372534.0, ans=0.0 2023-06-19 03:31:01,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.65 vs. 
limit=15.0 2023-06-19 03:31:34,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=372654.0, ans=0.125 2023-06-19 03:31:36,502 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:32:04,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=372774.0, ans=0.125 2023-06-19 03:32:17,154 INFO [train.py:996] (0/4) Epoch 3, batch 1150, loss[loss=0.3106, simple_loss=0.3685, pruned_loss=0.1263, over 21743.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3443, pruned_loss=0.1097, over 4284835.20 frames. ], batch size: 441, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:33:03,639 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.941e+02 3.564e+02 4.361e+02 9.852e+02, threshold=7.128e+02, percent-clipped=2.0 2023-06-19 03:33:16,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=372954.0, ans=0.125 2023-06-19 03:34:05,591 INFO [train.py:996] (0/4) Epoch 3, batch 1200, loss[loss=0.3207, simple_loss=0.3774, pruned_loss=0.132, over 21724.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.346, pruned_loss=0.1109, over 4283890.52 frames. ], batch size: 389, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:35:30,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=373374.0, ans=0.125 2023-06-19 03:35:49,160 INFO [train.py:996] (0/4) Epoch 3, batch 1250, loss[loss=0.2936, simple_loss=0.3502, pruned_loss=0.1185, over 21787.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3495, pruned_loss=0.1125, over 4290689.33 frames. ], batch size: 247, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:36:24,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=373494.0, ans=0.2 2023-06-19 03:36:30,660 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.020e+02 3.067e+02 3.657e+02 4.609e+02 8.051e+02, threshold=7.314e+02, percent-clipped=2.0 2023-06-19 03:36:51,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=373554.0, ans=0.0 2023-06-19 03:37:07,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-06-19 03:37:16,637 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-06-19 03:37:25,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=373674.0, ans=0.0 2023-06-19 03:37:33,727 INFO [train.py:996] (0/4) Epoch 3, batch 1300, loss[loss=0.3215, simple_loss=0.3695, pruned_loss=0.1367, over 21498.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3503, pruned_loss=0.1126, over 4297857.39 frames. 
], batch size: 194, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:38:31,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=373854.0, ans=0.125 2023-06-19 03:39:04,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=373974.0, ans=0.2 2023-06-19 03:39:18,833 INFO [train.py:996] (0/4) Epoch 3, batch 1350, loss[loss=0.2635, simple_loss=0.3252, pruned_loss=0.1009, over 21498.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3533, pruned_loss=0.1154, over 4297135.29 frames. ], batch size: 131, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:39:55,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=374094.0, ans=0.125 2023-06-19 03:40:01,537 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.516e+02 4.679e+02 5.899e+02 9.616e+02, threshold=9.359e+02, percent-clipped=8.0 2023-06-19 03:40:08,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=374154.0, ans=0.0 2023-06-19 03:41:03,064 INFO [train.py:996] (0/4) Epoch 3, batch 1400, loss[loss=0.3041, simple_loss=0.3569, pruned_loss=0.1256, over 20110.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3511, pruned_loss=0.1148, over 4293866.33 frames. ], batch size: 703, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:41:35,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=374394.0, ans=0.5 2023-06-19 03:42:47,140 INFO [train.py:996] (0/4) Epoch 3, batch 1450, loss[loss=0.2668, simple_loss=0.3365, pruned_loss=0.09852, over 21692.00 frames. ], tot_loss[loss=0.2931, simple_loss=0.3536, pruned_loss=0.1163, over 4295011.93 frames. ], batch size: 112, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:43:23,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=374694.0, ans=0.0 2023-06-19 03:43:28,910 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.105e+02 3.604e+02 4.454e+02 7.120e+02, threshold=7.209e+02, percent-clipped=0.0 2023-06-19 03:43:33,728 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5 2023-06-19 03:44:32,101 INFO [train.py:996] (0/4) Epoch 3, batch 1500, loss[loss=0.3226, simple_loss=0.3809, pruned_loss=0.1321, over 21559.00 frames. ], tot_loss[loss=0.2947, simple_loss=0.3535, pruned_loss=0.1179, over 4298820.06 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:45:06,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=374994.0, ans=0.1 2023-06-19 03:45:50,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=375114.0, ans=0.0 2023-06-19 03:46:17,199 INFO [train.py:996] (0/4) Epoch 3, batch 1550, loss[loss=0.2854, simple_loss=0.3419, pruned_loss=0.1144, over 21767.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.352, pruned_loss=0.1174, over 4289278.82 frames. 
], batch size: 389, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:46:42,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=375234.0, ans=0.125 2023-06-19 03:47:05,570 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.746e+02 3.313e+02 3.948e+02 6.762e+02, threshold=6.626e+02, percent-clipped=0.0 2023-06-19 03:47:23,626 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-19 03:48:04,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=375474.0, ans=0.0 2023-06-19 03:48:14,574 INFO [train.py:996] (0/4) Epoch 3, batch 1600, loss[loss=0.2794, simple_loss=0.3484, pruned_loss=0.1052, over 20026.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3496, pruned_loss=0.1145, over 4279293.73 frames. ], batch size: 702, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:48:25,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=375534.0, ans=0.2 2023-06-19 03:49:54,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=375774.0, ans=0.125 2023-06-19 03:49:59,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=375834.0, ans=0.07 2023-06-19 03:50:00,868 INFO [train.py:996] (0/4) Epoch 3, batch 1650, loss[loss=0.3458, simple_loss=0.4035, pruned_loss=0.1441, over 21931.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3483, pruned_loss=0.1131, over 4280632.84 frames. ], batch size: 372, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:50:38,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.775e+02 3.357e+02 4.211e+02 7.088e+02, threshold=6.714e+02, percent-clipped=2.0 2023-06-19 03:51:05,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-19 03:51:49,361 INFO [train.py:996] (0/4) Epoch 3, batch 1700, loss[loss=0.2956, simple_loss=0.3713, pruned_loss=0.1099, over 21592.00 frames. ], tot_loss[loss=0.2913, simple_loss=0.353, pruned_loss=0.1148, over 4281634.47 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:52:05,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=376134.0, ans=0.0 2023-06-19 03:52:41,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=376254.0, ans=0.0 2023-06-19 03:52:59,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=376314.0, ans=0.1 2023-06-19 03:53:16,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.07 vs. 
limit=22.5 2023-06-19 03:53:33,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=376374.0, ans=0.2 2023-06-19 03:53:33,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=376374.0, ans=0.1 2023-06-19 03:53:43,170 INFO [train.py:996] (0/4) Epoch 3, batch 1750, loss[loss=0.2998, simple_loss=0.3937, pruned_loss=0.103, over 21631.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3532, pruned_loss=0.1145, over 4275332.75 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:54:23,389 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 3.144e+02 4.448e+02 5.330e+02 9.147e+02, threshold=8.897e+02, percent-clipped=12.0 2023-06-19 03:54:34,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=376554.0, ans=0.125 2023-06-19 03:54:45,671 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.37 vs. limit=10.0 2023-06-19 03:54:46,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=376614.0, ans=0.2 2023-06-19 03:55:31,661 INFO [train.py:996] (0/4) Epoch 3, batch 1800, loss[loss=0.2627, simple_loss=0.3453, pruned_loss=0.09004, over 21403.00 frames. ], tot_loss[loss=0.2866, simple_loss=0.3506, pruned_loss=0.1113, over 4275645.13 frames. ], batch size: 194, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:56:03,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=376794.0, ans=0.05 2023-06-19 03:56:26,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=376854.0, ans=0.125 2023-06-19 03:56:57,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.04 vs. limit=22.5 2023-06-19 03:57:11,739 INFO [train.py:996] (0/4) Epoch 3, batch 1850, loss[loss=0.2587, simple_loss=0.3545, pruned_loss=0.08148, over 21004.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3509, pruned_loss=0.1097, over 4274104.18 frames. ], batch size: 607, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:57:31,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=377034.0, ans=0.0 2023-06-19 03:58:00,692 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 2.940e+02 3.521e+02 4.849e+02 8.658e+02, threshold=7.043e+02, percent-clipped=0.0 2023-06-19 03:58:13,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.83 vs. limit=10.0 2023-06-19 03:58:48,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=377274.0, ans=0.1 2023-06-19 03:59:02,452 INFO [train.py:996] (0/4) Epoch 3, batch 1900, loss[loss=0.237, simple_loss=0.3062, pruned_loss=0.08388, over 21647.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3514, pruned_loss=0.11, over 4277596.67 frames. 
], batch size: 247, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:59:55,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.09 vs. limit=6.0 2023-06-19 04:00:06,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=377514.0, ans=0.125 2023-06-19 04:00:46,522 INFO [train.py:996] (0/4) Epoch 3, batch 1950, loss[loss=0.2598, simple_loss=0.3263, pruned_loss=0.09664, over 21489.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.348, pruned_loss=0.1097, over 4281629.06 frames. ], batch size: 212, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 04:01:26,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-19 04:01:30,645 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.180e+02 3.084e+02 3.765e+02 4.629e+02 7.601e+02, threshold=7.530e+02, percent-clipped=2.0 2023-06-19 04:01:33,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=377754.0, ans=0.5 2023-06-19 04:01:42,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=377754.0, ans=15.0 2023-06-19 04:02:31,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=377934.0, ans=0.035 2023-06-19 04:02:32,687 INFO [train.py:996] (0/4) Epoch 3, batch 2000, loss[loss=0.2505, simple_loss=0.3148, pruned_loss=0.09306, over 20774.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3416, pruned_loss=0.1065, over 4272591.78 frames. ], batch size: 607, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:03:29,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=378054.0, ans=0.125 2023-06-19 04:04:11,041 INFO [train.py:996] (0/4) Epoch 3, batch 2050, loss[loss=0.2775, simple_loss=0.3321, pruned_loss=0.1114, over 21929.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3435, pruned_loss=0.107, over 4282904.55 frames. ], batch size: 316, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:04:41,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378294.0, ans=0.1 2023-06-19 04:04:43,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=378294.0, ans=0.125 2023-06-19 04:04:50,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. 
limit=15.0 2023-06-19 04:04:54,239 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.024e+02 2.993e+02 3.653e+02 4.561e+02 8.702e+02, threshold=7.306e+02, percent-clipped=1.0 2023-06-19 04:04:56,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=378354.0, ans=0.125 2023-06-19 04:05:18,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=378414.0, ans=0.125 2023-06-19 04:05:29,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=378414.0, ans=0.025 2023-06-19 04:05:48,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=378474.0, ans=0.125 2023-06-19 04:05:54,042 INFO [train.py:996] (0/4) Epoch 3, batch 2100, loss[loss=0.2435, simple_loss=0.2928, pruned_loss=0.09707, over 20243.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.347, pruned_loss=0.1097, over 4282253.98 frames. ], batch size: 703, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:06:58,150 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-19 04:07:01,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=378714.0, ans=0.125 2023-06-19 04:07:27,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=378774.0, ans=0.0 2023-06-19 04:07:32,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=12.0 2023-06-19 04:07:39,945 INFO [train.py:996] (0/4) Epoch 3, batch 2150, loss[loss=0.3317, simple_loss=0.3627, pruned_loss=0.1503, over 21611.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3462, pruned_loss=0.1113, over 4273516.48 frames. ], batch size: 415, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:07:44,191 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:08:30,019 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.207e+02 3.919e+02 5.012e+02 8.780e+02, threshold=7.837e+02, percent-clipped=4.0 2023-06-19 04:09:24,830 INFO [train.py:996] (0/4) Epoch 3, batch 2200, loss[loss=0.3265, simple_loss=0.4105, pruned_loss=0.1212, over 21606.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3527, pruned_loss=0.1122, over 4275450.16 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:10:15,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=379254.0, ans=0.125 2023-06-19 04:10:23,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=379254.0, ans=0.1 2023-06-19 04:11:03,874 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.25 vs. 
limit=15.0 2023-06-19 04:11:04,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=379374.0, ans=0.0 2023-06-19 04:11:09,117 INFO [train.py:996] (0/4) Epoch 3, batch 2250, loss[loss=0.2883, simple_loss=0.3229, pruned_loss=0.1268, over 21406.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3487, pruned_loss=0.1099, over 4279180.74 frames. ], batch size: 475, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:11:32,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=379434.0, ans=0.125 2023-06-19 04:11:52,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=379494.0, ans=0.125 2023-06-19 04:11:56,991 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.971e+02 3.785e+02 4.786e+02 8.748e+02, threshold=7.570e+02, percent-clipped=4.0 2023-06-19 04:12:02,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=379554.0, ans=0.0 2023-06-19 04:12:10,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=379554.0, ans=0.125 2023-06-19 04:12:33,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=379614.0, ans=0.125 2023-06-19 04:12:39,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=379674.0, ans=0.0 2023-06-19 04:12:52,024 INFO [train.py:996] (0/4) Epoch 3, batch 2300, loss[loss=0.2812, simple_loss=0.3344, pruned_loss=0.114, over 21452.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3439, pruned_loss=0.1086, over 4282741.79 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:13:33,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=379794.0, ans=0.0 2023-06-19 04:14:35,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=380034.0, ans=0.0 2023-06-19 04:14:42,045 INFO [train.py:996] (0/4) Epoch 3, batch 2350, loss[loss=0.2777, simple_loss=0.3286, pruned_loss=0.1133, over 21839.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3402, pruned_loss=0.1093, over 4284242.07 frames. ], batch size: 107, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 04:15:11,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.62 vs. limit=22.5 2023-06-19 04:15:14,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=380094.0, ans=0.07 2023-06-19 04:15:27,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.593e+02 3.227e+02 3.663e+02 5.028e+02 9.666e+02, threshold=7.327e+02, percent-clipped=5.0 2023-06-19 04:16:08,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=380274.0, ans=0.125 2023-06-19 04:16:22,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=380274.0, ans=0.0 2023-06-19 04:16:35,293 INFO [train.py:996] (0/4) Epoch 3, batch 2400, loss[loss=0.2888, simple_loss=0.3496, pruned_loss=0.114, over 21804.00 frames. 
], tot_loss[loss=0.284, simple_loss=0.3441, pruned_loss=0.1119, over 4279658.32 frames. ], batch size: 247, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:16:35,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=380334.0, ans=0.125 2023-06-19 04:16:39,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-19 04:17:20,865 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.00 vs. limit=5.0 2023-06-19 04:18:21,217 INFO [train.py:996] (0/4) Epoch 3, batch 2450, loss[loss=0.3411, simple_loss=0.3938, pruned_loss=0.1442, over 21296.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3513, pruned_loss=0.1156, over 4276983.03 frames. ], batch size: 143, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:19:00,872 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 3.007e+02 3.779e+02 4.461e+02 8.893e+02, threshold=7.558e+02, percent-clipped=3.0 2023-06-19 04:19:22,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-06-19 04:19:39,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=380874.0, ans=0.125 2023-06-19 04:20:04,650 INFO [train.py:996] (0/4) Epoch 3, batch 2500, loss[loss=0.2713, simple_loss=0.3278, pruned_loss=0.1073, over 21513.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3479, pruned_loss=0.1132, over 4262898.82 frames. ], batch size: 414, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:20:28,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=380994.0, ans=0.125 2023-06-19 04:20:37,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380994.0, ans=0.1 2023-06-19 04:20:48,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=381054.0, ans=0.125 2023-06-19 04:20:52,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=381054.0, ans=0.0 2023-06-19 04:21:49,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=381234.0, ans=0.0 2023-06-19 04:21:50,614 INFO [train.py:996] (0/4) Epoch 3, batch 2550, loss[loss=0.2875, simple_loss=0.3318, pruned_loss=0.1216, over 21137.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3447, pruned_loss=0.1112, over 4265735.49 frames. ], batch size: 159, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:22:31,558 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.967e+02 3.517e+02 4.789e+02 7.584e+02, threshold=7.035e+02, percent-clipped=1.0 2023-06-19 04:23:36,610 INFO [train.py:996] (0/4) Epoch 3, batch 2600, loss[loss=0.3312, simple_loss=0.3895, pruned_loss=0.1364, over 16794.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3478, pruned_loss=0.1142, over 4264180.15 frames. 
], batch size: 60, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:24:20,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=381654.0, ans=0.125 2023-06-19 04:24:47,108 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:25:18,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=381774.0, ans=0.0 2023-06-19 04:25:24,336 INFO [train.py:996] (0/4) Epoch 3, batch 2650, loss[loss=0.2419, simple_loss=0.307, pruned_loss=0.08839, over 20821.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3486, pruned_loss=0.1154, over 4274074.79 frames. ], batch size: 608, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:25:57,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=381894.0, ans=0.0 2023-06-19 04:25:58,145 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5 2023-06-19 04:25:59,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=381894.0, ans=0.0 2023-06-19 04:26:01,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=381894.0, ans=10.0 2023-06-19 04:26:05,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.347e+02 3.151e+02 3.898e+02 4.845e+02 8.708e+02, threshold=7.796e+02, percent-clipped=4.0 2023-06-19 04:26:24,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=382014.0, ans=0.125 2023-06-19 04:26:26,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=382014.0, ans=0.1 2023-06-19 04:26:57,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=382074.0, ans=0.125 2023-06-19 04:27:09,898 INFO [train.py:996] (0/4) Epoch 3, batch 2700, loss[loss=0.3137, simple_loss=0.3748, pruned_loss=0.1263, over 21585.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3459, pruned_loss=0.1125, over 4268816.80 frames. ], batch size: 473, lr: 1.19e-02, grad_scale: 16.0 2023-06-19 04:28:13,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5 2023-06-19 04:28:54,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=382434.0, ans=0.125 2023-06-19 04:28:55,313 INFO [train.py:996] (0/4) Epoch 3, batch 2750, loss[loss=0.2636, simple_loss=0.3565, pruned_loss=0.08531, over 21832.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3439, pruned_loss=0.112, over 4276719.93 frames. 
], batch size: 351, lr: 1.19e-02, grad_scale: 16.0 2023-06-19 04:29:19,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=382494.0, ans=0.0 2023-06-19 04:29:37,648 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.554e+02 3.438e+02 4.314e+02 5.827e+02 1.229e+03, threshold=8.627e+02, percent-clipped=3.0 2023-06-19 04:30:44,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=382734.0, ans=0.125 2023-06-19 04:30:45,982 INFO [train.py:996] (0/4) Epoch 3, batch 2800, loss[loss=0.3729, simple_loss=0.4321, pruned_loss=0.1568, over 21655.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3498, pruned_loss=0.1145, over 4279197.22 frames. ], batch size: 389, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:30:46,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=382734.0, ans=0.125 2023-06-19 04:30:52,167 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-19 04:30:57,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=382734.0, ans=0.125 2023-06-19 04:31:07,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=382794.0, ans=15.0 2023-06-19 04:31:32,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=382854.0, ans=0.0 2023-06-19 04:32:32,060 INFO [train.py:996] (0/4) Epoch 3, batch 2850, loss[loss=0.2684, simple_loss=0.3236, pruned_loss=0.1066, over 21629.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3493, pruned_loss=0.1145, over 4278614.31 frames. ], batch size: 263, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:32:37,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=383034.0, ans=0.125 2023-06-19 04:33:19,149 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 3.317e+02 3.934e+02 4.710e+02 8.134e+02, threshold=7.867e+02, percent-clipped=0.0 2023-06-19 04:33:27,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=383154.0, ans=0.2 2023-06-19 04:33:52,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=15.0 2023-06-19 04:33:55,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=383274.0, ans=0.125 2023-06-19 04:34:04,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=383274.0, ans=0.5 2023-06-19 04:34:17,107 INFO [train.py:996] (0/4) Epoch 3, batch 2900, loss[loss=0.2207, simple_loss=0.2698, pruned_loss=0.08584, over 20787.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3452, pruned_loss=0.1136, over 4284435.10 frames. 
], batch size: 608, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:34:22,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=383334.0, ans=0.0 2023-06-19 04:34:38,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=383394.0, ans=0.0 2023-06-19 04:34:48,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=383394.0, ans=0.07 2023-06-19 04:34:51,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=383394.0, ans=0.125 2023-06-19 04:35:24,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=383514.0, ans=0.125 2023-06-19 04:35:39,597 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.37 vs. limit=6.0 2023-06-19 04:35:56,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=383574.0, ans=0.05 2023-06-19 04:35:58,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=15.0 2023-06-19 04:36:02,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.08 vs. limit=22.5 2023-06-19 04:36:02,571 INFO [train.py:996] (0/4) Epoch 3, batch 2950, loss[loss=0.3787, simple_loss=0.4378, pruned_loss=0.1598, over 21594.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3466, pruned_loss=0.1135, over 4288563.99 frames. ], batch size: 508, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:36:45,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=383754.0, ans=0.0 2023-06-19 04:36:50,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.976e+02 3.392e+02 4.326e+02 8.351e+02, threshold=6.785e+02, percent-clipped=1.0 2023-06-19 04:37:04,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=383754.0, ans=0.2 2023-06-19 04:37:20,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=383814.0, ans=0.0 2023-06-19 04:37:48,121 INFO [train.py:996] (0/4) Epoch 3, batch 3000, loss[loss=0.3612, simple_loss=0.4067, pruned_loss=0.1579, over 21574.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3502, pruned_loss=0.1149, over 4289532.28 frames. ], batch size: 414, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:37:48,122 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 04:38:05,905 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2668, simple_loss=0.3633, pruned_loss=0.08521, over 1796401.00 frames. 
2023-06-19 04:38:05,906 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-19 04:38:33,758 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-64000.pt 2023-06-19 04:38:40,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=383994.0, ans=0.125 2023-06-19 04:38:51,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=383994.0, ans=0.2 2023-06-19 04:39:03,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=384054.0, ans=0.0 2023-06-19 04:39:03,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=384054.0, ans=0.125 2023-06-19 04:39:52,626 INFO [train.py:996] (0/4) Epoch 3, batch 3050, loss[loss=0.2718, simple_loss=0.3528, pruned_loss=0.09544, over 21672.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3516, pruned_loss=0.1135, over 4291991.19 frames. ], batch size: 414, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:40:30,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-06-19 04:40:33,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=384294.0, ans=0.04949747468305833 2023-06-19 04:40:44,361 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 3.108e+02 3.737e+02 4.686e+02 8.351e+02, threshold=7.474e+02, percent-clipped=4.0 2023-06-19 04:40:45,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=384354.0, ans=0.125 2023-06-19 04:41:05,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.57 vs. limit=10.0 2023-06-19 04:41:18,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=384474.0, ans=0.125 2023-06-19 04:41:41,706 INFO [train.py:996] (0/4) Epoch 3, batch 3100, loss[loss=0.2531, simple_loss=0.3354, pruned_loss=0.08545, over 21678.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3507, pruned_loss=0.1119, over 4288546.95 frames. ], batch size: 298, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:42:09,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=384594.0, ans=0.09899494936611666 2023-06-19 04:42:10,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=384594.0, ans=0.125 2023-06-19 04:42:20,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-19 04:43:20,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=384774.0, ans=0.125 2023-06-19 04:43:32,013 INFO [train.py:996] (0/4) Epoch 3, batch 3150, loss[loss=0.2838, simple_loss=0.3479, pruned_loss=0.1098, over 21522.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3573, pruned_loss=0.1156, over 4292172.85 frames. 
], batch size: 230, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:44:16,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=384954.0, ans=0.125 2023-06-19 04:44:19,355 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 3.177e+02 3.933e+02 4.816e+02 8.908e+02, threshold=7.865e+02, percent-clipped=2.0 2023-06-19 04:44:25,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=384954.0, ans=0.125 2023-06-19 04:44:51,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=12.0 2023-06-19 04:45:23,773 INFO [train.py:996] (0/4) Epoch 3, batch 3200, loss[loss=0.3037, simple_loss=0.367, pruned_loss=0.1202, over 21926.00 frames. ], tot_loss[loss=0.2931, simple_loss=0.3569, pruned_loss=0.1147, over 4285518.79 frames. ], batch size: 317, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:45:25,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=385134.0, ans=0.0 2023-06-19 04:45:27,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=385134.0, ans=0.0 2023-06-19 04:45:31,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=385134.0, ans=0.125 2023-06-19 04:46:01,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=385254.0, ans=0.1 2023-06-19 04:46:06,147 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-06-19 04:46:30,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=385314.0, ans=0.09899494936611666 2023-06-19 04:46:33,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-19 04:46:41,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=385314.0, ans=0.125 2023-06-19 04:47:08,029 INFO [train.py:996] (0/4) Epoch 3, batch 3250, loss[loss=0.2678, simple_loss=0.3174, pruned_loss=0.1091, over 21649.00 frames. ], tot_loss[loss=0.2941, simple_loss=0.3561, pruned_loss=0.116, over 4289153.24 frames. ], batch size: 247, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:47:50,345 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.327e+02 4.160e+02 5.584e+02 8.725e+02, threshold=8.319e+02, percent-clipped=2.0 2023-06-19 04:48:21,677 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-19 04:48:59,528 INFO [train.py:996] (0/4) Epoch 3, batch 3300, loss[loss=0.2725, simple_loss=0.3629, pruned_loss=0.09103, over 20845.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3526, pruned_loss=0.1145, over 4287020.13 frames. 
], batch size: 608, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:49:18,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=385794.0, ans=0.125 2023-06-19 04:50:44,363 INFO [train.py:996] (0/4) Epoch 3, batch 3350, loss[loss=0.2661, simple_loss=0.3241, pruned_loss=0.1041, over 21818.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3538, pruned_loss=0.1134, over 4284984.27 frames. ], batch size: 247, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:51:03,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=386094.0, ans=0.0 2023-06-19 04:51:10,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=386094.0, ans=0.125 2023-06-19 04:51:20,269 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.290e+02 3.790e+02 4.247e+02 7.031e+02, threshold=7.579e+02, percent-clipped=0.0 2023-06-19 04:51:29,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=386154.0, ans=0.125 2023-06-19 04:51:38,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=386154.0, ans=0.125 2023-06-19 04:51:41,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=386154.0, ans=0.125 2023-06-19 04:52:27,720 INFO [train.py:996] (0/4) Epoch 3, batch 3400, loss[loss=0.2532, simple_loss=0.3219, pruned_loss=0.09219, over 21672.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.353, pruned_loss=0.114, over 4289762.12 frames. ], batch size: 298, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:52:28,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=386334.0, ans=0.125 2023-06-19 04:53:20,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.53 vs. limit=15.0 2023-06-19 04:53:40,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=386514.0, ans=10.0 2023-06-19 04:54:13,040 INFO [train.py:996] (0/4) Epoch 3, batch 3450, loss[loss=0.2847, simple_loss=0.3297, pruned_loss=0.1199, over 21827.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3508, pruned_loss=0.1137, over 4286071.61 frames. ], batch size: 372, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:54:27,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=386634.0, ans=0.125 2023-06-19 04:54:51,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=386754.0, ans=0.0 2023-06-19 04:55:06,161 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.192e+02 3.930e+02 4.779e+02 8.558e+02, threshold=7.861e+02, percent-clipped=2.0 2023-06-19 04:55:49,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386874.0, ans=0.1 2023-06-19 04:55:57,928 INFO [train.py:996] (0/4) Epoch 3, batch 3500, loss[loss=0.3304, simple_loss=0.3799, pruned_loss=0.1405, over 21798.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.3602, pruned_loss=0.1185, over 4287662.68 frames. 
], batch size: 118, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 04:56:54,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-19 04:57:44,031 INFO [train.py:996] (0/4) Epoch 3, batch 3550, loss[loss=0.2571, simple_loss=0.3138, pruned_loss=0.1002, over 21380.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3603, pruned_loss=0.1191, over 4286975.32 frames. ], batch size: 211, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 04:58:33,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.02 vs. limit=6.0 2023-06-19 04:58:38,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.279e+02 3.936e+02 4.776e+02 8.299e+02, threshold=7.873e+02, percent-clipped=2.0 2023-06-19 04:58:46,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=387354.0, ans=0.125 2023-06-19 04:59:23,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=387474.0, ans=0.125 2023-06-19 04:59:31,864 INFO [train.py:996] (0/4) Epoch 3, batch 3600, loss[loss=0.2764, simple_loss=0.3398, pruned_loss=0.1065, over 21235.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3564, pruned_loss=0.1192, over 4284756.43 frames. ], batch size: 549, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:00:41,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=387654.0, ans=0.1 2023-06-19 05:00:59,896 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.34 vs. limit=22.5 2023-06-19 05:01:02,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=387774.0, ans=0.1 2023-06-19 05:01:16,083 INFO [train.py:996] (0/4) Epoch 3, batch 3650, loss[loss=0.3368, simple_loss=0.406, pruned_loss=0.1338, over 21555.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.359, pruned_loss=0.1198, over 4278222.56 frames. ], batch size: 508, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:01:55,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=387894.0, ans=0.1 2023-06-19 05:02:08,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 3.313e+02 3.848e+02 4.708e+02 1.033e+03, threshold=7.696e+02, percent-clipped=4.0 2023-06-19 05:02:27,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.53 vs. limit=10.0 2023-06-19 05:02:59,747 INFO [train.py:996] (0/4) Epoch 3, batch 3700, loss[loss=0.3224, simple_loss=0.3897, pruned_loss=0.1275, over 21427.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3574, pruned_loss=0.1184, over 4283187.04 frames. 
], batch size: 548, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:03:34,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=388194.0, ans=0.1 2023-06-19 05:03:54,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=388254.0, ans=0.1 2023-06-19 05:04:01,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=388254.0, ans=0.125 2023-06-19 05:04:56,334 INFO [train.py:996] (0/4) Epoch 3, batch 3750, loss[loss=0.4123, simple_loss=0.447, pruned_loss=0.1888, over 21724.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3545, pruned_loss=0.1168, over 4286728.04 frames. ], batch size: 441, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:04:56,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=388434.0, ans=0.125 2023-06-19 05:05:07,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=388434.0, ans=0.125 2023-06-19 05:05:12,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=388434.0, ans=0.125 2023-06-19 05:05:17,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=388494.0, ans=0.1 2023-06-19 05:05:43,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 3.137e+02 4.357e+02 5.330e+02 7.776e+02, threshold=8.713e+02, percent-clipped=1.0 2023-06-19 05:05:47,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=388554.0, ans=0.0 2023-06-19 05:05:52,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=388554.0, ans=0.125 2023-06-19 05:06:23,402 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.17 vs. limit=22.5 2023-06-19 05:06:46,842 INFO [train.py:996] (0/4) Epoch 3, batch 3800, loss[loss=0.3207, simple_loss=0.3734, pruned_loss=0.134, over 21756.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3537, pruned_loss=0.1163, over 4286816.37 frames. ], batch size: 392, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:06:57,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=388734.0, ans=0.0 2023-06-19 05:07:02,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=388794.0, ans=0.05 2023-06-19 05:07:07,623 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.64 vs. 
limit=15.0 2023-06-19 05:07:56,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=388974.0, ans=0.1 2023-06-19 05:08:08,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=388974.0, ans=0.025 2023-06-19 05:08:19,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=22.5 2023-06-19 05:08:23,812 INFO [train.py:996] (0/4) Epoch 3, batch 3850, loss[loss=0.2458, simple_loss=0.295, pruned_loss=0.09826, over 21696.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3497, pruned_loss=0.1164, over 4282414.02 frames. ], batch size: 124, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:08:48,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=389094.0, ans=0.125 2023-06-19 05:09:10,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 3.065e+02 3.544e+02 4.567e+02 7.617e+02, threshold=7.087e+02, percent-clipped=0.0 2023-06-19 05:09:24,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=389214.0, ans=0.0 2023-06-19 05:09:28,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-19 05:09:31,314 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:09:31,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-06-19 05:10:07,271 INFO [train.py:996] (0/4) Epoch 3, batch 3900, loss[loss=0.3403, simple_loss=0.3667, pruned_loss=0.157, over 21766.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3451, pruned_loss=0.1148, over 4278813.21 frames. ], batch size: 508, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:11:05,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=389454.0, ans=0.0 2023-06-19 05:11:23,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=389574.0, ans=0.1 2023-06-19 05:11:32,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=389574.0, ans=0.0 2023-06-19 05:11:33,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=389574.0, ans=0.0 2023-06-19 05:11:51,758 INFO [train.py:996] (0/4) Epoch 3, batch 3950, loss[loss=0.1916, simple_loss=0.2747, pruned_loss=0.05425, over 21559.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3434, pruned_loss=0.1122, over 4273867.73 frames. 
], batch size: 230, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:12:02,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=389634.0, ans=0.125 2023-06-19 05:12:03,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=389634.0, ans=0.0 2023-06-19 05:12:27,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=389694.0, ans=0.125 2023-06-19 05:12:32,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=389694.0, ans=0.1 2023-06-19 05:12:38,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.040e+02 3.554e+02 4.206e+02 5.675e+02, threshold=7.109e+02, percent-clipped=0.0 2023-06-19 05:12:39,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=389754.0, ans=0.1 2023-06-19 05:13:26,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=389874.0, ans=0.125 2023-06-19 05:13:36,459 INFO [train.py:996] (0/4) Epoch 3, batch 4000, loss[loss=0.2556, simple_loss=0.3012, pruned_loss=0.105, over 21206.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3353, pruned_loss=0.1079, over 4276442.17 frames. ], batch size: 159, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:14:02,870 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=22.5 2023-06-19 05:14:03,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=389994.0, ans=0.2 2023-06-19 05:14:32,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=390054.0, ans=0.125 2023-06-19 05:15:23,384 INFO [train.py:996] (0/4) Epoch 3, batch 4050, loss[loss=0.2395, simple_loss=0.3145, pruned_loss=0.08221, over 21417.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3362, pruned_loss=0.1064, over 4276604.30 frames. ], batch size: 194, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:16:10,711 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.094e+02 3.976e+02 4.759e+02 9.787e+02, threshold=7.952e+02, percent-clipped=5.0 2023-06-19 05:17:13,596 INFO [train.py:996] (0/4) Epoch 3, batch 4100, loss[loss=0.2708, simple_loss=0.3721, pruned_loss=0.08472, over 19914.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3389, pruned_loss=0.1076, over 4286174.71 frames. ], batch size: 703, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:18:13,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=390714.0, ans=0.125 2023-06-19 05:18:37,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=390774.0, ans=0.125 2023-06-19 05:18:58,808 INFO [train.py:996] (0/4) Epoch 3, batch 4150, loss[loss=0.3262, simple_loss=0.3741, pruned_loss=0.1391, over 21467.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3395, pruned_loss=0.1046, over 4275849.78 frames. 
], batch size: 509, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:19:13,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=390834.0, ans=0.0 2023-06-19 05:19:26,689 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:19:41,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 3.003e+02 3.732e+02 5.110e+02 9.922e+02, threshold=7.464e+02, percent-clipped=2.0 2023-06-19 05:19:57,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=390954.0, ans=0.0 2023-06-19 05:20:10,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=391014.0, ans=0.0 2023-06-19 05:20:29,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=391074.0, ans=0.2 2023-06-19 05:20:51,298 INFO [train.py:996] (0/4) Epoch 3, batch 4200, loss[loss=0.3658, simple_loss=0.4527, pruned_loss=0.1395, over 21511.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3407, pruned_loss=0.1052, over 4274342.97 frames. ], batch size: 471, lr: 1.18e-02, grad_scale: 16.0 2023-06-19 05:21:19,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=391194.0, ans=0.2 2023-06-19 05:21:52,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=391254.0, ans=0.125 2023-06-19 05:22:07,729 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:22:19,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-19 05:22:31,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=391374.0, ans=0.0 2023-06-19 05:22:40,230 INFO [train.py:996] (0/4) Epoch 3, batch 4250, loss[loss=0.3705, simple_loss=0.4365, pruned_loss=0.1522, over 21595.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.3459, pruned_loss=0.1069, over 4270393.91 frames. ], batch size: 414, lr: 1.18e-02, grad_scale: 16.0 2023-06-19 05:23:11,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=391494.0, ans=0.125 2023-06-19 05:23:30,824 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.359e+02 3.309e+02 4.046e+02 4.889e+02 9.500e+02, threshold=8.092e+02, percent-clipped=4.0 2023-06-19 05:23:33,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=391554.0, ans=0.125 2023-06-19 05:23:38,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=391554.0, ans=0.125 2023-06-19 05:24:16,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=391674.0, ans=0.07 2023-06-19 05:24:26,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=391734.0, ans=0.035 2023-06-19 05:24:28,001 INFO [train.py:996] (0/4) Epoch 3, batch 4300, loss[loss=0.3548, simple_loss=0.4401, pruned_loss=0.1348, over 21628.00 frames. 
], tot_loss[loss=0.2853, simple_loss=0.3516, pruned_loss=0.1096, over 4267448.69 frames. ], batch size: 441, lr: 1.18e-02, grad_scale: 16.0 2023-06-19 05:24:30,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391734.0, ans=0.1 2023-06-19 05:24:39,641 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=22.5 2023-06-19 05:24:52,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=22.5 2023-06-19 05:26:08,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.64 vs. limit=10.0 2023-06-19 05:26:13,019 INFO [train.py:996] (0/4) Epoch 3, batch 4350, loss[loss=0.2395, simple_loss=0.297, pruned_loss=0.09099, over 21504.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3486, pruned_loss=0.1078, over 4268410.06 frames. ], batch size: 212, lr: 1.18e-02, grad_scale: 16.0 2023-06-19 05:27:08,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 3.136e+02 3.673e+02 4.293e+02 1.094e+03, threshold=7.346e+02, percent-clipped=4.0 2023-06-19 05:27:08,835 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:27:22,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392154.0, ans=0.1 2023-06-19 05:27:59,250 INFO [train.py:996] (0/4) Epoch 3, batch 4400, loss[loss=0.2942, simple_loss=0.3479, pruned_loss=0.1202, over 21489.00 frames. ], tot_loss[loss=0.2811, simple_loss=0.346, pruned_loss=0.1081, over 4270744.47 frames. ], batch size: 441, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:28:37,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=392394.0, ans=0.125 2023-06-19 05:28:41,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=392394.0, ans=0.2 2023-06-19 05:28:50,426 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:29:00,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=392454.0, ans=0.2 2023-06-19 05:29:31,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=392574.0, ans=0.0 2023-06-19 05:29:49,767 INFO [train.py:996] (0/4) Epoch 3, batch 4450, loss[loss=0.3047, simple_loss=0.3889, pruned_loss=0.1102, over 21654.00 frames. ], tot_loss[loss=0.2874, simple_loss=0.3552, pruned_loss=0.1098, over 4275453.70 frames. ], batch size: 247, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:30:02,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.82 vs. 
limit=15.0 2023-06-19 05:30:17,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=392694.0, ans=0.0 2023-06-19 05:30:40,193 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.016e+02 3.680e+02 4.427e+02 7.679e+02, threshold=7.360e+02, percent-clipped=2.0 2023-06-19 05:30:42,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=392754.0, ans=0.125 2023-06-19 05:31:03,250 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-19 05:31:08,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=392874.0, ans=0.0 2023-06-19 05:31:35,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=392934.0, ans=0.2 2023-06-19 05:31:36,782 INFO [train.py:996] (0/4) Epoch 3, batch 4500, loss[loss=0.2496, simple_loss=0.3242, pruned_loss=0.08752, over 21243.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3566, pruned_loss=0.1123, over 4274820.36 frames. ], batch size: 176, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:31:53,918 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-19 05:33:28,842 INFO [train.py:996] (0/4) Epoch 3, batch 4550, loss[loss=0.3629, simple_loss=0.4142, pruned_loss=0.1558, over 21433.00 frames. ], tot_loss[loss=0.2944, simple_loss=0.3621, pruned_loss=0.1133, over 4272016.43 frames. ], batch size: 471, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:34:03,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=393294.0, ans=0.125 2023-06-19 05:34:13,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.526e+02 4.619e+02 6.028e+02 1.155e+03, threshold=9.238e+02, percent-clipped=14.0 2023-06-19 05:34:15,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=393354.0, ans=0.025 2023-06-19 05:35:01,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=393474.0, ans=0.125 2023-06-19 05:35:14,626 INFO [train.py:996] (0/4) Epoch 3, batch 4600, loss[loss=0.3173, simple_loss=0.4155, pruned_loss=0.1095, over 21196.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3644, pruned_loss=0.1155, over 4276554.21 frames. ], batch size: 548, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:35:23,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=393534.0, ans=0.0 2023-06-19 05:36:42,119 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:36:43,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=393774.0, ans=0.0 2023-06-19 05:36:55,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=393774.0, ans=0.0 2023-06-19 05:37:01,739 INFO [train.py:996] (0/4) Epoch 3, batch 4650, loss[loss=0.2696, simple_loss=0.3255, pruned_loss=0.1069, over 21278.00 frames. 
], tot_loss[loss=0.2929, simple_loss=0.3576, pruned_loss=0.1141, over 4281552.95 frames. ], batch size: 143, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:37:16,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=393834.0, ans=0.0 2023-06-19 05:37:26,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=393894.0, ans=0.1 2023-06-19 05:37:39,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=393954.0, ans=0.125 2023-06-19 05:37:44,146 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.775e+02 3.310e+02 3.723e+02 7.638e+02, threshold=6.620e+02, percent-clipped=0.0 2023-06-19 05:37:47,736 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.37 vs. limit=15.0 2023-06-19 05:38:37,047 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:38:42,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=394074.0, ans=0.0 2023-06-19 05:38:53,962 INFO [train.py:996] (0/4) Epoch 3, batch 4700, loss[loss=0.2297, simple_loss=0.2841, pruned_loss=0.08762, over 21471.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3464, pruned_loss=0.1101, over 4279608.89 frames. ], batch size: 212, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:40:32,953 INFO [train.py:996] (0/4) Epoch 3, batch 4750, loss[loss=0.2608, simple_loss=0.3131, pruned_loss=0.1043, over 21678.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3407, pruned_loss=0.1092, over 4282514.60 frames. ], batch size: 230, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:40:41,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=394434.0, ans=0.125 2023-06-19 05:40:43,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=394434.0, ans=0.1 2023-06-19 05:41:13,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394554.0, ans=0.1 2023-06-19 05:41:21,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 3.052e+02 3.874e+02 5.001e+02 1.083e+03, threshold=7.748e+02, percent-clipped=9.0 2023-06-19 05:41:44,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=394614.0, ans=0.2 2023-06-19 05:41:45,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=8.0 2023-06-19 05:42:25,186 INFO [train.py:996] (0/4) Epoch 3, batch 4800, loss[loss=0.2711, simple_loss=0.3586, pruned_loss=0.09176, over 21800.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3425, pruned_loss=0.1106, over 4290275.45 frames. ], batch size: 371, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:43:09,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. 
limit=12.0 2023-06-19 05:43:17,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=394854.0, ans=0.125 2023-06-19 05:43:58,557 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.78 vs. limit=22.5 2023-06-19 05:44:03,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=394974.0, ans=0.125 2023-06-19 05:44:10,708 INFO [train.py:996] (0/4) Epoch 3, batch 4850, loss[loss=0.2611, simple_loss=0.3186, pruned_loss=0.1018, over 21878.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3396, pruned_loss=0.1095, over 4291827.25 frames. ], batch size: 118, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:44:13,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=395034.0, ans=0.125 2023-06-19 05:44:26,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=395094.0, ans=0.05 2023-06-19 05:44:29,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.48 vs. limit=15.0 2023-06-19 05:44:29,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=395094.0, ans=0.0 2023-06-19 05:44:54,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.516e+02 4.493e+02 6.112e+02 1.101e+03, threshold=8.986e+02, percent-clipped=11.0 2023-06-19 05:45:32,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=395274.0, ans=0.125 2023-06-19 05:45:42,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=12.0 2023-06-19 05:45:54,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=395334.0, ans=0.09899494936611666 2023-06-19 05:45:55,759 INFO [train.py:996] (0/4) Epoch 3, batch 4900, loss[loss=0.2842, simple_loss=0.354, pruned_loss=0.1072, over 21591.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3412, pruned_loss=0.1101, over 4282453.83 frames. ], batch size: 230, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:45:56,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-19 05:46:10,299 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=22.5 2023-06-19 05:46:12,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=395394.0, ans=0.0 2023-06-19 05:46:16,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=395394.0, ans=0.07 2023-06-19 05:46:57,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=395514.0, ans=0.125 2023-06-19 05:47:41,186 INFO [train.py:996] (0/4) Epoch 3, batch 4950, loss[loss=0.2242, simple_loss=0.301, pruned_loss=0.07368, over 21445.00 frames. 
], tot_loss[loss=0.2795, simple_loss=0.3437, pruned_loss=0.1077, over 4288176.45 frames. ], batch size: 194, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:47:41,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=395634.0, ans=0.125 2023-06-19 05:47:41,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=395634.0, ans=0.04949747468305833 2023-06-19 05:48:02,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=395694.0, ans=0.2 2023-06-19 05:48:31,577 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.805e+02 3.354e+02 4.068e+02 9.306e+02, threshold=6.708e+02, percent-clipped=1.0 2023-06-19 05:49:06,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=395874.0, ans=10.0 2023-06-19 05:49:18,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=395874.0, ans=0.125 2023-06-19 05:49:27,320 INFO [train.py:996] (0/4) Epoch 3, batch 5000, loss[loss=0.2786, simple_loss=0.3423, pruned_loss=0.1074, over 21892.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3426, pruned_loss=0.104, over 4293517.66 frames. ], batch size: 316, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:49:27,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=395934.0, ans=0.125 2023-06-19 05:49:34,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=395934.0, ans=0.125 2023-06-19 05:49:36,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=395934.0, ans=0.0 2023-06-19 05:50:09,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=396054.0, ans=0.04949747468305833 2023-06-19 05:50:27,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=396114.0, ans=0.125 2023-06-19 05:50:32,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=396114.0, ans=0.0 2023-06-19 05:51:12,158 INFO [train.py:996] (0/4) Epoch 3, batch 5050, loss[loss=0.2646, simple_loss=0.3535, pruned_loss=0.08786, over 21632.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3442, pruned_loss=0.1063, over 4298810.44 frames. 
], batch size: 441, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:51:27,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=396294.0, ans=0.0 2023-06-19 05:51:41,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=396294.0, ans=0.05 2023-06-19 05:51:41,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=396294.0, ans=0.125 2023-06-19 05:51:56,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.428e+02 4.062e+02 4.972e+02 8.550e+02, threshold=8.125e+02, percent-clipped=7.0 2023-06-19 05:52:10,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=396414.0, ans=0.0 2023-06-19 05:52:17,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=396414.0, ans=0.09899494936611666 2023-06-19 05:52:22,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.33 vs. limit=12.0 2023-06-19 05:52:51,451 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:52:52,329 INFO [train.py:996] (0/4) Epoch 3, batch 5100, loss[loss=0.3021, simple_loss=0.3599, pruned_loss=0.1222, over 21721.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3444, pruned_loss=0.1074, over 4302014.93 frames. ], batch size: 389, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:53:08,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=396594.0, ans=0.0 2023-06-19 05:53:13,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=396594.0, ans=0.125 2023-06-19 05:53:53,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=396714.0, ans=0.125 2023-06-19 05:54:03,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=396714.0, ans=0.125 2023-06-19 05:54:06,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-19 05:54:07,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=396714.0, ans=0.2 2023-06-19 05:54:17,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=396774.0, ans=0.125 2023-06-19 05:54:22,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=396774.0, ans=0.0 2023-06-19 05:54:25,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=396774.0, ans=0.2 2023-06-19 05:54:37,017 INFO [train.py:996] (0/4) Epoch 3, batch 5150, loss[loss=0.3049, simple_loss=0.3571, pruned_loss=0.1263, over 21768.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3442, pruned_loss=0.1095, over 4302787.04 frames. 
], batch size: 441, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:54:46,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=396834.0, ans=0.2 2023-06-19 05:55:27,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.247e+02 3.957e+02 4.711e+02 9.896e+02, threshold=7.915e+02, percent-clipped=1.0 2023-06-19 05:55:36,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=396954.0, ans=0.2 2023-06-19 05:56:08,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=397074.0, ans=0.125 2023-06-19 05:56:23,673 INFO [train.py:996] (0/4) Epoch 3, batch 5200, loss[loss=0.2959, simple_loss=0.3722, pruned_loss=0.1098, over 21766.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.346, pruned_loss=0.1107, over 4300939.89 frames. ], batch size: 247, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:56:29,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=397134.0, ans=0.125 2023-06-19 05:56:29,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=397134.0, ans=0.0 2023-06-19 05:57:59,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=397374.0, ans=0.0 2023-06-19 05:58:09,835 INFO [train.py:996] (0/4) Epoch 3, batch 5250, loss[loss=0.2511, simple_loss=0.3169, pruned_loss=0.09268, over 21761.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3488, pruned_loss=0.1093, over 4297317.08 frames. ], batch size: 112, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:58:59,536 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 3.349e+02 3.883e+02 5.144e+02 8.715e+02, threshold=7.765e+02, percent-clipped=1.0 2023-06-19 05:59:33,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=397614.0, ans=0.025 2023-06-19 05:59:53,227 INFO [train.py:996] (0/4) Epoch 3, batch 5300, loss[loss=0.2398, simple_loss=0.3129, pruned_loss=0.08331, over 21592.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3487, pruned_loss=0.1094, over 4296755.94 frames. ], batch size: 263, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:00:48,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=397854.0, ans=0.035 2023-06-19 06:00:53,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=397914.0, ans=0.1 2023-06-19 06:00:54,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=397914.0, ans=0.125 2023-06-19 06:01:11,575 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-19 06:01:29,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=397974.0, ans=0.125 2023-06-19 06:01:33,554 INFO [train.py:996] (0/4) Epoch 3, batch 5350, loss[loss=0.3743, simple_loss=0.4929, pruned_loss=0.1278, over 19774.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3474, pruned_loss=0.1104, over 4300979.35 frames. 
], batch size: 702, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:01:47,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398034.0, ans=0.1 2023-06-19 06:02:23,879 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.150e+02 3.600e+02 4.539e+02 9.021e+02, threshold=7.200e+02, percent-clipped=2.0 2023-06-19 06:03:08,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=398274.0, ans=0.5 2023-06-19 06:03:18,198 INFO [train.py:996] (0/4) Epoch 3, batch 5400, loss[loss=0.3037, simple_loss=0.36, pruned_loss=0.1237, over 19985.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3486, pruned_loss=0.1127, over 4295702.12 frames. ], batch size: 702, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:03:29,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=398334.0, ans=0.2 2023-06-19 06:04:13,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=398454.0, ans=0.125 2023-06-19 06:04:21,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=398454.0, ans=0.04949747468305833 2023-06-19 06:04:48,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=398574.0, ans=0.125 2023-06-19 06:05:02,982 INFO [train.py:996] (0/4) Epoch 3, batch 5450, loss[loss=0.2412, simple_loss=0.3218, pruned_loss=0.08032, over 21384.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.348, pruned_loss=0.1104, over 4296117.88 frames. ], batch size: 194, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:05:06,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398634.0, ans=0.1 2023-06-19 06:06:00,099 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.869e+02 3.395e+02 4.566e+02 8.866e+02, threshold=6.789e+02, percent-clipped=3.0 2023-06-19 06:06:40,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=398874.0, ans=0.04949747468305833 2023-06-19 06:07:02,453 INFO [train.py:996] (0/4) Epoch 3, batch 5500, loss[loss=0.2517, simple_loss=0.355, pruned_loss=0.07422, over 20983.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3522, pruned_loss=0.1069, over 4294210.84 frames. ], batch size: 607, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:07:17,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=398934.0, ans=0.0 2023-06-19 06:08:01,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=399054.0, ans=0.125 2023-06-19 06:08:20,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=399114.0, ans=0.125 2023-06-19 06:08:40,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=399174.0, ans=15.0 2023-06-19 06:08:46,546 INFO [train.py:996] (0/4) Epoch 3, batch 5550, loss[loss=0.2525, simple_loss=0.3469, pruned_loss=0.079, over 21642.00 frames. 
], tot_loss[loss=0.2762, simple_loss=0.3478, pruned_loss=0.1023, over 4293675.34 frames. ], batch size: 414, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:09:01,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=15.0 2023-06-19 06:09:14,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=399294.0, ans=0.125 2023-06-19 06:09:16,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=399294.0, ans=0.125 2023-06-19 06:09:33,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-19 06:09:38,647 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.755e+02 3.259e+02 4.197e+02 7.319e+02, threshold=6.518e+02, percent-clipped=2.0 2023-06-19 06:10:07,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=399414.0, ans=0.05 2023-06-19 06:10:18,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=399474.0, ans=0.125 2023-06-19 06:10:34,002 INFO [train.py:996] (0/4) Epoch 3, batch 5600, loss[loss=0.4326, simple_loss=0.4851, pruned_loss=0.19, over 21442.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.349, pruned_loss=0.1023, over 4291822.22 frames. ], batch size: 507, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:11:06,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=399594.0, ans=0.125 2023-06-19 06:11:46,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=15.0 2023-06-19 06:11:52,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-19 06:11:56,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=399714.0, ans=0.2 2023-06-19 06:12:17,382 INFO [train.py:996] (0/4) Epoch 3, batch 5650, loss[loss=0.2882, simple_loss=0.338, pruned_loss=0.1192, over 21535.00 frames. ], tot_loss[loss=0.2811, simple_loss=0.3533, pruned_loss=0.1045, over 4290639.01 frames. ], batch size: 211, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:12:22,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.63 vs. 
limit=12.0 2023-06-19 06:12:49,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=399894.0, ans=0.0 2023-06-19 06:13:13,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 3.102e+02 3.958e+02 5.147e+02 8.863e+02, threshold=7.916e+02, percent-clipped=12.0 2023-06-19 06:13:34,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=400014.0, ans=0.1 2023-06-19 06:13:43,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=400014.0, ans=0.125 2023-06-19 06:13:45,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=400014.0, ans=0.2 2023-06-19 06:14:10,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=400074.0, ans=0.125 2023-06-19 06:14:16,683 INFO [train.py:996] (0/4) Epoch 3, batch 5700, loss[loss=0.2388, simple_loss=0.3211, pruned_loss=0.07824, over 21596.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3522, pruned_loss=0.1061, over 4290556.95 frames. ], batch size: 230, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:14:34,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=400194.0, ans=0.125 2023-06-19 06:14:47,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=400194.0, ans=0.125 2023-06-19 06:15:13,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=400254.0, ans=0.0 2023-06-19 06:15:27,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=400314.0, ans=0.0 2023-06-19 06:15:54,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=400374.0, ans=0.125 2023-06-19 06:16:03,939 INFO [train.py:996] (0/4) Epoch 3, batch 5750, loss[loss=0.253, simple_loss=0.3295, pruned_loss=0.08827, over 21024.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3461, pruned_loss=0.102, over 4278723.97 frames. ], batch size: 608, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:16:21,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=400434.0, ans=0.125 2023-06-19 06:16:54,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.901e+02 3.353e+02 4.192e+02 8.562e+02, threshold=6.706e+02, percent-clipped=1.0 2023-06-19 06:17:33,174 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0 2023-06-19 06:17:48,452 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-19 06:17:48,979 INFO [train.py:996] (0/4) Epoch 3, batch 5800, loss[loss=0.2521, simple_loss=0.33, pruned_loss=0.08704, over 21283.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3431, pruned_loss=0.09929, over 4279127.68 frames. 
], batch size: 176, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:18:25,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=400794.0, ans=0.125 2023-06-19 06:19:41,066 INFO [train.py:996] (0/4) Epoch 3, batch 5850, loss[loss=0.2767, simple_loss=0.3688, pruned_loss=0.09228, over 21470.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3403, pruned_loss=0.09422, over 4282945.09 frames. ], batch size: 471, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:20:18,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-19 06:20:23,161 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0 2023-06-19 06:20:32,175 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 2.443e+02 2.875e+02 3.533e+02 5.012e+02, threshold=5.751e+02, percent-clipped=0.0 2023-06-19 06:21:31,748 INFO [train.py:996] (0/4) Epoch 3, batch 5900, loss[loss=0.1997, simple_loss=0.2859, pruned_loss=0.05674, over 21713.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3312, pruned_loss=0.08702, over 4288223.26 frames. ], batch size: 298, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:21:32,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-06-19 06:21:35,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=401334.0, ans=0.125 2023-06-19 06:21:49,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=401394.0, ans=0.125 2023-06-19 06:22:09,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=401454.0, ans=0.1 2023-06-19 06:22:22,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=401454.0, ans=0.95 2023-06-19 06:23:13,894 INFO [train.py:996] (0/4) Epoch 3, batch 5950, loss[loss=0.2742, simple_loss=0.3185, pruned_loss=0.1149, over 20142.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3327, pruned_loss=0.0919, over 4285738.67 frames. ], batch size: 703, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:23:58,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 2.851e+02 3.351e+02 4.142e+02 6.067e+02, threshold=6.702e+02, percent-clipped=3.0 2023-06-19 06:24:28,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-19 06:25:00,270 INFO [train.py:996] (0/4) Epoch 3, batch 6000, loss[loss=0.2218, simple_loss=0.3324, pruned_loss=0.0556, over 21241.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3302, pruned_loss=0.09597, over 4264453.07 frames. ], batch size: 548, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:25:00,271 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 06:25:17,467 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2818, simple_loss=0.374, pruned_loss=0.0948, over 1796401.00 frames. 
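[annotation] The loss[..., over N frames] and tot_loss[..., over M frames] entries in these lines report per-frame averages of the overall loss together with its simple_loss and pruned_loss components, plus the number of frames each average covers. As a rough illustration only, and not the recipe's actual implementation, the Python sketch below shows one way such frame-weighted running averages could be maintained for logging. The FrameWeightedTracker class, the combined_loss blend (including the 0.5 weight), and the sample numbers reused from the batches logged above are all assumptions made for this example.

from collections import defaultdict

class FrameWeightedTracker:
    """Accumulate frame-summed losses and report per-frame averages."""

    def __init__(self):
        self.sums = defaultdict(float)   # loss name -> value summed over frames
        self.frames = 0.0                # total frames seen so far

    def update(self, losses, num_frames):
        # `losses` holds per-frame averages for one batch; weight them by the
        # batch's frame count so small and large batches contribute fairly.
        for name, value in losses.items():
            self.sums[name] += value * num_frames
        self.frames += num_frames

    def averages(self):
        return {name: s / self.frames for name, s in self.sums.items()}

def combined_loss(simple_loss, pruned_loss, simple_scale=0.5):
    # Assumed blend for illustration only; the recipe's own weighting and any
    # warm-up schedule are not reproduced here.
    return simple_scale * simple_loss + pruned_loss

if __name__ == "__main__":
    tot = FrameWeightedTracker()
    # Per-frame loss values lifted from two batches logged above, reused here
    # purely as sample inputs.
    for simple, pruned, frames in [(0.3423, 0.1074, 21892.0), (0.3535, 0.08786, 21632.0)]:
        batch = {"loss": combined_loss(simple, pruned),
                 "simple_loss": simple,
                 "pruned_loss": pruned}
        tot.update(batch, frames)
        print(f"loss[{batch}, over {frames:.2f} frames], "
              f"tot_loss[{tot.averages()}, over {tot.frames:.2f} frames]")

Running this prints lines loosely shaped like the loss[...]/tot_loss[...] entries in the surrounding log; the real training script also periodically resets its accumulators and scales the components according to its own schedule, which this sketch does not attempt to mirror. [end annotation]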
2023-06-19 06:25:17,468 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-19 06:25:23,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=401934.0, ans=0.2 2023-06-19 06:25:28,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=401934.0, ans=0.0 2023-06-19 06:25:50,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=401994.0, ans=0.125 2023-06-19 06:26:54,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402174.0, ans=0.1 2023-06-19 06:26:59,518 INFO [train.py:996] (0/4) Epoch 3, batch 6050, loss[loss=0.2945, simple_loss=0.3285, pruned_loss=0.1302, over 21240.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3272, pruned_loss=0.09828, over 4260787.86 frames. ], batch size: 608, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:27:26,547 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-19 06:27:49,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.918e+02 3.547e+02 4.372e+02 9.416e+02, threshold=7.093e+02, percent-clipped=6.0 2023-06-19 06:28:43,530 INFO [train.py:996] (0/4) Epoch 3, batch 6100, loss[loss=0.2422, simple_loss=0.3103, pruned_loss=0.08703, over 21765.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3259, pruned_loss=0.0971, over 4267808.62 frames. ], batch size: 247, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:29:46,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402654.0, ans=0.1 2023-06-19 06:30:01,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=402714.0, ans=0.025 2023-06-19 06:30:05,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=402714.0, ans=0.0 2023-06-19 06:30:30,053 INFO [train.py:996] (0/4) Epoch 3, batch 6150, loss[loss=0.293, simple_loss=0.3526, pruned_loss=0.1167, over 21645.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3302, pruned_loss=0.1015, over 4274309.00 frames. ], batch size: 414, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:31:08,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=402894.0, ans=0.1 2023-06-19 06:31:15,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=402954.0, ans=0.95 2023-06-19 06:31:21,150 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 3.076e+02 3.570e+02 4.379e+02 8.300e+02, threshold=7.140e+02, percent-clipped=3.0 2023-06-19 06:31:52,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=403014.0, ans=0.125 2023-06-19 06:32:10,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=403074.0, ans=0.04949747468305833 2023-06-19 06:32:15,747 INFO [train.py:996] (0/4) Epoch 3, batch 6200, loss[loss=0.3086, simple_loss=0.3573, pruned_loss=0.1299, over 21313.00 frames. 
], tot_loss[loss=0.2672, simple_loss=0.3327, pruned_loss=0.1008, over 4268531.90 frames. ], batch size: 159, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:33:29,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=403314.0, ans=0.125 2023-06-19 06:33:50,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-19 06:33:52,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=403374.0, ans=0.125 2023-06-19 06:34:05,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=403374.0, ans=0.125 2023-06-19 06:34:08,061 INFO [train.py:996] (0/4) Epoch 3, batch 6250, loss[loss=0.2609, simple_loss=0.355, pruned_loss=0.08341, over 21663.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3377, pruned_loss=0.1014, over 4263749.44 frames. ], batch size: 263, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:34:22,464 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.98 vs. limit=5.0 2023-06-19 06:34:36,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=403494.0, ans=0.05 2023-06-19 06:35:06,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-19 06:35:08,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 3.015e+02 3.767e+02 4.898e+02 1.129e+03, threshold=7.534e+02, percent-clipped=8.0 2023-06-19 06:35:14,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=403554.0, ans=0.125 2023-06-19 06:35:22,957 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:35:24,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=403614.0, ans=0.125 2023-06-19 06:35:40,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=403674.0, ans=0.0 2023-06-19 06:36:01,226 INFO [train.py:996] (0/4) Epoch 3, batch 6300, loss[loss=0.3121, simple_loss=0.3603, pruned_loss=0.132, over 21335.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3428, pruned_loss=0.1011, over 4266867.62 frames. ], batch size: 144, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:37:29,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=403974.0, ans=0.125 2023-06-19 06:37:41,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=403974.0, ans=0.1 2023-06-19 06:37:45,995 INFO [train.py:996] (0/4) Epoch 3, batch 6350, loss[loss=0.3196, simple_loss=0.3701, pruned_loss=0.1346, over 21601.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3486, pruned_loss=0.1082, over 4268036.23 frames. 
], batch size: 389, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:38:38,191 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.096e+02 3.646e+02 4.304e+02 8.936e+02, threshold=7.293e+02, percent-clipped=1.0 2023-06-19 06:38:44,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=404154.0, ans=0.2 2023-06-19 06:38:44,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.58 vs. limit=6.0 2023-06-19 06:38:52,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=404214.0, ans=0.07 2023-06-19 06:39:12,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=404214.0, ans=0.0 2023-06-19 06:39:31,466 INFO [train.py:996] (0/4) Epoch 3, batch 6400, loss[loss=0.3257, simple_loss=0.3882, pruned_loss=0.1315, over 21446.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3553, pruned_loss=0.1122, over 4268483.94 frames. ], batch size: 131, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:40:19,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=404454.0, ans=0.125 2023-06-19 06:40:54,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.72 vs. limit=10.0 2023-06-19 06:41:17,063 INFO [train.py:996] (0/4) Epoch 3, batch 6450, loss[loss=0.3399, simple_loss=0.4455, pruned_loss=0.1171, over 19750.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3586, pruned_loss=0.1124, over 4266075.03 frames. ], batch size: 702, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:41:55,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=404754.0, ans=0.2 2023-06-19 06:42:09,373 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.087e+02 4.205e+02 5.976e+02 1.329e+03, threshold=8.410e+02, percent-clipped=11.0 2023-06-19 06:42:46,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404874.0, ans=0.1 2023-06-19 06:43:02,070 INFO [train.py:996] (0/4) Epoch 3, batch 6500, loss[loss=0.2333, simple_loss=0.295, pruned_loss=0.08575, over 21731.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.35, pruned_loss=0.1095, over 4260597.42 frames. ], batch size: 124, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:43:12,105 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=22.5 2023-06-19 06:43:52,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=405054.0, ans=0.125 2023-06-19 06:44:10,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=405114.0, ans=0.125 2023-06-19 06:44:41,044 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=15.0 2023-06-19 06:44:54,424 INFO [train.py:996] (0/4) Epoch 3, batch 6550, loss[loss=0.2345, simple_loss=0.2932, pruned_loss=0.08795, over 21371.00 frames. 
], tot_loss[loss=0.2818, simple_loss=0.3476, pruned_loss=0.108, over 4260128.98 frames. ], batch size: 211, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:45:41,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 2.916e+02 3.526e+02 4.372e+02 9.339e+02, threshold=7.052e+02, percent-clipped=1.0 2023-06-19 06:46:03,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=405414.0, ans=0.04949747468305833 2023-06-19 06:46:38,356 INFO [train.py:996] (0/4) Epoch 3, batch 6600, loss[loss=0.2514, simple_loss=0.3054, pruned_loss=0.09866, over 21122.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3406, pruned_loss=0.1072, over 4257604.10 frames. ], batch size: 143, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:46:44,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=405534.0, ans=15.0 2023-06-19 06:47:07,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=405594.0, ans=0.1 2023-06-19 06:47:31,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=405654.0, ans=0.125 2023-06-19 06:48:18,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=405774.0, ans=0.125 2023-06-19 06:48:23,549 INFO [train.py:996] (0/4) Epoch 3, batch 6650, loss[loss=0.2941, simple_loss=0.3553, pruned_loss=0.1165, over 21499.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3328, pruned_loss=0.1036, over 4264461.84 frames. ], batch size: 548, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:48:47,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-19 06:49:05,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=405954.0, ans=0.125 2023-06-19 06:49:17,961 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.970e+02 3.451e+02 4.323e+02 7.420e+02, threshold=6.902e+02, percent-clipped=1.0 2023-06-19 06:49:28,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=406014.0, ans=0.1 2023-06-19 06:49:36,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=406014.0, ans=0.2 2023-06-19 06:50:07,767 INFO [train.py:996] (0/4) Epoch 3, batch 6700, loss[loss=0.226, simple_loss=0.2849, pruned_loss=0.08349, over 21493.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3264, pruned_loss=0.1026, over 4249461.57 frames. ], batch size: 230, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:50:20,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=406134.0, ans=0.125 2023-06-19 06:51:52,475 INFO [train.py:996] (0/4) Epoch 3, batch 6750, loss[loss=0.3438, simple_loss=0.3768, pruned_loss=0.1554, over 21752.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3255, pruned_loss=0.1038, over 4242432.78 frames. 
], batch size: 441, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:52:40,107 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.875e+02 3.278e+02 4.228e+02 8.254e+02, threshold=6.556e+02, percent-clipped=2.0 2023-06-19 06:52:48,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.64 vs. limit=6.0 2023-06-19 06:52:49,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=406554.0, ans=0.0 2023-06-19 06:52:51,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=406614.0, ans=0.0 2023-06-19 06:53:09,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.47 vs. limit=15.0 2023-06-19 06:53:34,997 INFO [train.py:996] (0/4) Epoch 3, batch 6800, loss[loss=0.2574, simple_loss=0.3037, pruned_loss=0.1055, over 21467.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3275, pruned_loss=0.1058, over 4239475.39 frames. ], batch size: 212, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:53:57,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=406794.0, ans=0.0 2023-06-19 06:54:26,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=406854.0, ans=0.125 2023-06-19 06:54:45,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=406914.0, ans=0.125 2023-06-19 06:55:19,158 INFO [train.py:996] (0/4) Epoch 3, batch 6850, loss[loss=0.277, simple_loss=0.3386, pruned_loss=0.1077, over 21873.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3266, pruned_loss=0.1071, over 4250147.50 frames. ], batch size: 107, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:55:24,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=407034.0, ans=0.0 2023-06-19 06:55:29,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=407034.0, ans=0.125 2023-06-19 06:56:08,183 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.177e+02 3.620e+02 4.749e+02 9.271e+02, threshold=7.240e+02, percent-clipped=3.0 2023-06-19 06:57:00,806 INFO [train.py:996] (0/4) Epoch 3, batch 6900, loss[loss=0.2914, simple_loss=0.4076, pruned_loss=0.08762, over 19734.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3298, pruned_loss=0.1079, over 4260647.05 frames. ], batch size: 702, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:57:03,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=407334.0, ans=0.125 2023-06-19 06:57:08,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.93 vs. 
limit=15.0 2023-06-19 06:57:38,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407454.0, ans=0.1 2023-06-19 06:58:07,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=407514.0, ans=0.0 2023-06-19 06:58:35,508 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-19 06:58:40,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=407574.0, ans=0.07 2023-06-19 06:58:42,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=407574.0, ans=0.125 2023-06-19 06:58:46,753 INFO [train.py:996] (0/4) Epoch 3, batch 6950, loss[loss=0.2799, simple_loss=0.3419, pruned_loss=0.1089, over 21903.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3294, pruned_loss=0.1033, over 4265790.56 frames. ], batch size: 316, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 06:59:11,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-19 06:59:42,602 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.993e+02 3.659e+02 4.526e+02 7.412e+02, threshold=7.319e+02, percent-clipped=1.0 2023-06-19 06:59:48,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=407814.0, ans=0.1 2023-06-19 06:59:56,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=407814.0, ans=0.125 2023-06-19 07:00:32,101 INFO [train.py:996] (0/4) Epoch 3, batch 7000, loss[loss=0.2757, simple_loss=0.3264, pruned_loss=0.1125, over 21809.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3328, pruned_loss=0.1057, over 4267121.79 frames. ], batch size: 118, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:00:54,329 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-68000.pt 2023-06-19 07:02:19,518 INFO [train.py:996] (0/4) Epoch 3, batch 7050, loss[loss=0.2607, simple_loss=0.322, pruned_loss=0.09972, over 15427.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3326, pruned_loss=0.1053, over 4258718.43 frames. ], batch size: 60, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:03:19,720 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.081e+02 3.762e+02 4.566e+02 1.137e+03, threshold=7.524e+02, percent-clipped=2.0 2023-06-19 07:03:46,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=408474.0, ans=0.1 2023-06-19 07:03:48,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=408474.0, ans=0.125 2023-06-19 07:03:52,697 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.18 vs. 
limit=15.0 2023-06-19 07:03:56,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=408474.0, ans=0.0 2023-06-19 07:04:00,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=408474.0, ans=0.125 2023-06-19 07:04:10,050 INFO [train.py:996] (0/4) Epoch 3, batch 7100, loss[loss=0.2679, simple_loss=0.3509, pruned_loss=0.09247, over 21661.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3356, pruned_loss=0.1061, over 4243720.24 frames. ], batch size: 441, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:05:07,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=408654.0, ans=0.1 2023-06-19 07:05:55,538 INFO [train.py:996] (0/4) Epoch 3, batch 7150, loss[loss=0.2679, simple_loss=0.335, pruned_loss=0.1004, over 21901.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3331, pruned_loss=0.1034, over 4245338.78 frames. ], batch size: 372, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:06:00,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=408834.0, ans=0.125 2023-06-19 07:06:02,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-19 07:06:12,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=408834.0, ans=0.125 2023-06-19 07:06:23,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=408894.0, ans=0.125 2023-06-19 07:06:30,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=408894.0, ans=0.1 2023-06-19 07:06:57,334 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 3.041e+02 3.413e+02 3.887e+02 5.883e+02, threshold=6.826e+02, percent-clipped=0.0 2023-06-19 07:06:59,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=408954.0, ans=0.1 2023-06-19 07:07:40,874 INFO [train.py:996] (0/4) Epoch 3, batch 7200, loss[loss=0.2884, simple_loss=0.3359, pruned_loss=0.1204, over 21862.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3367, pruned_loss=0.1063, over 4244511.80 frames. ], batch size: 373, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:08:10,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=409194.0, ans=0.0 2023-06-19 07:08:12,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.19 vs. 
limit=6.0 2023-06-19 07:08:19,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=409194.0, ans=0.125 2023-06-19 07:08:32,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=409254.0, ans=0.125 2023-06-19 07:09:20,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=409374.0, ans=0.0 2023-06-19 07:09:32,329 INFO [train.py:996] (0/4) Epoch 3, batch 7250, loss[loss=0.2971, simple_loss=0.3278, pruned_loss=0.1332, over 21391.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3323, pruned_loss=0.1057, over 4253832.64 frames. ], batch size: 509, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:09:46,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=409434.0, ans=0.2 2023-06-19 07:10:24,995 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.049e+02 3.903e+02 5.201e+02 1.242e+03, threshold=7.806e+02, percent-clipped=6.0 2023-06-19 07:10:36,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.76 vs. limit=22.5 2023-06-19 07:11:11,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=409674.0, ans=0.2 2023-06-19 07:11:13,535 INFO [train.py:996] (0/4) Epoch 3, batch 7300, loss[loss=0.2168, simple_loss=0.2746, pruned_loss=0.07947, over 21548.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.327, pruned_loss=0.1047, over 4255531.88 frames. ], batch size: 263, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:11:57,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=409854.0, ans=0.2 2023-06-19 07:12:59,469 INFO [train.py:996] (0/4) Epoch 3, batch 7350, loss[loss=0.3206, simple_loss=0.3622, pruned_loss=0.1395, over 21659.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3242, pruned_loss=0.1056, over 4256093.11 frames. ], batch size: 332, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:13:59,123 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.138e+02 3.671e+02 4.789e+02 1.075e+03, threshold=7.343e+02, percent-clipped=3.0 2023-06-19 07:14:55,732 INFO [train.py:996] (0/4) Epoch 3, batch 7400, loss[loss=0.3166, simple_loss=0.3933, pruned_loss=0.12, over 21555.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3314, pruned_loss=0.1079, over 4258469.22 frames. ], batch size: 473, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:15:51,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=410514.0, ans=0.2 2023-06-19 07:16:26,551 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:16:48,387 INFO [train.py:996] (0/4) Epoch 3, batch 7450, loss[loss=0.2526, simple_loss=0.3016, pruned_loss=0.1018, over 21416.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3299, pruned_loss=0.106, over 4265861.96 frames. 
], batch size: 194, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:16:49,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=410634.0, ans=0.0 2023-06-19 07:17:02,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=410634.0, ans=0.125 2023-06-19 07:17:41,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.990e+02 3.597e+02 4.537e+02 7.540e+02, threshold=7.195e+02, percent-clipped=1.0 2023-06-19 07:18:27,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=410874.0, ans=0.2 2023-06-19 07:18:35,622 INFO [train.py:996] (0/4) Epoch 3, batch 7500, loss[loss=0.3099, simple_loss=0.374, pruned_loss=0.1229, over 21857.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3355, pruned_loss=0.1089, over 4269025.06 frames. ], batch size: 118, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:18:46,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=410934.0, ans=0.125 2023-06-19 07:18:46,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=410934.0, ans=0.0 2023-06-19 07:18:58,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=410994.0, ans=0.2 2023-06-19 07:19:07,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=410994.0, ans=0.125 2023-06-19 07:20:04,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0 2023-06-19 07:20:05,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=411174.0, ans=0.0 2023-06-19 07:20:24,688 INFO [train.py:996] (0/4) Epoch 3, batch 7550, loss[loss=0.2359, simple_loss=0.3121, pruned_loss=0.07981, over 21218.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3432, pruned_loss=0.1073, over 4270826.04 frames. ], batch size: 143, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:20:35,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=411234.0, ans=0.125 2023-06-19 07:20:41,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=411294.0, ans=0.0 2023-06-19 07:20:56,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=411294.0, ans=0.0 2023-06-19 07:21:22,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 3.181e+02 3.735e+02 4.565e+02 8.412e+02, threshold=7.470e+02, percent-clipped=4.0 2023-06-19 07:21:29,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=411414.0, ans=0.125 2023-06-19 07:22:09,845 INFO [train.py:996] (0/4) Epoch 3, batch 7600, loss[loss=0.2671, simple_loss=0.3319, pruned_loss=0.1011, over 21899.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.341, pruned_loss=0.1054, over 4270645.46 frames. 
], batch size: 351, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:22:13,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=411534.0, ans=0.125 2023-06-19 07:22:52,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=411654.0, ans=0.1 2023-06-19 07:23:55,469 INFO [train.py:996] (0/4) Epoch 3, batch 7650, loss[loss=0.2594, simple_loss=0.3171, pruned_loss=0.1009, over 21824.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3409, pruned_loss=0.107, over 4277209.37 frames. ], batch size: 282, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:24:11,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=411894.0, ans=0.125 2023-06-19 07:24:18,355 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-19 07:24:34,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=411954.0, ans=0.0 2023-06-19 07:24:54,786 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 2.811e+02 3.352e+02 3.855e+02 5.541e+02, threshold=6.704e+02, percent-clipped=0.0 2023-06-19 07:24:56,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=412014.0, ans=0.0 2023-06-19 07:25:32,137 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=22.5 2023-06-19 07:25:44,621 INFO [train.py:996] (0/4) Epoch 3, batch 7700, loss[loss=0.3291, simple_loss=0.3819, pruned_loss=0.1382, over 21467.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3451, pruned_loss=0.111, over 4282251.59 frames. ], batch size: 194, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:25:55,954 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.23 vs. limit=6.0 2023-06-19 07:26:16,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=412194.0, ans=0.125 2023-06-19 07:26:23,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=412194.0, ans=0.125 2023-06-19 07:26:36,942 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-19 07:26:40,525 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.61 vs. limit=15.0 2023-06-19 07:26:41,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=412254.0, ans=0.125 2023-06-19 07:27:12,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=412314.0, ans=0.0 2023-06-19 07:27:12,627 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. 
limit=15.0 2023-06-19 07:27:18,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=412374.0, ans=0.0 2023-06-19 07:27:22,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=412374.0, ans=0.125 2023-06-19 07:27:24,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=412374.0, ans=0.125 2023-06-19 07:27:32,036 INFO [train.py:996] (0/4) Epoch 3, batch 7750, loss[loss=0.2365, simple_loss=0.2822, pruned_loss=0.09536, over 20795.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3497, pruned_loss=0.1112, over 4284927.91 frames. ], batch size: 609, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:28:46,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 3.582e+02 4.541e+02 5.903e+02 1.038e+03, threshold=9.082e+02, percent-clipped=9.0 2023-06-19 07:29:07,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=412674.0, ans=0.125 2023-06-19 07:29:24,321 INFO [train.py:996] (0/4) Epoch 3, batch 7800, loss[loss=0.2305, simple_loss=0.2711, pruned_loss=0.09495, over 21277.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3522, pruned_loss=0.1121, over 4279634.03 frames. ], batch size: 143, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:29:34,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=412734.0, ans=0.125 2023-06-19 07:31:07,827 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:31:13,610 INFO [train.py:996] (0/4) Epoch 3, batch 7850, loss[loss=0.2598, simple_loss=0.3087, pruned_loss=0.1054, over 21584.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3478, pruned_loss=0.1113, over 4285813.46 frames. ], batch size: 415, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:31:39,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=413034.0, ans=0.125 2023-06-19 07:32:23,196 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 3.181e+02 3.685e+02 4.397e+02 7.326e+02, threshold=7.370e+02, percent-clipped=0.0 2023-06-19 07:32:41,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=413214.0, ans=0.125 2023-06-19 07:32:43,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=413274.0, ans=0.1 2023-06-19 07:32:49,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=413274.0, ans=0.0 2023-06-19 07:33:08,240 INFO [train.py:996] (0/4) Epoch 3, batch 7900, loss[loss=0.4429, simple_loss=0.4947, pruned_loss=0.1955, over 21424.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.342, pruned_loss=0.1102, over 4275602.36 frames. 
], batch size: 507, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:33:08,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=413334.0, ans=0.125 2023-06-19 07:34:42,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=413574.0, ans=0.025 2023-06-19 07:34:56,196 INFO [train.py:996] (0/4) Epoch 3, batch 7950, loss[loss=0.2718, simple_loss=0.3686, pruned_loss=0.08747, over 21734.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3468, pruned_loss=0.1088, over 4269069.56 frames. ], batch size: 351, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:34:56,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=413634.0, ans=0.125 2023-06-19 07:35:07,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-19 07:35:56,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.292e+02 2.859e+02 3.738e+02 4.773e+02 1.037e+03, threshold=7.477e+02, percent-clipped=3.0 2023-06-19 07:36:06,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413814.0, ans=0.1 2023-06-19 07:36:28,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=413874.0, ans=0.0 2023-06-19 07:36:38,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=413874.0, ans=0.125 2023-06-19 07:36:44,342 INFO [train.py:996] (0/4) Epoch 3, batch 8000, loss[loss=0.3264, simple_loss=0.3954, pruned_loss=0.1287, over 21609.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3539, pruned_loss=0.1136, over 4269289.36 frames. ], batch size: 389, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:36:44,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=413934.0, ans=0.0 2023-06-19 07:37:11,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=413934.0, ans=0.125 2023-06-19 07:37:27,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=413994.0, ans=0.0 2023-06-19 07:38:13,994 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.38 vs. limit=22.5 2023-06-19 07:38:16,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=414114.0, ans=0.0 2023-06-19 07:38:28,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=414174.0, ans=0.2 2023-06-19 07:38:45,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=414234.0, ans=0.1 2023-06-19 07:38:46,666 INFO [train.py:996] (0/4) Epoch 3, batch 8050, loss[loss=0.3172, simple_loss=0.4132, pruned_loss=0.1106, over 21263.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3573, pruned_loss=0.1129, over 4262469.87 frames. 
], batch size: 549, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:39:06,700 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:39:47,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.438e+02 3.985e+02 5.129e+02 7.856e+02, threshold=7.969e+02, percent-clipped=2.0 2023-06-19 07:40:14,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.47 vs. limit=15.0 2023-06-19 07:40:35,189 INFO [train.py:996] (0/4) Epoch 3, batch 8100, loss[loss=0.2631, simple_loss=0.3188, pruned_loss=0.1037, over 21516.00 frames. ], tot_loss[loss=0.2909, simple_loss=0.3545, pruned_loss=0.1137, over 4273528.44 frames. ], batch size: 177, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:40:43,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=12.0 2023-06-19 07:41:20,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=414654.0, ans=0.125 2023-06-19 07:41:32,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=414654.0, ans=0.125 2023-06-19 07:42:24,535 INFO [train.py:996] (0/4) Epoch 3, batch 8150, loss[loss=0.2587, simple_loss=0.3441, pruned_loss=0.08667, over 21645.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.3612, pruned_loss=0.114, over 4272529.79 frames. ], batch size: 247, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:43:38,486 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.875e+02 3.410e+02 4.043e+02 9.100e+02, threshold=6.821e+02, percent-clipped=2.0 2023-06-19 07:43:44,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=415014.0, ans=0.125 2023-06-19 07:44:13,640 INFO [train.py:996] (0/4) Epoch 3, batch 8200, loss[loss=0.2589, simple_loss=0.3131, pruned_loss=0.1023, over 21802.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3523, pruned_loss=0.1118, over 4272243.85 frames. ], batch size: 352, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:44:48,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=415194.0, ans=0.2 2023-06-19 07:44:49,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=415194.0, ans=0.0 2023-06-19 07:45:08,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=415254.0, ans=0.95 2023-06-19 07:45:12,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415254.0, ans=0.1 2023-06-19 07:45:38,014 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.07 vs. limit=12.0 2023-06-19 07:45:58,202 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-19 07:45:58,728 INFO [train.py:996] (0/4) Epoch 3, batch 8250, loss[loss=0.2269, simple_loss=0.2981, pruned_loss=0.07782, over 21840.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3483, pruned_loss=0.1114, over 4269851.40 frames. 
], batch size: 102, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:46:08,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=415434.0, ans=0.025 2023-06-19 07:46:32,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-06-19 07:47:12,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.026e+02 3.791e+02 5.480e+02 8.265e+02, threshold=7.583e+02, percent-clipped=10.0 2023-06-19 07:47:17,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=415614.0, ans=0.2 2023-06-19 07:47:52,039 INFO [train.py:996] (0/4) Epoch 3, batch 8300, loss[loss=0.3611, simple_loss=0.4139, pruned_loss=0.1541, over 21477.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3466, pruned_loss=0.108, over 4271012.16 frames. ], batch size: 508, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:47:54,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=415734.0, ans=0.125 2023-06-19 07:47:55,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415734.0, ans=0.1 2023-06-19 07:48:20,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=415794.0, ans=0.125 2023-06-19 07:49:06,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415914.0, ans=0.1 2023-06-19 07:49:16,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=415974.0, ans=0.0 2023-06-19 07:49:38,479 INFO [train.py:996] (0/4) Epoch 3, batch 8350, loss[loss=0.2553, simple_loss=0.3249, pruned_loss=0.09282, over 21347.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3441, pruned_loss=0.1044, over 4266346.59 frames. ], batch size: 160, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:49:51,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=416034.0, ans=0.1 2023-06-19 07:49:53,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=416034.0, ans=0.125 2023-06-19 07:50:46,341 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.962e+02 3.800e+02 4.795e+02 8.641e+02, threshold=7.601e+02, percent-clipped=3.0 2023-06-19 07:50:55,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=416214.0, ans=0.2 2023-06-19 07:51:23,703 INFO [train.py:996] (0/4) Epoch 3, batch 8400, loss[loss=0.267, simple_loss=0.3454, pruned_loss=0.09432, over 21739.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3432, pruned_loss=0.103, over 4261136.14 frames. 
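[Editor's note] The `grad_scale` value in the progress lines moves between 16.0 and 32.0; that is the dynamic loss scale of mixed-precision training, which is halved on gradient overflow and doubled back after a run of good steps. A generic sketch using the standard `torch.cuda.amp.GradScaler` API is below; the model, optimizer, and data are placeholders, not the training script itself.

```python
# Generic fp16 training step with a dynamic loss scale, illustrating why the
# logged grad_scale moves between values such as 16.0 and 32.0.
import torch

model = torch.nn.Linear(80, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # enabled for half-precision training

for step in range(100):
    features = torch.randn(8, 80, device="cuda")
    targets = torch.randn(8, 512, device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(features), targets)

    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales grads, skips the step on overflow
    scaler.update()                 # adjusts the scale (halve / double)

    if step % 50 == 0:
        print(f"step {step}, grad_scale: {scaler.get_scale()}")
```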
], batch size: 351, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 07:51:39,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=416334.0, ans=0.04949747468305833 2023-06-19 07:51:49,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=416394.0, ans=0.0 2023-06-19 07:52:00,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=416454.0, ans=0.0 2023-06-19 07:52:44,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=416574.0, ans=0.125 2023-06-19 07:52:51,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=416574.0, ans=0.0 2023-06-19 07:53:07,349 INFO [train.py:996] (0/4) Epoch 3, batch 8450, loss[loss=0.3089, simple_loss=0.3714, pruned_loss=0.1232, over 16917.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.34, pruned_loss=0.1024, over 4264166.42 frames. ], batch size: 60, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 07:53:34,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=416694.0, ans=0.2 2023-06-19 07:53:53,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=416754.0, ans=0.125 2023-06-19 07:54:15,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.539e+02 3.236e+02 3.912e+02 6.365e+02, threshold=6.471e+02, percent-clipped=0.0 2023-06-19 07:54:17,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=416814.0, ans=0.0 2023-06-19 07:54:19,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2023-06-19 07:54:34,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=416874.0, ans=0.0 2023-06-19 07:54:59,968 INFO [train.py:996] (0/4) Epoch 3, batch 8500, loss[loss=0.3619, simple_loss=0.4634, pruned_loss=0.1302, over 20753.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3379, pruned_loss=0.1047, over 4251525.95 frames. ], batch size: 607, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 07:55:29,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=416994.0, ans=0.05 2023-06-19 07:55:57,988 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-06-19 07:55:59,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=417054.0, ans=0.04949747468305833 2023-06-19 07:56:14,575 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:56:44,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-19 07:56:48,448 INFO [train.py:996] (0/4) Epoch 3, batch 8550, loss[loss=0.2448, simple_loss=0.3193, pruned_loss=0.08512, over 21190.00 frames. 
], tot_loss[loss=0.2786, simple_loss=0.3419, pruned_loss=0.1076, over 4257487.20 frames. ], batch size: 176, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:57:04,097 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:57:47,734 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.294e+02 4.189e+02 5.052e+02 1.014e+03, threshold=8.378e+02, percent-clipped=9.0 2023-06-19 07:57:48,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=417414.0, ans=0.035 2023-06-19 07:58:25,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=417474.0, ans=0.125 2023-06-19 07:58:32,126 INFO [train.py:996] (0/4) Epoch 3, batch 8600, loss[loss=0.3865, simple_loss=0.4393, pruned_loss=0.1668, over 21491.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3515, pruned_loss=0.1113, over 4258466.06 frames. ], batch size: 508, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:59:13,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=417594.0, ans=0.125 2023-06-19 08:00:20,149 INFO [train.py:996] (0/4) Epoch 3, batch 8650, loss[loss=0.1932, simple_loss=0.2881, pruned_loss=0.04912, over 21758.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3567, pruned_loss=0.1121, over 4266780.69 frames. ], batch size: 282, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 08:00:53,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=417894.0, ans=0.1 2023-06-19 08:01:15,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=417954.0, ans=0.0 2023-06-19 08:01:28,298 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.885e+02 3.395e+02 4.112e+02 7.467e+02, threshold=6.789e+02, percent-clipped=0.0 2023-06-19 08:01:52,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=418074.0, ans=0.2 2023-06-19 08:01:52,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-19 08:02:05,245 INFO [train.py:996] (0/4) Epoch 3, batch 8700, loss[loss=0.2263, simple_loss=0.3048, pruned_loss=0.07392, over 21227.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3472, pruned_loss=0.1083, over 4271389.57 frames. ], batch size: 548, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 08:03:57,807 INFO [train.py:996] (0/4) Epoch 3, batch 8750, loss[loss=0.2543, simple_loss=0.3124, pruned_loss=0.09809, over 21983.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3426, pruned_loss=0.1089, over 4275718.32 frames. 
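[Editor's note] The `scaling.py:962` Whitening lines compare a per-module "metric" against a limit (e.g. `metric=10.06 vs. limit=15.0`). One common way to quantify how far activations are from being white is the eigenvalue spread of their covariance; the sketch below uses the ratio of the mean squared eigenvalue to the squared mean eigenvalue, which is 1.0 for a perfectly white (isotropic) covariance and grows as the spectrum becomes uneven. This is an assumed, illustrative proxy, not necessarily the exact statistic the Whiten module computes.

```python
# Illustrative "whiteness" metric: covariance of a batch of activations,
# compared via mean(eigenvalue^2) / mean(eigenvalue)^2 (>= 1, == 1 if white).
import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (num_frames, num_channels); channels split into contiguous groups.
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)

    metrics = []
    for g in range(num_groups):
        xg = x[:, g, :]                    # (num_frames, channels_per_group)
        cov = xg.t() @ xg / num_frames     # (C, C) covariance
        dim = cov.shape[0]
        # trace(cov @ cov) == sum(eig^2); trace(cov) == sum(eig)
        mean_sq_eig = (cov * cov).sum() / dim
        sq_mean_eig = (cov.diagonal().sum() / dim) ** 2
        metrics.append((mean_sq_eig / (sq_mean_eig + 1e-20)).item())
    return sum(metrics) / len(metrics)


activations = torch.randn(1000, 256)
print(f"metric={whitening_metric(activations):.2f} vs. limit=15.0")
```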
], batch size: 103, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 08:05:06,661 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 3.018e+02 3.630e+02 4.545e+02 8.299e+02, threshold=7.260e+02, percent-clipped=2.0 2023-06-19 08:05:19,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=418614.0, ans=0.125 2023-06-19 08:05:35,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=418674.0, ans=0.125 2023-06-19 08:05:37,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=418674.0, ans=0.125 2023-06-19 08:05:44,861 INFO [train.py:996] (0/4) Epoch 3, batch 8800, loss[loss=0.2979, simple_loss=0.3693, pruned_loss=0.1133, over 21692.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.3504, pruned_loss=0.1112, over 4279263.09 frames. ], batch size: 298, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:06:13,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=418794.0, ans=0.0 2023-06-19 08:06:52,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=418854.0, ans=0.0 2023-06-19 08:06:52,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=418854.0, ans=0.125 2023-06-19 08:06:57,590 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=22.5 2023-06-19 08:07:12,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=418974.0, ans=0.125 2023-06-19 08:07:29,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=418974.0, ans=0.125 2023-06-19 08:07:42,028 INFO [train.py:996] (0/4) Epoch 3, batch 8850, loss[loss=0.2718, simple_loss=0.3425, pruned_loss=0.1005, over 21596.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3586, pruned_loss=0.1134, over 4268723.58 frames. ], batch size: 414, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:08:35,338 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=15.0 2023-06-19 08:08:39,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=419154.0, ans=0.125 2023-06-19 08:08:40,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=22.5 2023-06-19 08:08:46,172 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.554e+02 4.291e+02 5.667e+02 9.091e+02, threshold=8.581e+02, percent-clipped=5.0 2023-06-19 08:08:51,196 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. 
limit=6.0 2023-06-19 08:09:07,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=419274.0, ans=0.125 2023-06-19 08:09:13,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-19 08:09:25,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=419274.0, ans=0.0 2023-06-19 08:09:29,176 INFO [train.py:996] (0/4) Epoch 3, batch 8900, loss[loss=0.2966, simple_loss=0.344, pruned_loss=0.1246, over 21781.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.351, pruned_loss=0.111, over 4267934.53 frames. ], batch size: 371, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:09:34,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=419334.0, ans=0.1 2023-06-19 08:09:37,299 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-06-19 08:09:38,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=419334.0, ans=0.0 2023-06-19 08:09:44,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=419394.0, ans=0.125 2023-06-19 08:10:16,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=419454.0, ans=0.125 2023-06-19 08:11:10,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=12.0 2023-06-19 08:11:18,145 INFO [train.py:996] (0/4) Epoch 3, batch 8950, loss[loss=0.3402, simple_loss=0.424, pruned_loss=0.1282, over 21256.00 frames. ], tot_loss[loss=0.2867, simple_loss=0.3527, pruned_loss=0.1104, over 4268439.30 frames. ], batch size: 549, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:12:12,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=419754.0, ans=0.125 2023-06-19 08:12:27,670 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 3.236e+02 4.208e+02 5.168e+02 9.134e+02, threshold=8.417e+02, percent-clipped=1.0 2023-06-19 08:12:40,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2023-06-19 08:13:05,009 INFO [train.py:996] (0/4) Epoch 3, batch 9000, loss[loss=0.3437, simple_loss=0.4125, pruned_loss=0.1375, over 20722.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3449, pruned_loss=0.1095, over 4265497.97 frames. ], batch size: 608, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:13:05,010 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 08:13:24,324 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2787, simple_loss=0.3793, pruned_loss=0.08906, over 1796401.00 frames. 
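[Editor's note] At `train.py:1019`/`1028`/`1029` above, training pauses to compute a validation loss over the dev set and then reports the peak CUDA memory seen so far. A schematic version of that step is sketched below; `model`, `valid_dl`, and `compute_loss` are placeholders for whatever the training script actually uses, while the `torch` calls (`eval`, `no_grad`, `max_memory_allocated`) are standard PyTorch APIs.

```python
# Sketch of a periodic validation pass plus peak-memory reporting.
import logging

import torch


def compute_validation_loss(model, valid_dl, compute_loss, device) -> float:
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch, device)
            tot_loss += loss.item()   # loss summed over the batch
            tot_frames += num_frames
    model.train()
    return tot_loss / max(tot_frames, 1.0)


def log_peak_memory(device: torch.device) -> None:
    mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    logging.info(f"Maximum memory allocated so far is {mb}MB")
```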
2023-06-19 08:13:24,325 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-19 08:13:40,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=419934.0, ans=0.2 2023-06-19 08:14:45,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=420114.0, ans=0.125 2023-06-19 08:15:19,210 INFO [train.py:996] (0/4) Epoch 3, batch 9050, loss[loss=0.2394, simple_loss=0.3164, pruned_loss=0.08117, over 21700.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3388, pruned_loss=0.1054, over 4261670.17 frames. ], batch size: 298, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:15:45,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=420294.0, ans=0.125 2023-06-19 08:16:06,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=420354.0, ans=0.125 2023-06-19 08:16:11,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=420354.0, ans=0.1 2023-06-19 08:16:18,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=420354.0, ans=0.125 2023-06-19 08:16:20,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=420414.0, ans=0.0 2023-06-19 08:16:23,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 3.182e+02 3.849e+02 4.740e+02 7.257e+02, threshold=7.697e+02, percent-clipped=0.0 2023-06-19 08:17:03,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=420474.0, ans=0.2 2023-06-19 08:17:05,691 INFO [train.py:996] (0/4) Epoch 3, batch 9100, loss[loss=0.2644, simple_loss=0.3469, pruned_loss=0.09093, over 21713.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3453, pruned_loss=0.1077, over 4259055.47 frames. ], batch size: 247, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:17:55,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=420654.0, ans=0.125 2023-06-19 08:18:52,705 INFO [train.py:996] (0/4) Epoch 3, batch 9150, loss[loss=0.2553, simple_loss=0.3277, pruned_loss=0.09147, over 21736.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3499, pruned_loss=0.1066, over 4257680.37 frames. 
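[Editor's note] Each `tot_loss[...]` entry above is a running summary "over N frames": the per-batch loss components (loss, simple_loss, pruned_loss) are averaged with the number of frames as the weight, optionally with a decay so older batches fade out. A small sketch of that bookkeeping follows; the class and key names are illustrative, in the spirit of a metrics tracker rather than the script's exact code.

```python
# Frames-weighted running average of loss components, mimicking the
# "tot_loss[loss=..., simple_loss=..., pruned_loss=..., over N frames.]" lines.
from collections import defaultdict


class LossTracker:
    def __init__(self, decay: float = 1.0):
        # decay < 1.0 turns this into an exponentially weighted average.
        self.decay = decay
        self.sums = defaultdict(float)   # per-key: sum of (loss * frames)
        self.frames = 0.0

    def update(self, losses: dict, num_frames: float) -> None:
        for k in self.sums:
            self.sums[k] *= self.decay
        self.frames = self.frames * self.decay + num_frames
        for k, v in losses.items():
            self.sums[k] += v * num_frames   # v is a per-frame average

    def __str__(self) -> str:
        avg = {k: v / max(self.frames, 1.0) for k, v in self.sums.items()}
        body = ", ".join(f"{k}={v:.4g}" for k, v in avg.items())
        return f"tot_loss[{body}, over {self.frames:.2f} frames.]"


tracker = LossTracker(decay=0.999)
tracker.update({"loss": 0.281, "simple_loss": 0.349, "pruned_loss": 0.109}, 21700)
print(tracker)
```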
], batch size: 112, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:19:03,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=420834.0, ans=0.125 2023-06-19 08:19:16,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=420894.0, ans=0.05 2023-06-19 08:19:20,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=420894.0, ans=0.125 2023-06-19 08:19:59,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=421014.0, ans=0.125 2023-06-19 08:20:00,333 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 2.911e+02 3.357e+02 4.018e+02 6.144e+02, threshold=6.715e+02, percent-clipped=0.0 2023-06-19 08:20:45,106 INFO [train.py:996] (0/4) Epoch 3, batch 9200, loss[loss=0.2216, simple_loss=0.2996, pruned_loss=0.07178, over 21413.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3507, pruned_loss=0.1043, over 4269426.20 frames. ], batch size: 211, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:22:31,109 INFO [train.py:996] (0/4) Epoch 3, batch 9250, loss[loss=0.2988, simple_loss=0.3476, pruned_loss=0.125, over 21497.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3542, pruned_loss=0.1081, over 4266969.13 frames. ], batch size: 194, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:22:36,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=421434.0, ans=0.2 2023-06-19 08:23:14,078 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-19 08:23:39,394 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 3.086e+02 3.800e+02 4.447e+02 7.339e+02, threshold=7.599e+02, percent-clipped=2.0 2023-06-19 08:24:05,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=421674.0, ans=0.125 2023-06-19 08:24:17,069 INFO [train.py:996] (0/4) Epoch 3, batch 9300, loss[loss=0.2711, simple_loss=0.3146, pruned_loss=0.1138, over 20759.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3481, pruned_loss=0.108, over 4270742.45 frames. ], batch size: 608, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:24:55,930 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:25:49,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.32 vs. limit=5.0 2023-06-19 08:25:53,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=421974.0, ans=0.5 2023-06-19 08:25:53,814 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:25:59,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=421974.0, ans=0.07 2023-06-19 08:26:11,696 INFO [train.py:996] (0/4) Epoch 3, batch 9350, loss[loss=0.2962, simple_loss=0.3469, pruned_loss=0.1228, over 20068.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3562, pruned_loss=0.1101, over 4271232.41 frames. 
], batch size: 702, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:27:05,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.98 vs. limit=6.0 2023-06-19 08:27:20,740 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 3.071e+02 3.690e+02 4.644e+02 6.944e+02, threshold=7.381e+02, percent-clipped=0.0 2023-06-19 08:27:32,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=422214.0, ans=0.125 2023-06-19 08:27:34,193 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-19 08:27:46,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=422274.0, ans=0.2 2023-06-19 08:27:59,152 INFO [train.py:996] (0/4) Epoch 3, batch 9400, loss[loss=0.2521, simple_loss=0.3111, pruned_loss=0.09661, over 21879.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3577, pruned_loss=0.1107, over 4265521.68 frames. ], batch size: 107, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:28:00,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.07 vs. limit=15.0 2023-06-19 08:28:03,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=422334.0, ans=0.2 2023-06-19 08:28:45,596 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-19 08:29:12,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=422514.0, ans=0.1 2023-06-19 08:29:34,561 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:29:44,030 INFO [train.py:996] (0/4) Epoch 3, batch 9450, loss[loss=0.2691, simple_loss=0.3517, pruned_loss=0.09324, over 20795.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3484, pruned_loss=0.1094, over 4266864.40 frames. ], batch size: 609, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:30:19,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=22.5 2023-06-19 08:30:34,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.03 vs. limit=22.5 2023-06-19 08:30:51,267 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 3.147e+02 3.732e+02 4.957e+02 8.626e+02, threshold=7.464e+02, percent-clipped=5.0 2023-06-19 08:31:16,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=422874.0, ans=0.0 2023-06-19 08:31:25,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=422874.0, ans=0.125 2023-06-19 08:31:28,633 INFO [train.py:996] (0/4) Epoch 3, batch 9500, loss[loss=0.2477, simple_loss=0.3084, pruned_loss=0.09346, over 21349.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3409, pruned_loss=0.1083, over 4263676.20 frames. 
], batch size: 131, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:31:52,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=422934.0, ans=0.125 2023-06-19 08:32:44,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=423114.0, ans=0.125 2023-06-19 08:33:14,251 INFO [train.py:996] (0/4) Epoch 3, batch 9550, loss[loss=0.3158, simple_loss=0.371, pruned_loss=0.1303, over 21685.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3455, pruned_loss=0.1108, over 4271647.27 frames. ], batch size: 351, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:33:14,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=423234.0, ans=0.0 2023-06-19 08:33:22,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.78 vs. limit=15.0 2023-06-19 08:33:39,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=423294.0, ans=0.1 2023-06-19 08:34:20,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.824e+02 3.274e+02 3.853e+02 7.090e+02, threshold=6.547e+02, percent-clipped=0.0 2023-06-19 08:34:47,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=423474.0, ans=0.0 2023-06-19 08:34:50,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=423474.0, ans=0.125 2023-06-19 08:34:51,975 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:34:53,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=423474.0, ans=0.0 2023-06-19 08:34:58,291 INFO [train.py:996] (0/4) Epoch 3, batch 9600, loss[loss=0.2761, simple_loss=0.3343, pruned_loss=0.1089, over 21711.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3491, pruned_loss=0.1134, over 4280414.76 frames. ], batch size: 263, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:35:19,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=423534.0, ans=0.0 2023-06-19 08:35:34,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=423594.0, ans=0.1 2023-06-19 08:35:38,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=22.5 2023-06-19 08:35:59,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=423654.0, ans=0.125 2023-06-19 08:36:03,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=423714.0, ans=0.125 2023-06-19 08:36:45,596 INFO [train.py:996] (0/4) Epoch 3, batch 9650, loss[loss=0.2903, simple_loss=0.3518, pruned_loss=0.1144, over 21741.00 frames. ], tot_loss[loss=0.2869, simple_loss=0.3491, pruned_loss=0.1124, over 4284966.73 frames. 
], batch size: 298, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:37:14,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=423894.0, ans=0.125 2023-06-19 08:37:29,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=423894.0, ans=15.0 2023-06-19 08:37:35,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=423954.0, ans=0.125 2023-06-19 08:37:41,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=423954.0, ans=0.0 2023-06-19 08:37:58,445 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 3.232e+02 3.864e+02 5.587e+02 9.927e+02, threshold=7.728e+02, percent-clipped=9.0 2023-06-19 08:38:06,124 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-19 08:38:22,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-19 08:38:40,960 INFO [train.py:996] (0/4) Epoch 3, batch 9700, loss[loss=0.2542, simple_loss=0.3369, pruned_loss=0.08581, over 21864.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3516, pruned_loss=0.1132, over 4282932.10 frames. ], batch size: 371, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:39:18,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=424254.0, ans=0.2 2023-06-19 08:39:40,649 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-19 08:39:46,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=424314.0, ans=0.125 2023-06-19 08:39:51,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=424314.0, ans=0.2 2023-06-19 08:39:59,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=424374.0, ans=0.2 2023-06-19 08:40:04,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=424374.0, ans=0.0 2023-06-19 08:40:18,552 INFO [train.py:996] (0/4) Epoch 3, batch 9750, loss[loss=0.289, simple_loss=0.3337, pruned_loss=0.1222, over 21766.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3476, pruned_loss=0.1121, over 4282729.12 frames. ], batch size: 351, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:40:50,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=424494.0, ans=0.125 2023-06-19 08:41:18,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.037e+02 3.853e+02 4.411e+02 7.266e+02, threshold=7.707e+02, percent-clipped=0.0 2023-06-19 08:41:55,554 INFO [train.py:996] (0/4) Epoch 3, batch 9800, loss[loss=0.3202, simple_loss=0.3701, pruned_loss=0.1352, over 21725.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3464, pruned_loss=0.1117, over 4281383.81 frames. 
], batch size: 389, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:41:59,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=424734.0, ans=0.2 2023-06-19 08:42:24,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=424794.0, ans=0.125 2023-06-19 08:43:01,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=424914.0, ans=0.125 2023-06-19 08:43:03,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=424914.0, ans=0.2 2023-06-19 08:43:04,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.11 vs. limit=22.5 2023-06-19 08:43:39,055 INFO [train.py:996] (0/4) Epoch 3, batch 9850, loss[loss=0.2818, simple_loss=0.3205, pruned_loss=0.1216, over 21311.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3431, pruned_loss=0.1119, over 4277011.75 frames. ], batch size: 548, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:44:03,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=425094.0, ans=0.04949747468305833 2023-06-19 08:44:50,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.794e+02 3.273e+02 4.007e+02 7.022e+02, threshold=6.547e+02, percent-clipped=0.0 2023-06-19 08:44:54,713 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-19 08:45:00,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=425214.0, ans=0.0 2023-06-19 08:45:21,560 INFO [train.py:996] (0/4) Epoch 3, batch 9900, loss[loss=0.3372, simple_loss=0.3888, pruned_loss=0.1428, over 21371.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3388, pruned_loss=0.1111, over 4265914.82 frames. ], batch size: 471, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:45:44,826 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-19 08:45:51,560 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=22.5 2023-06-19 08:46:45,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-19 08:47:10,943 INFO [train.py:996] (0/4) Epoch 3, batch 9950, loss[loss=0.2779, simple_loss=0.334, pruned_loss=0.1109, over 21799.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3402, pruned_loss=0.1132, over 4257465.48 frames. 
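[Editor's note] The learning rate in the progress lines decays slowly as training proceeds (1.15e-02 -> 1.14e-02 -> 1.13e-02 -> ...). One schedule with that shape is an Eden-style rule where the LR shrinks as a -1/4 power of both the global batch index and the (possibly fractional) epoch. The sketch below assumes the form lr = base_lr * ((step^2 + lr_batches^2)/lr_batches^2)^(-1/4) * ((epoch^2 + lr_epochs^2)/lr_epochs^2)^(-1/4); all constants are illustrative placeholders, and this should be read as an assumed formula, not a verbatim copy of the scheduler used in this run.

```python
# Assumed Eden-style learning-rate rule: -1/4 power decay in both the global
# batch index and the fractional epoch. Constants below are illustrative.
def eden_lr(step: float, epoch: float,
            base_lr: float = 0.045,
            lr_batches: float = 7500.0,
            lr_epochs: float = 1.5) -> float:
    batch_factor = ((step ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor


# Roughly the regime of the lines above (third epoch, ~70k updates); the
# printed values come out in the same ~1.1e-02 range as the log.
for step in (68000, 70000, 72000):
    print(f"step={step}, lr: {eden_lr(step, epoch=2.0):.2e}")
```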
], batch size: 118, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:47:35,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=425694.0, ans=0.125 2023-06-19 08:47:46,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=425694.0, ans=0.0 2023-06-19 08:47:59,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=425754.0, ans=0.1 2023-06-19 08:48:18,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.027e+02 3.494e+02 4.278e+02 9.586e+02, threshold=6.989e+02, percent-clipped=3.0 2023-06-19 08:49:03,424 INFO [train.py:996] (0/4) Epoch 3, batch 10000, loss[loss=0.27, simple_loss=0.3346, pruned_loss=0.1027, over 20716.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3348, pruned_loss=0.111, over 4254669.60 frames. ], batch size: 608, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:49:07,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=425934.0, ans=0.0 2023-06-19 08:49:24,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=425994.0, ans=0.1 2023-06-19 08:49:41,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=425994.0, ans=15.0 2023-06-19 08:50:45,789 INFO [train.py:996] (0/4) Epoch 3, batch 10050, loss[loss=0.2127, simple_loss=0.2883, pruned_loss=0.06854, over 21428.00 frames. ], tot_loss[loss=0.281, simple_loss=0.3383, pruned_loss=0.1119, over 4259895.49 frames. ], batch size: 211, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:51:50,940 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.885e+02 3.512e+02 4.083e+02 6.660e+02, threshold=7.024e+02, percent-clipped=0.0 2023-06-19 08:51:51,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=426414.0, ans=0.125 2023-06-19 08:52:37,749 INFO [train.py:996] (0/4) Epoch 3, batch 10100, loss[loss=0.3089, simple_loss=0.361, pruned_loss=0.1283, over 21388.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3343, pruned_loss=0.1078, over 4263876.03 frames. ], batch size: 131, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:52:48,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=426534.0, ans=0.125 2023-06-19 08:53:58,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=426714.0, ans=0.125 2023-06-19 08:54:09,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=426774.0, ans=0.125 2023-06-19 08:54:23,736 INFO [train.py:996] (0/4) Epoch 3, batch 10150, loss[loss=0.3137, simple_loss=0.3776, pruned_loss=0.1249, over 21407.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3434, pruned_loss=0.1115, over 4260691.55 frames. 
], batch size: 143, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:55:35,207 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.064e+02 3.671e+02 4.442e+02 6.348e+02, threshold=7.343e+02, percent-clipped=0.0 2023-06-19 08:55:45,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-19 08:56:10,029 INFO [train.py:996] (0/4) Epoch 3, batch 10200, loss[loss=0.3287, simple_loss=0.3848, pruned_loss=0.1363, over 21374.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3416, pruned_loss=0.1087, over 4267160.14 frames. ], batch size: 507, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:56:15,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=427134.0, ans=10.0 2023-06-19 08:56:38,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=427194.0, ans=0.125 2023-06-19 08:56:48,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=427194.0, ans=0.0 2023-06-19 08:57:30,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=427314.0, ans=0.125 2023-06-19 08:57:57,571 INFO [train.py:996] (0/4) Epoch 3, batch 10250, loss[loss=0.3017, simple_loss=0.3709, pruned_loss=0.1162, over 21587.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3333, pruned_loss=0.101, over 4274734.28 frames. ], batch size: 389, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:57:58,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=427434.0, ans=0.04949747468305833 2023-06-19 08:58:04,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=427434.0, ans=0.125 2023-06-19 08:58:27,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=427494.0, ans=0.125 2023-06-19 08:59:07,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 2.442e+02 2.730e+02 3.285e+02 6.537e+02, threshold=5.460e+02, percent-clipped=0.0 2023-06-19 08:59:08,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=427614.0, ans=0.05 2023-06-19 08:59:43,928 INFO [train.py:996] (0/4) Epoch 3, batch 10300, loss[loss=0.3069, simple_loss=0.3619, pruned_loss=0.126, over 21345.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3366, pruned_loss=0.1024, over 4277337.15 frames. ], batch size: 549, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:59:51,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=427734.0, ans=0.125 2023-06-19 09:00:24,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=427794.0, ans=0.0 2023-06-19 09:00:24,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=427794.0, ans=10.0 2023-06-19 09:00:26,664 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.54 vs. 
limit=15.0 2023-06-19 09:00:50,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=427854.0, ans=0.0 2023-06-19 09:00:53,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=427914.0, ans=0.0 2023-06-19 09:00:53,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=427914.0, ans=0.04949747468305833 2023-06-19 09:01:38,600 INFO [train.py:996] (0/4) Epoch 3, batch 10350, loss[loss=0.2801, simple_loss=0.356, pruned_loss=0.1021, over 21564.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3391, pruned_loss=0.1026, over 4279841.00 frames. ], batch size: 441, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 09:01:45,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=428034.0, ans=0.125 2023-06-19 09:02:03,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=428094.0, ans=0.0 2023-06-19 09:02:36,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=428154.0, ans=0.0 2023-06-19 09:02:50,535 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.191e+02 3.696e+02 4.662e+02 9.387e+02, threshold=7.392e+02, percent-clipped=8.0 2023-06-19 09:03:11,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.96 vs. limit=6.0 2023-06-19 09:03:20,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=428274.0, ans=0.125 2023-06-19 09:03:26,640 INFO [train.py:996] (0/4) Epoch 3, batch 10400, loss[loss=0.2351, simple_loss=0.2968, pruned_loss=0.08671, over 21787.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.336, pruned_loss=0.1014, over 4273871.18 frames. ], batch size: 282, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:03:32,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=428334.0, ans=0.2 2023-06-19 09:04:23,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=428454.0, ans=0.2 2023-06-19 09:05:11,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=428574.0, ans=0.125 2023-06-19 09:05:12,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=428634.0, ans=0.125 2023-06-19 09:05:19,066 INFO [train.py:996] (0/4) Epoch 3, batch 10450, loss[loss=0.2748, simple_loss=0.3347, pruned_loss=0.1074, over 21181.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3414, pruned_loss=0.1052, over 4273724.79 frames. ], batch size: 159, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:05:44,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.53 vs. 
limit=15.0 2023-06-19 09:05:50,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=428694.0, ans=0.2 2023-06-19 09:06:09,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=428754.0, ans=0.0 2023-06-19 09:06:15,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=428754.0, ans=0.0 2023-06-19 09:06:28,890 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.522e+02 4.182e+02 5.744e+02 1.036e+03, threshold=8.363e+02, percent-clipped=11.0 2023-06-19 09:06:55,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=428874.0, ans=0.125 2023-06-19 09:06:59,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=428874.0, ans=0.0 2023-06-19 09:07:03,363 INFO [train.py:996] (0/4) Epoch 3, batch 10500, loss[loss=0.2326, simple_loss=0.33, pruned_loss=0.06759, over 19748.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3406, pruned_loss=0.104, over 4274305.60 frames. ], batch size: 702, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:07:32,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=428994.0, ans=0.0 2023-06-19 09:07:53,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=429054.0, ans=0.125 2023-06-19 09:08:41,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=429174.0, ans=0.0 2023-06-19 09:08:49,591 INFO [train.py:996] (0/4) Epoch 3, batch 10550, loss[loss=0.3006, simple_loss=0.3949, pruned_loss=0.1031, over 20860.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3343, pruned_loss=0.1042, over 4267229.47 frames. ], batch size: 608, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:09:15,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=429294.0, ans=0.2 2023-06-19 09:09:23,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=429294.0, ans=0.0 2023-06-19 09:10:00,988 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 2.936e+02 3.551e+02 4.371e+02 5.985e+02, threshold=7.102e+02, percent-clipped=0.0 2023-06-19 09:10:21,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=15.0 2023-06-19 09:10:36,948 INFO [train.py:996] (0/4) Epoch 3, batch 10600, loss[loss=0.2648, simple_loss=0.3481, pruned_loss=0.09071, over 21764.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3292, pruned_loss=0.1024, over 4260051.45 frames. ], batch size: 351, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:11:02,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. 
limit=6.0 2023-06-19 09:11:30,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=429654.0, ans=0.125 2023-06-19 09:11:33,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=429654.0, ans=0.125 2023-06-19 09:11:54,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=429714.0, ans=0.0 2023-06-19 09:12:21,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=429774.0, ans=0.5 2023-06-19 09:12:30,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=429834.0, ans=0.2 2023-06-19 09:12:31,302 INFO [train.py:996] (0/4) Epoch 3, batch 10650, loss[loss=0.1851, simple_loss=0.253, pruned_loss=0.05859, over 21307.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3335, pruned_loss=0.1017, over 4267696.65 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:13:12,124 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:13:14,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=429894.0, ans=0.2 2023-06-19 09:13:41,879 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.491e+02 4.438e+02 6.074e+02 1.034e+03, threshold=8.876e+02, percent-clipped=13.0 2023-06-19 09:13:49,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-19 09:14:02,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=430074.0, ans=12.0 2023-06-19 09:14:18,009 INFO [train.py:996] (0/4) Epoch 3, batch 10700, loss[loss=0.2926, simple_loss=0.3401, pruned_loss=0.1225, over 20038.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3319, pruned_loss=0.102, over 4253057.32 frames. 
], batch size: 702, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:14:32,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=430134.0, ans=0.125 2023-06-19 09:14:59,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=430194.0, ans=0.125 2023-06-19 09:15:01,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=430254.0, ans=0.125 2023-06-19 09:15:05,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=430254.0, ans=0.125 2023-06-19 09:15:08,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=430254.0, ans=0.125 2023-06-19 09:15:08,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=430254.0, ans=0.0 2023-06-19 09:15:13,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=430254.0, ans=0.125 2023-06-19 09:15:15,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=430254.0, ans=0.09899494936611666 2023-06-19 09:15:27,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=430314.0, ans=0.0 2023-06-19 09:15:52,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=430374.0, ans=0.125 2023-06-19 09:16:09,169 INFO [train.py:996] (0/4) Epoch 3, batch 10750, loss[loss=0.3521, simple_loss=0.4276, pruned_loss=0.1383, over 21750.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3421, pruned_loss=0.1069, over 4258115.23 frames. ], batch size: 441, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:16:56,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=430554.0, ans=0.0 2023-06-19 09:17:19,513 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 3.011e+02 3.555e+02 4.505e+02 9.587e+02, threshold=7.110e+02, percent-clipped=1.0 2023-06-19 09:17:39,116 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-19 09:18:01,339 INFO [train.py:996] (0/4) Epoch 3, batch 10800, loss[loss=0.295, simple_loss=0.3601, pruned_loss=0.115, over 21361.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3486, pruned_loss=0.1088, over 4265944.92 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:18:08,968 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. 
limit=10.0 2023-06-19 09:18:32,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=430794.0, ans=0.1 2023-06-19 09:18:46,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=430854.0, ans=0.2 2023-06-19 09:18:57,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=430854.0, ans=0.0 2023-06-19 09:19:40,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=430974.0, ans=0.125 2023-06-19 09:19:48,721 INFO [train.py:996] (0/4) Epoch 3, batch 10850, loss[loss=0.2317, simple_loss=0.2939, pruned_loss=0.08478, over 21567.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3487, pruned_loss=0.1084, over 4260525.93 frames. ], batch size: 263, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:20:59,267 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.993e+02 3.612e+02 4.329e+02 8.050e+02, threshold=7.223e+02, percent-clipped=3.0 2023-06-19 09:21:37,059 INFO [train.py:996] (0/4) Epoch 3, batch 10900, loss[loss=0.2447, simple_loss=0.3358, pruned_loss=0.07678, over 21718.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3419, pruned_loss=0.1062, over 4256989.21 frames. ], batch size: 282, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:21:37,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=431334.0, ans=0.2 2023-06-19 09:22:01,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=431394.0, ans=0.0 2023-06-19 09:22:24,986 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=12.0 2023-06-19 09:22:39,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=431454.0, ans=0.125 2023-06-19 09:23:09,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=431574.0, ans=0.1 2023-06-19 09:23:22,720 INFO [train.py:996] (0/4) Epoch 3, batch 10950, loss[loss=0.2876, simple_loss=0.3298, pruned_loss=0.1227, over 21251.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3366, pruned_loss=0.1035, over 4255847.91 frames. ], batch size: 144, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:23:23,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=431634.0, ans=0.125 2023-06-19 09:23:27,578 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=17.27 vs. limit=15.0 2023-06-19 09:24:30,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.913e+02 3.689e+02 4.516e+02 9.090e+02, threshold=7.379e+02, percent-clipped=2.0 2023-06-19 09:24:30,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=431814.0, ans=0.0 2023-06-19 09:24:33,145 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.35 vs. 
limit=22.5 2023-06-19 09:25:02,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=431874.0, ans=0.125 2023-06-19 09:25:07,368 INFO [train.py:996] (0/4) Epoch 3, batch 11000, loss[loss=0.2847, simple_loss=0.3375, pruned_loss=0.116, over 21701.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3361, pruned_loss=0.1044, over 4249939.21 frames. ], batch size: 230, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:25:28,712 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-72000.pt 2023-06-19 09:25:34,969 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-19 09:26:18,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=432114.0, ans=0.125 2023-06-19 09:26:26,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=432114.0, ans=0.125 2023-06-19 09:26:36,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=432174.0, ans=0.0 2023-06-19 09:26:53,497 INFO [train.py:996] (0/4) Epoch 3, batch 11050, loss[loss=0.2295, simple_loss=0.2834, pruned_loss=0.0878, over 21574.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3348, pruned_loss=0.1065, over 4240127.00 frames. ], batch size: 263, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:27:35,696 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-19 09:27:55,633 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.002e+02 3.534e+02 4.546e+02 1.059e+03, threshold=7.067e+02, percent-clipped=5.0 2023-06-19 09:28:18,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=432474.0, ans=0.125 2023-06-19 09:28:36,740 INFO [train.py:996] (0/4) Epoch 3, batch 11100, loss[loss=0.3217, simple_loss=0.3688, pruned_loss=0.1374, over 21594.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.334, pruned_loss=0.1074, over 4234666.09 frames. ], batch size: 414, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:28:47,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=432534.0, ans=0.5 2023-06-19 09:30:19,282 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:30:23,567 INFO [train.py:996] (0/4) Epoch 3, batch 11150, loss[loss=0.2462, simple_loss=0.3363, pruned_loss=0.07807, over 21701.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3319, pruned_loss=0.1069, over 4238575.69 frames. 
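[Editor's note] The `checkpoint.py:75` line above saves an intermediate checkpoint named after the global batch index (`checkpoint-72000.pt`). A bare-bones sketch of that kind of periodic, batch-indexed save is below; the interval, directory handling, and the keys of the saved dict are assumptions for illustration, not the script's exact layout.

```python
# Minimal sketch of saving a batch-indexed checkpoint every `save_every_n`
# updates. Interval, path handling and dict contents are assumed.
from pathlib import Path

import torch


def maybe_save_checkpoint(model, optimizer, scaler,
                          batch_idx_train: int,
                          exp_dir: Path,
                          save_every_n: int = 4000) -> None:
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    exp_dir.mkdir(parents=True, exist_ok=True)
    filename = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    print(f"Saving checkpoint to {filename}")
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "grad_scaler": scaler.state_dict() if scaler is not None else None,
            "batch_idx_train": batch_idx_train,
        },
        filename,
    )
```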
], batch size: 332, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:30:44,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=432894.0, ans=0.0 2023-06-19 09:30:55,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=432894.0, ans=0.125 2023-06-19 09:31:34,241 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.954e+02 3.953e+02 5.206e+02 1.006e+03, threshold=7.907e+02, percent-clipped=9.0 2023-06-19 09:31:58,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=433074.0, ans=0.05 2023-06-19 09:32:03,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433074.0, ans=0.1 2023-06-19 09:32:10,331 INFO [train.py:996] (0/4) Epoch 3, batch 11200, loss[loss=0.2714, simple_loss=0.3184, pruned_loss=0.1122, over 21660.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3305, pruned_loss=0.107, over 4249770.82 frames. ], batch size: 333, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:32:24,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=433134.0, ans=0.125 2023-06-19 09:32:26,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=433134.0, ans=0.125 2023-06-19 09:32:27,859 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:33:11,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=433314.0, ans=0.0 2023-06-19 09:33:54,845 INFO [train.py:996] (0/4) Epoch 3, batch 11250, loss[loss=0.3113, simple_loss=0.3725, pruned_loss=0.125, over 21800.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3287, pruned_loss=0.106, over 4256754.73 frames. ], batch size: 118, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:34:08,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=433434.0, ans=0.1 2023-06-19 09:34:41,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=433554.0, ans=0.125 2023-06-19 09:35:00,782 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.899e+02 3.491e+02 4.112e+02 6.627e+02, threshold=6.983e+02, percent-clipped=0.0 2023-06-19 09:35:41,867 INFO [train.py:996] (0/4) Epoch 3, batch 11300, loss[loss=0.2941, simple_loss=0.3429, pruned_loss=0.1226, over 21304.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3292, pruned_loss=0.1055, over 4265158.93 frames. ], batch size: 159, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:35:54,420 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.61 vs. limit=10.0 2023-06-19 09:36:04,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=433794.0, ans=0.125 2023-06-19 09:36:06,884 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.90 vs. 
limit=10.0 2023-06-19 09:36:58,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=433914.0, ans=0.125 2023-06-19 09:37:10,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=433974.0, ans=0.125 2023-06-19 09:37:28,349 INFO [train.py:996] (0/4) Epoch 3, batch 11350, loss[loss=0.262, simple_loss=0.3375, pruned_loss=0.09323, over 21631.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3313, pruned_loss=0.1042, over 4269130.45 frames. ], batch size: 263, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:37:36,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.73 vs. limit=5.0 2023-06-19 09:37:54,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=434094.0, ans=0.125 2023-06-19 09:38:45,272 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.245e+02 4.001e+02 4.934e+02 9.082e+02, threshold=8.002e+02, percent-clipped=8.0 2023-06-19 09:39:08,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434274.0, ans=0.1 2023-06-19 09:39:13,776 INFO [train.py:996] (0/4) Epoch 3, batch 11400, loss[loss=0.2838, simple_loss=0.3301, pruned_loss=0.1188, over 21257.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3355, pruned_loss=0.1061, over 4275729.54 frames. ], batch size: 608, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:39:43,642 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-19 09:39:55,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=434394.0, ans=6.0 2023-06-19 09:40:32,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=434514.0, ans=0.2 2023-06-19 09:41:06,650 INFO [train.py:996] (0/4) Epoch 3, batch 11450, loss[loss=0.232, simple_loss=0.2991, pruned_loss=0.08242, over 21277.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3375, pruned_loss=0.1049, over 4281090.98 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:41:12,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=434634.0, ans=0.0 2023-06-19 09:41:23,449 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:41:47,581 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.54 vs. 
limit=6.0 2023-06-19 09:42:19,365 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 3.142e+02 3.588e+02 4.835e+02 7.937e+02, threshold=7.176e+02, percent-clipped=0.0 2023-06-19 09:42:21,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434814.0, ans=0.1 2023-06-19 09:42:30,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=434874.0, ans=0.0 2023-06-19 09:42:30,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=434874.0, ans=0.0 2023-06-19 09:42:53,711 INFO [train.py:996] (0/4) Epoch 3, batch 11500, loss[loss=0.2458, simple_loss=0.3302, pruned_loss=0.08067, over 21857.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3413, pruned_loss=0.1064, over 4280241.35 frames. ], batch size: 316, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:42:54,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=434934.0, ans=0.1 2023-06-19 09:44:16,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=435174.0, ans=0.125 2023-06-19 09:44:16,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=435174.0, ans=0.04949747468305833 2023-06-19 09:44:30,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0 2023-06-19 09:44:32,660 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-19 09:44:43,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=435174.0, ans=0.2 2023-06-19 09:44:43,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.59 vs. limit=15.0 2023-06-19 09:44:46,149 INFO [train.py:996] (0/4) Epoch 3, batch 11550, loss[loss=0.3074, simple_loss=0.3717, pruned_loss=0.1215, over 19966.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3456, pruned_loss=0.106, over 4271629.38 frames. ], batch size: 704, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:44:51,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=435234.0, ans=0.0 2023-06-19 09:45:19,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=435294.0, ans=0.0 2023-06-19 09:45:28,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=435294.0, ans=0.025 2023-06-19 09:45:33,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=435354.0, ans=0.0 2023-06-19 09:45:54,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 2.996e+02 3.693e+02 5.072e+02 8.592e+02, threshold=7.387e+02, percent-clipped=2.0 2023-06-19 09:46:21,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.41 vs. 
limit=10.0 2023-06-19 09:46:24,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=435474.0, ans=0.125 2023-06-19 09:46:26,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=435474.0, ans=0.125 2023-06-19 09:46:38,549 INFO [train.py:996] (0/4) Epoch 3, batch 11600, loss[loss=0.3031, simple_loss=0.395, pruned_loss=0.1057, over 21674.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3613, pruned_loss=0.108, over 4269859.31 frames. ], batch size: 247, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:46:56,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=435534.0, ans=0.125 2023-06-19 09:47:02,385 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=15.0 2023-06-19 09:47:08,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=435594.0, ans=0.0 2023-06-19 09:47:19,885 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=22.5 2023-06-19 09:47:19,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.42 vs. limit=10.0 2023-06-19 09:47:52,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=435714.0, ans=0.1 2023-06-19 09:48:01,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=435774.0, ans=0.125 2023-06-19 09:48:24,159 INFO [train.py:996] (0/4) Epoch 3, batch 11650, loss[loss=0.2614, simple_loss=0.3354, pruned_loss=0.09374, over 21561.00 frames. ], tot_loss[loss=0.2931, simple_loss=0.3675, pruned_loss=0.1094, over 4272534.72 frames. ], batch size: 230, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:49:04,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=435954.0, ans=0.125 2023-06-19 09:49:32,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 3.087e+02 3.546e+02 4.429e+02 8.703e+02, threshold=7.092e+02, percent-clipped=3.0 2023-06-19 09:49:39,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=436014.0, ans=0.125 2023-06-19 09:49:57,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=436074.0, ans=0.125 2023-06-19 09:50:04,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.92 vs. limit=6.0 2023-06-19 09:50:10,370 INFO [train.py:996] (0/4) Epoch 3, batch 11700, loss[loss=0.2407, simple_loss=0.3018, pruned_loss=0.08982, over 21618.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3583, pruned_loss=0.1082, over 4272331.44 frames. ], batch size: 332, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:50:23,710 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. 
limit=12.0 2023-06-19 09:50:44,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=436194.0, ans=0.0 2023-06-19 09:51:14,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=436314.0, ans=0.125 2023-06-19 09:51:56,354 INFO [train.py:996] (0/4) Epoch 3, batch 11750, loss[loss=0.3007, simple_loss=0.3619, pruned_loss=0.1198, over 21416.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3472, pruned_loss=0.1072, over 4275157.10 frames. ], batch size: 131, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:51:56,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=436434.0, ans=0.0 2023-06-19 09:52:40,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=436554.0, ans=0.2 2023-06-19 09:52:51,099 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-19 09:52:54,486 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=22.5 2023-06-19 09:53:03,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.275e+02 3.279e+02 3.653e+02 4.613e+02 8.659e+02, threshold=7.305e+02, percent-clipped=4.0 2023-06-19 09:53:19,167 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-19 09:53:41,264 INFO [train.py:996] (0/4) Epoch 3, batch 11800, loss[loss=0.2475, simple_loss=0.3369, pruned_loss=0.07904, over 21380.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3511, pruned_loss=0.1114, over 4280426.14 frames. ], batch size: 211, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:54:26,359 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-19 09:54:44,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=436914.0, ans=0.0 2023-06-19 09:54:46,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=436914.0, ans=0.125 2023-06-19 09:55:31,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=436974.0, ans=0.2 2023-06-19 09:55:33,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=437034.0, ans=0.1 2023-06-19 09:55:33,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=437034.0, ans=0.2 2023-06-19 09:55:34,587 INFO [train.py:996] (0/4) Epoch 3, batch 11850, loss[loss=0.3222, simple_loss=0.4054, pruned_loss=0.1195, over 20789.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3528, pruned_loss=0.1104, over 4288519.75 frames. 
], batch size: 607, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:55:46,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=437034.0, ans=0.0 2023-06-19 09:55:49,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=437034.0, ans=0.1 2023-06-19 09:56:20,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=437154.0, ans=0.0 2023-06-19 09:56:20,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=437154.0, ans=0.05 2023-06-19 09:56:37,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=437214.0, ans=0.0 2023-06-19 09:56:44,321 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.974e+02 3.438e+02 3.994e+02 6.906e+02, threshold=6.876e+02, percent-clipped=0.0 2023-06-19 09:57:18,697 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0 2023-06-19 09:57:22,869 INFO [train.py:996] (0/4) Epoch 3, batch 11900, loss[loss=0.2316, simple_loss=0.315, pruned_loss=0.07407, over 21751.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3503, pruned_loss=0.1066, over 4289625.13 frames. ], batch size: 282, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:57:50,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=437394.0, ans=0.1 2023-06-19 09:58:08,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=437454.0, ans=15.0 2023-06-19 09:58:55,890 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5 2023-06-19 09:58:59,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=437574.0, ans=0.02 2023-06-19 09:59:06,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=437574.0, ans=0.0 2023-06-19 09:59:10,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=437634.0, ans=0.2 2023-06-19 09:59:10,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=437634.0, ans=0.1 2023-06-19 09:59:11,145 INFO [train.py:996] (0/4) Epoch 3, batch 11950, loss[loss=0.2248, simple_loss=0.2906, pruned_loss=0.07949, over 21804.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3494, pruned_loss=0.1027, over 4282447.25 frames. ], batch size: 102, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 09:59:15,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.06 vs. 
limit=15.0 2023-06-19 09:59:26,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=437634.0, ans=0.1 2023-06-19 09:59:51,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=437754.0, ans=0.2 2023-06-19 09:59:55,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=437754.0, ans=0.0 2023-06-19 10:00:15,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=437814.0, ans=0.2 2023-06-19 10:00:29,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.768e+02 3.443e+02 4.738e+02 7.856e+02, threshold=6.886e+02, percent-clipped=6.0 2023-06-19 10:00:40,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=437874.0, ans=0.125 2023-06-19 10:00:56,222 INFO [train.py:996] (0/4) Epoch 3, batch 12000, loss[loss=0.2655, simple_loss=0.3197, pruned_loss=0.1056, over 21842.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3422, pruned_loss=0.1005, over 4275895.11 frames. ], batch size: 318, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:00:56,223 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 10:01:15,356 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.279, simple_loss=0.3755, pruned_loss=0.09124, over 1796401.00 frames. 2023-06-19 10:01:15,357 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-19 10:01:19,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=437934.0, ans=0.2 2023-06-19 10:01:38,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=437994.0, ans=0.2 2023-06-19 10:01:44,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=437994.0, ans=0.125 2023-06-19 10:02:31,914 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:02:47,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=438174.0, ans=0.2 2023-06-19 10:02:48,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=438174.0, ans=0.1 2023-06-19 10:03:02,166 INFO [train.py:996] (0/4) Epoch 3, batch 12050, loss[loss=0.2774, simple_loss=0.3302, pruned_loss=0.1123, over 21893.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3404, pruned_loss=0.1024, over 4273143.07 frames. ], batch size: 351, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:03:02,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=438234.0, ans=0.2 2023-06-19 10:03:21,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=438294.0, ans=0.035 2023-06-19 10:03:58,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.12 vs. 
limit=12.0 2023-06-19 10:04:09,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=438414.0, ans=0.1 2023-06-19 10:04:18,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.247e+02 3.603e+02 4.114e+02 7.694e+02, threshold=7.207e+02, percent-clipped=1.0 2023-06-19 10:04:24,195 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:04:36,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=438474.0, ans=0.0 2023-06-19 10:04:43,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=438474.0, ans=0.0 2023-06-19 10:04:49,672 INFO [train.py:996] (0/4) Epoch 3, batch 12100, loss[loss=0.275, simple_loss=0.3195, pruned_loss=0.1152, over 20767.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3483, pruned_loss=0.1086, over 4273071.13 frames. ], batch size: 607, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:04:53,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=438534.0, ans=0.125 2023-06-19 10:05:15,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=438594.0, ans=0.0 2023-06-19 10:05:26,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=438594.0, ans=0.125 2023-06-19 10:06:38,101 INFO [train.py:996] (0/4) Epoch 3, batch 12150, loss[loss=0.2962, simple_loss=0.3509, pruned_loss=0.1207, over 20662.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3538, pruned_loss=0.1089, over 4276749.99 frames. ], batch size: 607, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:07:00,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=438834.0, ans=0.0 2023-06-19 10:07:19,171 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=12.0 2023-06-19 10:07:43,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=438954.0, ans=0.025 2023-06-19 10:07:55,463 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 3.586e+02 4.536e+02 5.596e+02 8.610e+02, threshold=9.073e+02, percent-clipped=5.0 2023-06-19 10:08:02,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=439074.0, ans=0.125 2023-06-19 10:08:33,687 INFO [train.py:996] (0/4) Epoch 3, batch 12200, loss[loss=0.23, simple_loss=0.2873, pruned_loss=0.08636, over 21680.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3501, pruned_loss=0.1073, over 4273055.51 frames. ], batch size: 299, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:08:37,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=439134.0, ans=0.125 2023-06-19 10:08:56,014 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. 
limit=15.0 2023-06-19 10:09:27,647 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:10:12,837 INFO [train.py:996] (0/4) Epoch 3, batch 12250, loss[loss=0.215, simple_loss=0.2868, pruned_loss=0.07163, over 21556.00 frames. ], tot_loss[loss=0.274, simple_loss=0.3409, pruned_loss=0.1035, over 4277511.10 frames. ], batch size: 212, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:10:33,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=439494.0, ans=0.125 2023-06-19 10:10:47,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=439494.0, ans=0.125 2023-06-19 10:11:22,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.723e+02 2.671e+02 3.353e+02 4.398e+02 1.093e+03, threshold=6.707e+02, percent-clipped=1.0 2023-06-19 10:11:24,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=439614.0, ans=0.125 2023-06-19 10:11:50,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=439674.0, ans=0.125 2023-06-19 10:11:54,806 INFO [train.py:996] (0/4) Epoch 3, batch 12300, loss[loss=0.232, simple_loss=0.3216, pruned_loss=0.07117, over 21740.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3321, pruned_loss=0.09601, over 4279528.83 frames. ], batch size: 332, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:12:22,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=439794.0, ans=0.125 2023-06-19 10:13:39,849 INFO [train.py:996] (0/4) Epoch 3, batch 12350, loss[loss=0.3228, simple_loss=0.3995, pruned_loss=0.123, over 21338.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3368, pruned_loss=0.09662, over 4279794.52 frames. ], batch size: 548, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:13:51,003 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-19 10:14:50,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=440214.0, ans=0.125 2023-06-19 10:14:53,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 2.837e+02 3.590e+02 4.995e+02 8.694e+02, threshold=7.180e+02, percent-clipped=5.0 2023-06-19 10:15:22,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=440274.0, ans=0.0 2023-06-19 10:15:30,270 INFO [train.py:996] (0/4) Epoch 3, batch 12400, loss[loss=0.2675, simple_loss=0.3268, pruned_loss=0.1041, over 21549.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3394, pruned_loss=0.102, over 4281954.35 frames. ], batch size: 131, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:15:44,825 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=5.125e-03 2023-06-19 10:16:48,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=440574.0, ans=0.0 2023-06-19 10:17:23,800 INFO [train.py:996] (0/4) Epoch 3, batch 12450, loss[loss=0.3063, simple_loss=0.3613, pruned_loss=0.1257, over 21472.00 frames. 
], tot_loss[loss=0.2786, simple_loss=0.3442, pruned_loss=0.1065, over 4287706.06 frames. ], batch size: 211, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:17:48,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-19 10:18:16,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=440754.0, ans=0.125 2023-06-19 10:18:35,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 3.264e+02 3.825e+02 4.731e+02 7.932e+02, threshold=7.651e+02, percent-clipped=1.0 2023-06-19 10:19:02,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=440874.0, ans=0.1 2023-06-19 10:19:02,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=440874.0, ans=0.125 2023-06-19 10:19:12,411 INFO [train.py:996] (0/4) Epoch 3, batch 12500, loss[loss=0.325, simple_loss=0.3936, pruned_loss=0.1282, over 21368.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3567, pruned_loss=0.1111, over 4282864.11 frames. ], batch size: 159, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:19:41,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=440994.0, ans=0.0 2023-06-19 10:21:05,742 INFO [train.py:996] (0/4) Epoch 3, batch 12550, loss[loss=0.2997, simple_loss=0.3645, pruned_loss=0.1174, over 21976.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.3614, pruned_loss=0.114, over 4278658.79 frames. ], batch size: 317, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:21:06,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=441234.0, ans=0.125 2023-06-19 10:21:10,450 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2023-06-19 10:21:36,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=441294.0, ans=0.1 2023-06-19 10:22:12,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=441414.0, ans=0.1 2023-06-19 10:22:14,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=441414.0, ans=0.125 2023-06-19 10:22:22,024 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 3.217e+02 3.804e+02 4.730e+02 9.875e+02, threshold=7.608e+02, percent-clipped=0.0 2023-06-19 10:22:52,423 INFO [train.py:996] (0/4) Epoch 3, batch 12600, loss[loss=0.233, simple_loss=0.3049, pruned_loss=0.08057, over 21510.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3612, pruned_loss=0.1119, over 4280130.32 frames. ], batch size: 212, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:22:55,463 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.72 vs. 
limit=15.0 2023-06-19 10:23:02,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=441534.0, ans=0.2 2023-06-19 10:23:19,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=441594.0, ans=0.0 2023-06-19 10:23:53,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=441654.0, ans=0.0 2023-06-19 10:24:12,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=441774.0, ans=0.125 2023-06-19 10:24:35,258 INFO [train.py:996] (0/4) Epoch 3, batch 12650, loss[loss=0.2616, simple_loss=0.3271, pruned_loss=0.09802, over 21849.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3488, pruned_loss=0.1053, over 4274469.22 frames. ], batch size: 391, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:24:52,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=441894.0, ans=0.1 2023-06-19 10:25:11,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=441894.0, ans=0.125 2023-06-19 10:25:47,085 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.848e+02 3.430e+02 4.434e+02 6.952e+02, threshold=6.860e+02, percent-clipped=1.0 2023-06-19 10:26:04,619 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-19 10:26:07,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=442074.0, ans=0.125 2023-06-19 10:26:17,200 INFO [train.py:996] (0/4) Epoch 3, batch 12700, loss[loss=0.3289, simple_loss=0.3794, pruned_loss=0.1392, over 21494.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3497, pruned_loss=0.1085, over 4279710.28 frames. ], batch size: 211, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:27:27,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=442314.0, ans=0.2 2023-06-19 10:27:55,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=442374.0, ans=0.125 2023-06-19 10:28:03,211 INFO [train.py:996] (0/4) Epoch 3, batch 12750, loss[loss=0.252, simple_loss=0.3296, pruned_loss=0.08719, over 21778.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3531, pruned_loss=0.1104, over 4275354.57 frames. ], batch size: 282, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:29:09,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=442614.0, ans=0.125 2023-06-19 10:29:16,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.440e+02 3.021e+02 3.668e+02 4.359e+02 6.708e+02, threshold=7.336e+02, percent-clipped=0.0 2023-06-19 10:29:46,075 INFO [train.py:996] (0/4) Epoch 3, batch 12800, loss[loss=0.2876, simple_loss=0.349, pruned_loss=0.1131, over 21641.00 frames. ], tot_loss[loss=0.2867, simple_loss=0.3522, pruned_loss=0.1106, over 4273859.05 frames. 
], batch size: 263, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:30:32,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.35 vs. limit=15.0 2023-06-19 10:30:36,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=442854.0, ans=0.0 2023-06-19 10:30:43,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-19 10:31:22,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=442974.0, ans=0.1 2023-06-19 10:31:40,113 INFO [train.py:996] (0/4) Epoch 3, batch 12850, loss[loss=0.2813, simple_loss=0.3641, pruned_loss=0.09925, over 21749.00 frames. ], tot_loss[loss=0.2889, simple_loss=0.3539, pruned_loss=0.112, over 4276484.52 frames. ], batch size: 441, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:31:59,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=443034.0, ans=0.125 2023-06-19 10:32:09,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-19 10:32:34,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=443154.0, ans=0.5 2023-06-19 10:32:34,923 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-19 10:32:46,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=443214.0, ans=0.1 2023-06-19 10:32:48,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.077e+02 3.595e+02 4.462e+02 6.932e+02, threshold=7.189e+02, percent-clipped=0.0 2023-06-19 10:33:23,130 INFO [train.py:996] (0/4) Epoch 3, batch 12900, loss[loss=0.3176, simple_loss=0.3803, pruned_loss=0.1275, over 20664.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3515, pruned_loss=0.1084, over 4275840.60 frames. 
], batch size: 607, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:33:39,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=443394.0, ans=0.95 2023-06-19 10:33:41,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=443394.0, ans=0.125 2023-06-19 10:33:53,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=443394.0, ans=0.125 2023-06-19 10:34:14,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=443454.0, ans=0.0 2023-06-19 10:34:54,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=443574.0, ans=0.125 2023-06-19 10:35:06,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=443574.0, ans=0.125 2023-06-19 10:35:09,453 INFO [train.py:996] (0/4) Epoch 3, batch 12950, loss[loss=0.3015, simple_loss=0.3686, pruned_loss=0.1172, over 21453.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3488, pruned_loss=0.1061, over 4276548.69 frames. ], batch size: 471, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:35:40,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=443694.0, ans=0.1 2023-06-19 10:35:48,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=443754.0, ans=0.125 2023-06-19 10:36:18,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=443814.0, ans=0.125 2023-06-19 10:36:23,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.936e+02 3.406e+02 4.166e+02 8.200e+02, threshold=6.811e+02, percent-clipped=3.0 2023-06-19 10:36:25,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-19 10:36:52,676 INFO [train.py:996] (0/4) Epoch 3, batch 13000, loss[loss=0.2488, simple_loss=0.3306, pruned_loss=0.08346, over 21583.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.348, pruned_loss=0.1058, over 4282115.55 frames. ], batch size: 441, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:36:56,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-19 10:37:05,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=443934.0, ans=0.1 2023-06-19 10:37:24,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=443994.0, ans=0.04949747468305833 2023-06-19 10:37:30,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. 
limit=15.0 2023-06-19 10:37:31,462 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:37:34,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=444054.0, ans=0.0 2023-06-19 10:37:42,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=444054.0, ans=0.2 2023-06-19 10:37:55,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=444114.0, ans=0.1 2023-06-19 10:38:18,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-19 10:38:32,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=444174.0, ans=0.0 2023-06-19 10:38:35,620 INFO [train.py:996] (0/4) Epoch 3, batch 13050, loss[loss=0.2773, simple_loss=0.333, pruned_loss=0.1108, over 21693.00 frames. ], tot_loss[loss=0.275, simple_loss=0.343, pruned_loss=0.1035, over 4284064.03 frames. ], batch size: 263, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:38:45,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=12.0 2023-06-19 10:39:26,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=444354.0, ans=0.125 2023-06-19 10:39:44,196 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-19 10:39:45,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=444414.0, ans=0.0 2023-06-19 10:39:49,648 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.662e+02 3.344e+02 3.864e+02 6.973e+02, threshold=6.687e+02, percent-clipped=1.0 2023-06-19 10:40:20,653 INFO [train.py:996] (0/4) Epoch 3, batch 13100, loss[loss=0.2925, simple_loss=0.3563, pruned_loss=0.1143, over 21360.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.345, pruned_loss=0.1043, over 4290462.70 frames. ], batch size: 159, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:40:50,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=444594.0, ans=15.0 2023-06-19 10:42:09,733 INFO [train.py:996] (0/4) Epoch 3, batch 13150, loss[loss=0.2728, simple_loss=0.3433, pruned_loss=0.1012, over 21742.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3456, pruned_loss=0.1069, over 4289796.82 frames. 
], batch size: 352, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:42:10,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=444834.0, ans=0.0 2023-06-19 10:42:27,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=444894.0, ans=0.07 2023-06-19 10:42:46,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=444894.0, ans=0.0 2023-06-19 10:42:46,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=444894.0, ans=0.0 2023-06-19 10:42:46,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=444894.0, ans=0.125 2023-06-19 10:42:53,760 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-19 10:43:10,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=444954.0, ans=0.125 2023-06-19 10:43:24,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 3.199e+02 3.918e+02 5.152e+02 8.520e+02, threshold=7.837e+02, percent-clipped=11.0 2023-06-19 10:43:27,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=445014.0, ans=0.125 2023-06-19 10:43:55,607 INFO [train.py:996] (0/4) Epoch 3, batch 13200, loss[loss=0.2905, simple_loss=0.3478, pruned_loss=0.1166, over 21661.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3444, pruned_loss=0.1072, over 4294073.31 frames. ], batch size: 230, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:44:01,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=445134.0, ans=12.0 2023-06-19 10:44:23,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=445194.0, ans=0.125 2023-06-19 10:45:04,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=445314.0, ans=0.125 2023-06-19 10:45:08,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=445314.0, ans=0.1 2023-06-19 10:45:33,385 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.83 vs. limit=6.0 2023-06-19 10:45:38,773 INFO [train.py:996] (0/4) Epoch 3, batch 13250, loss[loss=0.291, simple_loss=0.3635, pruned_loss=0.1092, over 21805.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3458, pruned_loss=0.109, over 4290068.80 frames. 
], batch size: 351, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:46:12,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=445494.0, ans=0.1 2023-06-19 10:46:57,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=445614.0, ans=0.1 2023-06-19 10:46:59,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.966e+02 3.440e+02 4.161e+02 7.151e+02, threshold=6.880e+02, percent-clipped=0.0 2023-06-19 10:47:15,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=445674.0, ans=0.0 2023-06-19 10:47:29,780 INFO [train.py:996] (0/4) Epoch 3, batch 13300, loss[loss=0.2576, simple_loss=0.332, pruned_loss=0.0916, over 21406.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3498, pruned_loss=0.109, over 4288038.98 frames. ], batch size: 194, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:47:49,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=445734.0, ans=0.125 2023-06-19 10:48:21,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=445854.0, ans=0.125 2023-06-19 10:48:22,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=445854.0, ans=0.125 2023-06-19 10:48:40,626 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=12.0 2023-06-19 10:49:13,524 INFO [train.py:996] (0/4) Epoch 3, batch 13350, loss[loss=0.336, simple_loss=0.4069, pruned_loss=0.1325, over 21623.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3545, pruned_loss=0.1129, over 4291758.64 frames. ], batch size: 414, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:49:48,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=446094.0, ans=0.2 2023-06-19 10:50:02,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=446154.0, ans=0.125 2023-06-19 10:50:20,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.205e+02 3.225e+02 3.868e+02 4.526e+02 7.710e+02, threshold=7.735e+02, percent-clipped=4.0 2023-06-19 10:50:55,999 INFO [train.py:996] (0/4) Epoch 3, batch 13400, loss[loss=0.3043, simple_loss=0.3624, pruned_loss=0.1231, over 21750.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.3557, pruned_loss=0.1146, over 4292209.51 frames. ], batch size: 351, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:51:04,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=446334.0, ans=10.0 2023-06-19 10:51:36,227 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-06-19 10:52:16,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=446574.0, ans=0.125 2023-06-19 10:52:46,486 INFO [train.py:996] (0/4) Epoch 3, batch 13450, loss[loss=0.2341, simple_loss=0.2914, pruned_loss=0.08838, over 21531.00 frames. 
], tot_loss[loss=0.2942, simple_loss=0.3559, pruned_loss=0.1162, over 4286597.17 frames. ], batch size: 230, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 10:52:49,056 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=22.5 2023-06-19 10:52:53,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=446634.0, ans=0.1 2023-06-19 10:52:58,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=446634.0, ans=0.0 2023-06-19 10:53:56,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 3.091e+02 3.617e+02 4.599e+02 7.916e+02, threshold=7.234e+02, percent-clipped=2.0 2023-06-19 10:54:22,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-19 10:54:31,223 INFO [train.py:996] (0/4) Epoch 3, batch 13500, loss[loss=0.3386, simple_loss=0.3931, pruned_loss=0.1421, over 21509.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3463, pruned_loss=0.1127, over 4286681.39 frames. ], batch size: 473, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 10:54:55,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=446994.0, ans=0.125 2023-06-19 10:56:14,925 INFO [train.py:996] (0/4) Epoch 3, batch 13550, loss[loss=0.2943, simple_loss=0.3735, pruned_loss=0.1076, over 21229.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3521, pruned_loss=0.1121, over 4283335.35 frames. ], batch size: 159, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 10:56:28,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=447234.0, ans=0.95 2023-06-19 10:56:49,803 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:57:11,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=447354.0, ans=0.2 2023-06-19 10:57:35,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=447414.0, ans=0.125 2023-06-19 10:57:36,377 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.954e+02 3.584e+02 4.667e+02 1.065e+03, threshold=7.167e+02, percent-clipped=1.0 2023-06-19 10:58:03,818 INFO [train.py:996] (0/4) Epoch 3, batch 13600, loss[loss=0.2896, simple_loss=0.3382, pruned_loss=0.1206, over 21675.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3532, pruned_loss=0.1121, over 4284491.84 frames. ], batch size: 230, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:58:19,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=447594.0, ans=0.0 2023-06-19 10:58:35,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=447654.0, ans=0.125 2023-06-19 10:59:14,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=447714.0, ans=0.125 2023-06-19 10:59:45,723 INFO [train.py:996] (0/4) Epoch 3, batch 13650, loss[loss=0.2737, simple_loss=0.323, pruned_loss=0.1122, over 21762.00 frames. 
], tot_loss[loss=0.2811, simple_loss=0.3469, pruned_loss=0.1077, over 4286864.42 frames. ], batch size: 371, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:59:52,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=447834.0, ans=0.0 2023-06-19 11:01:01,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 3.087e+02 4.228e+02 5.339e+02 1.090e+03, threshold=8.456e+02, percent-clipped=10.0 2023-06-19 11:01:28,431 INFO [train.py:996] (0/4) Epoch 3, batch 13700, loss[loss=0.2571, simple_loss=0.3155, pruned_loss=0.09933, over 21682.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.342, pruned_loss=0.1082, over 4288259.12 frames. ], batch size: 247, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:01:44,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=12.0 2023-06-19 11:01:49,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=448194.0, ans=0.1 2023-06-19 11:01:51,593 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-19 11:02:29,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=448254.0, ans=0.1 2023-06-19 11:03:11,626 INFO [train.py:996] (0/4) Epoch 3, batch 13750, loss[loss=0.2367, simple_loss=0.312, pruned_loss=0.0807, over 21692.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3393, pruned_loss=0.106, over 4281474.61 frames. ], batch size: 351, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:04:00,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=448554.0, ans=0.125 2023-06-19 11:04:23,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=448614.0, ans=0.125 2023-06-19 11:04:28,748 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-19 11:04:34,594 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 3.276e+02 4.043e+02 5.146e+02 9.090e+02, threshold=8.085e+02, percent-clipped=5.0 2023-06-19 11:04:56,213 INFO [train.py:996] (0/4) Epoch 3, batch 13800, loss[loss=0.2658, simple_loss=0.3598, pruned_loss=0.08587, over 21601.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3448, pruned_loss=0.1055, over 4274500.85 frames. ], batch size: 263, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:05:33,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.79 vs. 
limit=15.0 2023-06-19 11:05:45,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=448854.0, ans=0.0 2023-06-19 11:05:59,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=448854.0, ans=0.1 2023-06-19 11:06:04,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=448854.0, ans=0.125 2023-06-19 11:06:50,095 INFO [train.py:996] (0/4) Epoch 3, batch 13850, loss[loss=0.3185, simple_loss=0.3752, pruned_loss=0.1309, over 21447.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3502, pruned_loss=0.1065, over 4266113.91 frames. ], batch size: 194, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:07:32,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=449154.0, ans=0.125 2023-06-19 11:08:01,646 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 2.934e+02 3.600e+02 4.326e+02 8.652e+02, threshold=7.199e+02, percent-clipped=1.0 2023-06-19 11:08:03,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=449214.0, ans=0.125 2023-06-19 11:08:22,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=449274.0, ans=0.125 2023-06-19 11:08:32,527 INFO [train.py:996] (0/4) Epoch 3, batch 13900, loss[loss=0.3005, simple_loss=0.3574, pruned_loss=0.1218, over 21377.00 frames. ], tot_loss[loss=0.2884, simple_loss=0.3546, pruned_loss=0.1111, over 4266902.96 frames. ], batch size: 159, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:08:46,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=449334.0, ans=0.125 2023-06-19 11:08:49,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=449334.0, ans=0.2 2023-06-19 11:08:56,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=449394.0, ans=0.125 2023-06-19 11:09:20,171 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-19 11:09:30,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=449514.0, ans=0.5 2023-06-19 11:10:09,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=449574.0, ans=0.1 2023-06-19 11:10:12,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449574.0, ans=0.1 2023-06-19 11:10:15,227 INFO [train.py:996] (0/4) Epoch 3, batch 13950, loss[loss=0.2726, simple_loss=0.3361, pruned_loss=0.1046, over 21850.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3551, pruned_loss=0.1137, over 4276113.79 frames. 
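The `scaling.py:182` entries record the current value (`ans`) of a `ScheduledFloat` hyperparameter, such as a dropout probability, a skip rate, or a balancer bound, as a function of the global `batch_count`. The real implementation lives in the recipe's `scaling.py`; the class below is only a stand-in that mimics the behaviour visible in the log, a value interpolated piecewise-linearly between (batch_count, value) breakpoints and held constant outside them. The breakpoints used here are hypothetical.

```python
class PiecewiseLinearSchedule:
    """Illustrative stand-in for ScheduledFloat: piecewise-linear interpolation
    between (batch_count, value) breakpoints, clamped outside the given range."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs, e.g. (0.0, 0.3), (20000.0, 0.1)
        self.points = sorted(points)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return pts[-1][1]

# Hypothetical schedule: a dropout that decays to the 0.1 reported late in this
# log (e.g. encoder_embed.dropout.p, ans=0.1 at batch_count=446634.0).
dropout_p = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1))
print(dropout_p(0.0), dropout_p(446634.0))   # 0.3 0.1
```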
], batch size: 332, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:10:40,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=449694.0, ans=0.0 2023-06-19 11:10:48,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=449694.0, ans=0.5 2023-06-19 11:11:14,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449814.0, ans=0.1 2023-06-19 11:11:30,124 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 3.017e+02 3.528e+02 4.292e+02 7.550e+02, threshold=7.057e+02, percent-clipped=1.0 2023-06-19 11:11:51,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449874.0, ans=0.1 2023-06-19 11:11:55,753 INFO [train.py:996] (0/4) Epoch 3, batch 14000, loss[loss=0.2328, simple_loss=0.3134, pruned_loss=0.07613, over 21775.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3506, pruned_loss=0.1107, over 4269977.76 frames. ], batch size: 298, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:12:25,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=12.0 2023-06-19 11:12:43,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=450054.0, ans=0.0 2023-06-19 11:12:55,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=22.5 2023-06-19 11:13:34,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=450174.0, ans=0.07 2023-06-19 11:13:34,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=450174.0, ans=0.125 2023-06-19 11:13:36,934 INFO [train.py:996] (0/4) Epoch 3, batch 14050, loss[loss=0.2381, simple_loss=0.3009, pruned_loss=0.08769, over 21177.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3443, pruned_loss=0.1058, over 4278733.15 frames. ], batch size: 548, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:14:28,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=450354.0, ans=0.0 2023-06-19 11:14:34,124 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.64 vs. limit=15.0 2023-06-19 11:14:54,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.755e+02 3.797e+02 5.310e+02 9.461e+02, threshold=7.595e+02, percent-clipped=8.0 2023-06-19 11:15:02,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=450474.0, ans=0.125 2023-06-19 11:15:09,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=15.0 2023-06-19 11:15:09,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.37 vs. 
limit=15.0 2023-06-19 11:15:13,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=450474.0, ans=0.0 2023-06-19 11:15:24,583 INFO [train.py:996] (0/4) Epoch 3, batch 14100, loss[loss=0.2325, simple_loss=0.276, pruned_loss=0.0945, over 20230.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3388, pruned_loss=0.1051, over 4266225.96 frames. ], batch size: 703, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:15:30,471 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-06-19 11:15:53,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=450594.0, ans=10.0 2023-06-19 11:15:55,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=450594.0, ans=0.0 2023-06-19 11:15:55,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=450594.0, ans=0.125 2023-06-19 11:16:58,780 INFO [train.py:996] (0/4) Epoch 3, batch 14150, loss[loss=0.2601, simple_loss=0.3308, pruned_loss=0.09471, over 21856.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3418, pruned_loss=0.1058, over 4243589.96 frames. ], batch size: 98, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:17:05,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=450834.0, ans=0.0 2023-06-19 11:17:39,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=450954.0, ans=0.1 2023-06-19 11:18:14,090 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.701e+02 3.295e+02 4.437e+02 7.217e+02, threshold=6.589e+02, percent-clipped=0.0 2023-06-19 11:18:21,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=451074.0, ans=0.0 2023-06-19 11:18:25,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=451074.0, ans=0.1 2023-06-19 11:18:38,257 INFO [train.py:996] (0/4) Epoch 3, batch 14200, loss[loss=0.2832, simple_loss=0.327, pruned_loss=0.1197, over 21372.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3393, pruned_loss=0.1038, over 4249048.92 frames. ], batch size: 471, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:19:39,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=451314.0, ans=0.125 2023-06-19 11:20:20,553 INFO [train.py:996] (0/4) Epoch 3, batch 14250, loss[loss=0.2455, simple_loss=0.3251, pruned_loss=0.08297, over 21711.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3339, pruned_loss=0.1027, over 4240170.76 frames. ], batch size: 247, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:21:39,207 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.808e+02 3.716e+02 4.608e+02 1.130e+03, threshold=7.432e+02, percent-clipped=7.0 2023-06-19 11:21:58,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=451674.0, ans=0.2 2023-06-19 11:22:04,566 INFO [train.py:996] (0/4) Epoch 3, batch 14300, loss[loss=0.4411, simple_loss=0.5041, pruned_loss=0.189, over 21636.00 frames. 
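The `scaling.py:962` Whitening entries compare a per-module `metric` against a `limit` (for example `metric=11.64 vs. limit=22.5` for a self-attention whitener above); while the metric sits above its limit the module nudges the activations toward an isotropic, whitened covariance. The exact metric is defined in the recipe's `scaling.py`; the function below is only an illustrative proxy that equals 1.0 for perfectly whitened features and grows with anisotropy.

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """Illustrative whitening metric for a (num_frames, num_channels) feature batch.

    Returns 1.0 when the feature covariance is a multiple of the identity and a
    larger value the more anisotropic it is; compare against a limit as in the
    'metric=... vs. limit=...' messages. Not the exact definition in scaling.py.
    """
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]                      # (C, C) covariance estimate
    num_channels = cov.shape[0]
    # num_channels * sum(eig^2) / (sum(eig))^2 >= 1, equality iff cov ~ identity.
    return (num_channels * (cov ** 2).sum() / (cov.diag().sum() ** 2)).item()

if __name__ == "__main__":
    limit = 15.0                                        # a limit of the kind logged above
    feats = torch.randn(1000, 256) * torch.linspace(0.1, 2.0, 256)  # anisotropic channels
    print(f"metric={whitening_metric(feats):.2f} vs. limit={limit}")
```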
], tot_loss[loss=0.2718, simple_loss=0.3369, pruned_loss=0.1033, over 4241754.74 frames. ], batch size: 414, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:22:14,110 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=22.5 2023-06-19 11:22:51,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=451854.0, ans=0.125 2023-06-19 11:23:27,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=451974.0, ans=0.1 2023-06-19 11:23:46,813 INFO [train.py:996] (0/4) Epoch 3, batch 14350, loss[loss=0.2714, simple_loss=0.3438, pruned_loss=0.0995, over 21717.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3414, pruned_loss=0.1037, over 4248199.28 frames. ], batch size: 389, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:24:25,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.47 vs. limit=15.0 2023-06-19 11:24:26,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=452094.0, ans=0.125 2023-06-19 11:24:39,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=452154.0, ans=0.125 2023-06-19 11:25:04,030 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 3.074e+02 3.738e+02 4.858e+02 1.364e+03, threshold=7.476e+02, percent-clipped=9.0 2023-06-19 11:25:28,338 INFO [train.py:996] (0/4) Epoch 3, batch 14400, loss[loss=0.3458, simple_loss=0.3742, pruned_loss=0.1587, over 21770.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3425, pruned_loss=0.1058, over 4257557.59 frames. ], batch size: 508, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:26:54,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=452574.0, ans=0.0 2023-06-19 11:27:09,097 INFO [train.py:996] (0/4) Epoch 3, batch 14450, loss[loss=0.216, simple_loss=0.2579, pruned_loss=0.08709, over 20834.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3382, pruned_loss=0.1064, over 4254123.34 frames. ], batch size: 608, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:27:33,005 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.21 vs. limit=15.0 2023-06-19 11:27:53,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=452754.0, ans=0.0 2023-06-19 11:28:26,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.178e+02 3.835e+02 4.854e+02 8.477e+02, threshold=7.671e+02, percent-clipped=4.0 2023-06-19 11:28:51,119 INFO [train.py:996] (0/4) Epoch 3, batch 14500, loss[loss=0.2629, simple_loss=0.3243, pruned_loss=0.1008, over 21428.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3349, pruned_loss=0.1055, over 4252912.06 frames. 
], batch size: 131, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:28:59,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=452934.0, ans=0.125 2023-06-19 11:29:01,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=452934.0, ans=0.125 2023-06-19 11:29:29,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=452994.0, ans=0.125 2023-06-19 11:29:59,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=453114.0, ans=0.0 2023-06-19 11:30:30,432 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-19 11:30:34,351 INFO [train.py:996] (0/4) Epoch 3, batch 14550, loss[loss=0.3315, simple_loss=0.3895, pruned_loss=0.1368, over 21343.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3405, pruned_loss=0.1078, over 4262408.13 frames. ], batch size: 159, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:30:51,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=453234.0, ans=0.025 2023-06-19 11:31:40,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.74 vs. limit=22.5 2023-06-19 11:31:57,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.259e+02 4.086e+02 5.797e+02 9.548e+02, threshold=8.171e+02, percent-clipped=6.0 2023-06-19 11:31:57,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=453414.0, ans=0.125 2023-06-19 11:32:16,951 INFO [train.py:996] (0/4) Epoch 3, batch 14600, loss[loss=0.2788, simple_loss=0.323, pruned_loss=0.1173, over 20152.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3486, pruned_loss=0.1127, over 4269735.89 frames. ], batch size: 703, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:32:20,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=453534.0, ans=0.0 2023-06-19 11:32:30,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=453534.0, ans=0.0 2023-06-19 11:32:38,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=453534.0, ans=0.125 2023-06-19 11:32:53,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=453594.0, ans=0.125 2023-06-19 11:33:02,107 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0 2023-06-19 11:33:13,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=453654.0, ans=0.2 2023-06-19 11:33:16,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=453654.0, ans=0.125 2023-06-19 11:33:58,777 INFO [train.py:996] (0/4) Epoch 3, batch 14650, loss[loss=0.2494, simple_loss=0.2905, pruned_loss=0.1042, over 20986.00 frames. 
], tot_loss[loss=0.2883, simple_loss=0.3512, pruned_loss=0.1127, over 4263018.08 frames. ], batch size: 608, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:33:59,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=453834.0, ans=0.2 2023-06-19 11:33:59,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=453834.0, ans=0.125 2023-06-19 11:34:43,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=453954.0, ans=0.0 2023-06-19 11:35:02,985 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2023-06-19 11:35:24,898 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 2.524e+02 3.223e+02 4.029e+02 6.900e+02, threshold=6.446e+02, percent-clipped=0.0 2023-06-19 11:35:50,396 INFO [train.py:996] (0/4) Epoch 3, batch 14700, loss[loss=0.2326, simple_loss=0.3233, pruned_loss=0.07091, over 21738.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3437, pruned_loss=0.1051, over 4268127.58 frames. ], batch size: 298, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:36:22,034 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.02 vs. limit=6.0 2023-06-19 11:36:46,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=454254.0, ans=0.0 2023-06-19 11:36:55,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.08 vs. limit=22.5 2023-06-19 11:37:15,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=454374.0, ans=0.125 2023-06-19 11:37:31,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=454374.0, ans=0.2 2023-06-19 11:37:37,995 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-19 11:37:38,448 INFO [train.py:996] (0/4) Epoch 3, batch 14750, loss[loss=0.3632, simple_loss=0.4423, pruned_loss=0.142, over 21245.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3463, pruned_loss=0.1058, over 4273055.31 frames. ], batch size: 548, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:37:47,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=454434.0, ans=0.125 2023-06-19 11:37:58,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=454494.0, ans=0.125 2023-06-19 11:38:43,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=454614.0, ans=0.125 2023-06-19 11:38:51,302 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 3.078e+02 3.855e+02 4.833e+02 8.936e+02, threshold=7.710e+02, percent-clipped=7.0 2023-06-19 11:39:20,996 INFO [train.py:996] (0/4) Epoch 3, batch 14800, loss[loss=0.2854, simple_loss=0.3642, pruned_loss=0.1033, over 21698.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3575, pruned_loss=0.1122, over 4266133.46 frames. 
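The `optim.py:471` entries summarise the distribution of recent gradient norms: the five numbers appear to be the min / 25% / median / 75% / max of the norms seen since the last report, the `threshold` is `Clipping_scale` times the median (for instance 2.0 × 3.223e+02 = 6.446e+02 above), and `percent-clipped` is how often the norm exceeded that threshold. A rough, hypothetical sketch of that bookkeeping is below; the real rule is implemented inside the recipe's `optim.py` and may differ in detail.

```python
from collections import deque

import torch

class GradNormClipper:
    """Hypothetical helper reproducing the 'grad-norm quartiles ... threshold=...,
    percent-clipped=...' bookkeeping: track recent global grad norms, clip against
    clipping_scale * median, and report quartiles periodically."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 500):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.batches = 0
        self.clipped = 0

    def clip_(self, parameters) -> None:
        params = [p for p in parameters if p.grad is not None]
        if not params:
            return
        norm = torch.cat([p.grad.detach().flatten() for p in params]).norm().item()
        self.norms.append(norm)
        median = sorted(self.norms)[len(self.norms) // 2]
        threshold = self.clipping_scale * median
        self.batches += 1
        if norm > threshold:
            self.clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)

    def summary(self) -> str:
        qs = torch.quantile(torch.tensor(list(self.norms)),
                            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])).tolist()
        pct = 100.0 * self.clipped / max(1, self.batches)
        return (f"grad-norm quartiles {' '.join(f'{q:.3e}' for q in qs)}, "
                f"threshold={self.clipping_scale * qs[2]:.3e}, percent-clipped={pct:.1f}")
```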
], batch size: 282, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:39:36,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=454794.0, ans=0.2 2023-06-19 11:39:36,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=454794.0, ans=0.125 2023-06-19 11:39:48,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=454794.0, ans=0.0 2023-06-19 11:41:04,796 INFO [train.py:996] (0/4) Epoch 3, batch 14850, loss[loss=0.2463, simple_loss=0.2986, pruned_loss=0.09698, over 21532.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3505, pruned_loss=0.1119, over 4271532.95 frames. ], batch size: 263, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:42:23,065 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 3.130e+02 3.900e+02 4.785e+02 9.691e+02, threshold=7.799e+02, percent-clipped=2.0 2023-06-19 11:42:30,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=455274.0, ans=0.125 2023-06-19 11:42:47,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=455334.0, ans=0.0 2023-06-19 11:42:53,931 INFO [train.py:996] (0/4) Epoch 3, batch 14900, loss[loss=0.3489, simple_loss=0.3949, pruned_loss=0.1515, over 21671.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3519, pruned_loss=0.1129, over 4272607.63 frames. ], batch size: 351, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:43:04,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=455334.0, ans=0.2 2023-06-19 11:43:52,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=455514.0, ans=0.125 2023-06-19 11:44:09,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=455514.0, ans=0.0 2023-06-19 11:44:35,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=455634.0, ans=0.0 2023-06-19 11:44:36,807 INFO [train.py:996] (0/4) Epoch 3, batch 14950, loss[loss=0.2849, simple_loss=0.3548, pruned_loss=0.1075, over 21629.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3544, pruned_loss=0.1138, over 4265009.09 frames. 
], batch size: 230, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:44:55,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=455694.0, ans=0.125 2023-06-19 11:45:20,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=455754.0, ans=0.125 2023-06-19 11:45:44,467 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:45:45,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=455814.0, ans=0.0 2023-06-19 11:45:54,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=455814.0, ans=0.0 2023-06-19 11:45:55,116 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.136e+02 3.778e+02 4.659e+02 7.505e+02, threshold=7.556e+02, percent-clipped=0.0 2023-06-19 11:46:20,062 INFO [train.py:996] (0/4) Epoch 3, batch 15000, loss[loss=0.2936, simple_loss=0.3456, pruned_loss=0.1208, over 21808.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.357, pruned_loss=0.1154, over 4268426.45 frames. ], batch size: 282, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:46:20,063 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 11:46:36,892 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2722, simple_loss=0.3734, pruned_loss=0.08553, over 1796401.00 frames. 2023-06-19 11:46:36,893 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-19 11:47:03,977 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-76000.pt 2023-06-19 11:47:22,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=456054.0, ans=0.2 2023-06-19 11:47:22,988 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-19 11:47:44,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=456114.0, ans=0.125 2023-06-19 11:47:46,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=456114.0, ans=0.0 2023-06-19 11:47:58,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0 2023-06-19 11:48:26,274 INFO [train.py:996] (0/4) Epoch 3, batch 15050, loss[loss=0.2834, simple_loss=0.3659, pruned_loss=0.1005, over 21729.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3591, pruned_loss=0.1169, over 4275504.32 frames. ], batch size: 298, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:48:30,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=456234.0, ans=0.125 2023-06-19 11:48:34,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=456234.0, ans=0.125 2023-06-19 11:48:48,944 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.08 vs. 
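Besides the per-batch entries, `train.py` periodically switches to the dev set to compute a validation loss and writes a batch-indexed checkpoint (here `zipformer/exp_L_small_causal/checkpoint-76000.pt`). The toy loop below sketches only that control flow; the model, data, intervals and output directory are placeholders, not the Zipformer recipe itself.

```python
import logging
from pathlib import Path

import torch
import torch.nn as nn

def run_training(exp_dir: Path, valid_interval: int = 3000, save_every_n: int = 4000):
    """Toy sketch of the loop behind 'Computing validation loss' and
    'Saving checkpoint to ...'; every concrete detail here is a placeholder."""
    logging.basicConfig(level=logging.INFO)
    exp_dir.mkdir(parents=True, exist_ok=True)
    model = nn.Linear(80, 10)                              # stand-in for the transducer
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    batch_idx_train = 0

    for _ in range(2 * save_every_n):                      # pretend training batches
        loss = model(torch.randn(8, 80)).pow(2).mean()     # placeholder loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        batch_idx_train += 1

        if batch_idx_train % valid_interval == 0:
            logging.info("Computing validation loss")
            with torch.no_grad():
                valid = model(torch.randn(8, 80)).pow(2).mean()
            logging.info("validation: loss=%.4f", valid.item())

        if batch_idx_train % save_every_n == 0:
            ckpt = exp_dir / f"checkpoint-{batch_idx_train}.pt"
            logging.info("Saving checkpoint to %s", ckpt)
            torch.save({"model": model.state_dict(),
                        "batch_idx_train": batch_idx_train}, ckpt)

if __name__ == "__main__":
    run_training(Path("exp_sketch"))
```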
limit=15.0 2023-06-19 11:49:07,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=456354.0, ans=0.1 2023-06-19 11:49:17,988 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-06-19 11:49:43,744 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 3.179e+02 3.857e+02 4.850e+02 8.474e+02, threshold=7.714e+02, percent-clipped=3.0 2023-06-19 11:49:44,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=456414.0, ans=0.1 2023-06-19 11:50:08,223 INFO [train.py:996] (0/4) Epoch 3, batch 15100, loss[loss=0.2928, simple_loss=0.3574, pruned_loss=0.1142, over 21719.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3619, pruned_loss=0.1164, over 4275403.10 frames. ], batch size: 332, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:50:25,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=456534.0, ans=0.0 2023-06-19 11:50:25,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=456534.0, ans=0.0 2023-06-19 11:50:40,711 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.64 vs. limit=10.0 2023-06-19 11:50:54,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=456654.0, ans=0.125 2023-06-19 11:51:07,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-19 11:51:12,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.76 vs. limit=15.0 2023-06-19 11:51:34,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=456774.0, ans=0.125 2023-06-19 11:51:53,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=456774.0, ans=0.1 2023-06-19 11:51:56,406 INFO [train.py:996] (0/4) Epoch 3, batch 15150, loss[loss=0.3303, simple_loss=0.3539, pruned_loss=0.1533, over 21244.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3606, pruned_loss=0.1176, over 4268564.42 frames. ], batch size: 471, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:53:13,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 3.143e+02 3.638e+02 4.416e+02 6.792e+02, threshold=7.275e+02, percent-clipped=0.0 2023-06-19 11:53:38,829 INFO [train.py:996] (0/4) Epoch 3, batch 15200, loss[loss=0.2315, simple_loss=0.286, pruned_loss=0.08848, over 21746.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3493, pruned_loss=0.1128, over 4270723.29 frames. 
], batch size: 112, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:53:41,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457134.0, ans=0.1 2023-06-19 11:54:25,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=457254.0, ans=0.2 2023-06-19 11:54:35,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=457254.0, ans=0.125 2023-06-19 11:54:53,029 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:55:20,841 INFO [train.py:996] (0/4) Epoch 3, batch 15250, loss[loss=0.2863, simple_loss=0.3407, pruned_loss=0.116, over 15070.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3423, pruned_loss=0.1096, over 4254270.63 frames. ], batch size: 61, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:55:25,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=15.0 2023-06-19 11:55:52,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=457494.0, ans=0.0 2023-06-19 11:56:43,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.938e+02 3.532e+02 4.223e+02 6.837e+02, threshold=7.064e+02, percent-clipped=0.0 2023-06-19 11:57:08,080 INFO [train.py:996] (0/4) Epoch 3, batch 15300, loss[loss=0.3323, simple_loss=0.3745, pruned_loss=0.145, over 21434.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3455, pruned_loss=0.1127, over 4253363.63 frames. ], batch size: 471, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:57:57,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457854.0, ans=0.1 2023-06-19 11:58:10,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=457914.0, ans=0.0 2023-06-19 11:58:27,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=457974.0, ans=0.2 2023-06-19 11:58:30,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=457974.0, ans=0.0 2023-06-19 11:58:30,870 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-19 11:58:31,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=457974.0, ans=0.0 2023-06-19 11:58:44,350 INFO [train.py:996] (0/4) Epoch 3, batch 15350, loss[loss=0.3862, simple_loss=0.4378, pruned_loss=0.1673, over 21394.00 frames. ], tot_loss[loss=0.2922, simple_loss=0.3517, pruned_loss=0.1163, over 4261226.73 frames. 
], batch size: 507, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:58:54,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=458034.0, ans=0.125 2023-06-19 11:59:43,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=458214.0, ans=0.125 2023-06-19 11:59:54,551 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 3.004e+02 3.619e+02 4.702e+02 1.047e+03, threshold=7.238e+02, percent-clipped=6.0 2023-06-19 12:00:24,067 INFO [train.py:996] (0/4) Epoch 3, batch 15400, loss[loss=0.226, simple_loss=0.2728, pruned_loss=0.08962, over 20710.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3498, pruned_loss=0.1131, over 4255404.10 frames. ], batch size: 609, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:00:45,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-06-19 12:00:49,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=458394.0, ans=0.125 2023-06-19 12:00:58,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=458394.0, ans=0.2 2023-06-19 12:01:26,400 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.84 vs. limit=22.5 2023-06-19 12:01:29,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=458514.0, ans=0.5 2023-06-19 12:01:48,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=458574.0, ans=0.125 2023-06-19 12:01:59,903 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.05 vs. limit=5.0 2023-06-19 12:02:05,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=458634.0, ans=0.1 2023-06-19 12:02:06,925 INFO [train.py:996] (0/4) Epoch 3, batch 15450, loss[loss=0.2959, simple_loss=0.3909, pruned_loss=0.1004, over 19738.00 frames. ], tot_loss[loss=0.2861, simple_loss=0.3472, pruned_loss=0.1125, over 4260896.56 frames. ], batch size: 703, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:02:47,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458694.0, ans=0.1 2023-06-19 12:02:55,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=458754.0, ans=0.0 2023-06-19 12:03:23,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.889e+02 3.450e+02 3.978e+02 6.262e+02, threshold=6.899e+02, percent-clipped=0.0 2023-06-19 12:03:32,681 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-19 12:03:45,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=458874.0, ans=0.125 2023-06-19 12:03:54,365 INFO [train.py:996] (0/4) Epoch 3, batch 15500, loss[loss=0.2969, simple_loss=0.3584, pruned_loss=0.1178, over 21830.00 frames. 
], tot_loss[loss=0.2883, simple_loss=0.3514, pruned_loss=0.1126, over 4262266.22 frames. ], batch size: 282, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:04:21,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=458994.0, ans=0.125 2023-06-19 12:04:22,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=458994.0, ans=0.0 2023-06-19 12:04:41,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=459054.0, ans=0.125 2023-06-19 12:05:26,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=459174.0, ans=0.1 2023-06-19 12:05:35,318 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-19 12:05:37,233 INFO [train.py:996] (0/4) Epoch 3, batch 15550, loss[loss=0.2495, simple_loss=0.3232, pruned_loss=0.08793, over 21798.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3513, pruned_loss=0.1103, over 4264761.30 frames. ], batch size: 371, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:05:49,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=459234.0, ans=0.0 2023-06-19 12:06:00,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=459294.0, ans=0.1 2023-06-19 12:06:23,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=459354.0, ans=0.125 2023-06-19 12:06:45,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=459414.0, ans=0.025 2023-06-19 12:06:54,191 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 2.997e+02 3.470e+02 4.241e+02 8.422e+02, threshold=6.941e+02, percent-clipped=1.0 2023-06-19 12:07:17,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=459534.0, ans=0.1 2023-06-19 12:07:18,544 INFO [train.py:996] (0/4) Epoch 3, batch 15600, loss[loss=0.2492, simple_loss=0.3199, pruned_loss=0.08926, over 21724.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3453, pruned_loss=0.1081, over 4269386.57 frames. 
], batch size: 351, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:07:33,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=459534.0, ans=0.1 2023-06-19 12:07:41,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=459594.0, ans=0.2 2023-06-19 12:07:50,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=459594.0, ans=0.2 2023-06-19 12:08:17,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=459654.0, ans=0.125 2023-06-19 12:08:25,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=459714.0, ans=0.2 2023-06-19 12:08:27,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=459714.0, ans=0.0 2023-06-19 12:08:32,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-19 12:09:06,287 INFO [train.py:996] (0/4) Epoch 3, batch 15650, loss[loss=0.2653, simple_loss=0.3204, pruned_loss=0.1051, over 21324.00 frames. ], tot_loss[loss=0.2799, simple_loss=0.3444, pruned_loss=0.1077, over 4269564.74 frames. ], batch size: 160, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:09:14,979 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:09:23,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=459894.0, ans=0.125 2023-06-19 12:09:29,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2023-06-19 12:09:34,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=459894.0, ans=0.125 2023-06-19 12:09:37,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=459954.0, ans=0.125 2023-06-19 12:09:48,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=459954.0, ans=0.125 2023-06-19 12:10:22,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.856e+02 3.445e+02 4.569e+02 7.529e+02, threshold=6.891e+02, percent-clipped=2.0 2023-06-19 12:10:32,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.71 vs. limit=15.0 2023-06-19 12:10:47,572 INFO [train.py:996] (0/4) Epoch 3, batch 15700, loss[loss=0.2426, simple_loss=0.2943, pruned_loss=0.09544, over 21621.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3398, pruned_loss=0.1067, over 4267721.39 frames. 
], batch size: 298, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:10:51,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=460134.0, ans=0.0 2023-06-19 12:11:07,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=460194.0, ans=0.0 2023-06-19 12:11:52,218 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-19 12:12:28,123 INFO [train.py:996] (0/4) Epoch 3, batch 15750, loss[loss=0.2661, simple_loss=0.3269, pruned_loss=0.1027, over 21601.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3347, pruned_loss=0.1059, over 4269230.64 frames. ], batch size: 247, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:12:28,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=460434.0, ans=0.0 2023-06-19 12:13:27,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-19 12:13:46,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.862e+02 3.455e+02 4.105e+02 6.683e+02, threshold=6.910e+02, percent-clipped=1.0 2023-06-19 12:14:09,465 INFO [train.py:996] (0/4) Epoch 3, batch 15800, loss[loss=0.2632, simple_loss=0.3168, pruned_loss=0.1048, over 21663.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3299, pruned_loss=0.1049, over 4273975.85 frames. ], batch size: 332, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:14:12,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=460734.0, ans=0.125 2023-06-19 12:14:13,540 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:14:26,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=460794.0, ans=0.1 2023-06-19 12:14:47,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=460854.0, ans=0.0 2023-06-19 12:15:35,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=460974.0, ans=0.0 2023-06-19 12:15:52,347 INFO [train.py:996] (0/4) Epoch 3, batch 15850, loss[loss=0.3042, simple_loss=0.3555, pruned_loss=0.1265, over 21935.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3334, pruned_loss=0.1083, over 4261467.43 frames. ], batch size: 317, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:15:58,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=461034.0, ans=0.0 2023-06-19 12:16:12,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0 2023-06-19 12:16:14,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. 
limit=10.0 2023-06-19 12:17:09,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=461214.0, ans=0.0 2023-06-19 12:17:10,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 2.908e+02 3.639e+02 4.203e+02 7.869e+02, threshold=7.277e+02, percent-clipped=1.0 2023-06-19 12:17:33,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=461334.0, ans=0.0 2023-06-19 12:17:35,036 INFO [train.py:996] (0/4) Epoch 3, batch 15900, loss[loss=0.2669, simple_loss=0.3058, pruned_loss=0.114, over 21514.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3301, pruned_loss=0.1081, over 4272130.16 frames. ], batch size: 212, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:17:47,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-06-19 12:17:52,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=461394.0, ans=0.0 2023-06-19 12:17:53,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=461394.0, ans=0.125 2023-06-19 12:18:03,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=461394.0, ans=0.05 2023-06-19 12:18:03,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=461394.0, ans=0.0 2023-06-19 12:18:25,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=461454.0, ans=10.0 2023-06-19 12:18:49,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=461514.0, ans=0.125 2023-06-19 12:19:17,510 INFO [train.py:996] (0/4) Epoch 3, batch 15950, loss[loss=0.236, simple_loss=0.3125, pruned_loss=0.07974, over 21212.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3306, pruned_loss=0.1044, over 4259018.77 frames. ], batch size: 159, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:20:36,057 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.793e+02 3.209e+02 4.219e+02 1.070e+03, threshold=6.418e+02, percent-clipped=5.0 2023-06-19 12:20:59,877 INFO [train.py:996] (0/4) Epoch 3, batch 16000, loss[loss=0.2483, simple_loss=0.3383, pruned_loss=0.07919, over 21788.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3319, pruned_loss=0.1022, over 4262694.43 frames. ], batch size: 351, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:21:00,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=461934.0, ans=0.125 2023-06-19 12:21:13,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.54 vs. limit=10.0 2023-06-19 12:21:19,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=461994.0, ans=0.2 2023-06-19 12:21:22,137 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. 
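The `grad_scale` field in the loss lines reflects the dynamic loss scale of the fp16 (mixed-precision) training loop; the scaler grows the scale when gradients stay finite for a while and backs it off on overflow, which is why it moves between values such as 16.0 and 32.0 in this stretch of the log. A minimal sketch of that mechanism using PyTorch's `GradScaler` is below; the model and data are placeholders and a CUDA device is assumed.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Minimal fp16 loop whose dynamic loss scale corresponds to the 'grad_scale'
# field in the log; model and data are placeholders, CUDA is assumed.
model = torch.nn.Linear(80, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()

for step in range(200):
    x = torch.randn(8, 80, device="cuda")
    optimizer.zero_grad()
    with autocast():                      # run forward/loss in fp16 where safe
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()         # backward on the scaled loss
    scaler.step(optimizer)                # unscales grads, skips the step on inf/nan
    scaler.update()                       # grows or shrinks the scale dynamically
    if step % 50 == 0:
        print("grad_scale:", scaler.get_scale())
```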
limit=6.0 2023-06-19 12:21:33,212 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:21:37,133 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.37 vs. limit=6.0 2023-06-19 12:21:37,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-19 12:22:29,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=462174.0, ans=0.125 2023-06-19 12:22:36,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=462174.0, ans=0.1 2023-06-19 12:22:42,091 INFO [train.py:996] (0/4) Epoch 3, batch 16050, loss[loss=0.2456, simple_loss=0.3301, pruned_loss=0.0805, over 21422.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3352, pruned_loss=0.09984, over 4266711.10 frames. ], batch size: 194, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:22:50,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462234.0, ans=0.1 2023-06-19 12:23:02,452 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-06-19 12:23:08,661 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.65 vs. limit=22.5 2023-06-19 12:23:40,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=462414.0, ans=0.125 2023-06-19 12:23:42,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=462414.0, ans=0.0 2023-06-19 12:23:52,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462414.0, ans=0.1 2023-06-19 12:24:00,115 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 3.084e+02 3.532e+02 4.519e+02 7.240e+02, threshold=7.063e+02, percent-clipped=4.0 2023-06-19 12:24:20,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=462474.0, ans=0.2 2023-06-19 12:24:23,203 INFO [train.py:996] (0/4) Epoch 3, batch 16100, loss[loss=0.2949, simple_loss=0.3468, pruned_loss=0.1215, over 21874.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3405, pruned_loss=0.1025, over 4278809.40 frames. ], batch size: 351, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:25:32,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=462774.0, ans=0.025 2023-06-19 12:25:35,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=462774.0, ans=0.0 2023-06-19 12:25:43,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=462774.0, ans=0.125 2023-06-19 12:25:57,552 INFO [train.py:996] (0/4) Epoch 3, batch 16150, loss[loss=0.2757, simple_loss=0.325, pruned_loss=0.1132, over 21315.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3425, pruned_loss=0.1052, over 4283447.19 frames. 
], batch size: 159, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:25:58,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=462834.0, ans=0.0 2023-06-19 12:25:59,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=462834.0, ans=0.0 2023-06-19 12:26:15,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=462894.0, ans=0.125 2023-06-19 12:26:16,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=462894.0, ans=0.125 2023-06-19 12:26:21,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-19 12:27:07,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=463014.0, ans=0.125 2023-06-19 12:27:07,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=463014.0, ans=0.125 2023-06-19 12:27:16,611 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 2.948e+02 3.427e+02 4.312e+02 9.423e+02, threshold=6.854e+02, percent-clipped=2.0 2023-06-19 12:27:18,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=463074.0, ans=0.2 2023-06-19 12:27:39,985 INFO [train.py:996] (0/4) Epoch 3, batch 16200, loss[loss=0.2405, simple_loss=0.3205, pruned_loss=0.08024, over 21629.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.3461, pruned_loss=0.106, over 4283211.33 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:27:43,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=463134.0, ans=0.125 2023-06-19 12:28:15,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=463254.0, ans=0.0 2023-06-19 12:28:49,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=463314.0, ans=0.125 2023-06-19 12:28:56,995 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.82 vs. limit=8.0 2023-06-19 12:29:21,923 INFO [train.py:996] (0/4) Epoch 3, batch 16250, loss[loss=0.2132, simple_loss=0.2853, pruned_loss=0.07052, over 21663.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.3454, pruned_loss=0.1064, over 4277472.78 frames. 
], batch size: 298, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:30:12,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=463554.0, ans=0.2 2023-06-19 12:30:45,999 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.795e+02 3.241e+02 4.405e+02 7.562e+02, threshold=6.482e+02, percent-clipped=2.0 2023-06-19 12:30:57,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=463674.0, ans=0.1 2023-06-19 12:30:59,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=463674.0, ans=0.0 2023-06-19 12:30:59,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=463674.0, ans=0.125 2023-06-19 12:30:59,839 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.89 vs. limit=12.0 2023-06-19 12:31:03,317 INFO [train.py:996] (0/4) Epoch 3, batch 16300, loss[loss=0.2164, simple_loss=0.2821, pruned_loss=0.07538, over 21280.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3374, pruned_loss=0.1017, over 4264875.44 frames. ], batch size: 176, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:31:58,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=463914.0, ans=0.5 2023-06-19 12:32:37,032 INFO [train.py:996] (0/4) Epoch 3, batch 16350, loss[loss=0.3206, simple_loss=0.3894, pruned_loss=0.1259, over 21775.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3374, pruned_loss=0.1029, over 4263861.13 frames. ], batch size: 118, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:32:39,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.59 vs. limit=6.0 2023-06-19 12:34:02,469 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 3.064e+02 3.648e+02 5.135e+02 1.076e+03, threshold=7.296e+02, percent-clipped=9.0 2023-06-19 12:34:14,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=464274.0, ans=0.125 2023-06-19 12:34:18,670 INFO [train.py:996] (0/4) Epoch 3, batch 16400, loss[loss=0.268, simple_loss=0.3548, pruned_loss=0.09057, over 21331.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3418, pruned_loss=0.1049, over 4261739.30 frames. ], batch size: 548, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:35:07,362 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:36:00,619 INFO [train.py:996] (0/4) Epoch 3, batch 16450, loss[loss=0.3298, simple_loss=0.3752, pruned_loss=0.1422, over 20707.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3413, pruned_loss=0.1059, over 4268339.24 frames. ], batch size: 607, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:36:01,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=464634.0, ans=0.1 2023-06-19 12:36:03,397 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.01 vs. 
limit=12.0 2023-06-19 12:36:17,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=464694.0, ans=0.125 2023-06-19 12:37:08,218 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-19 12:37:14,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=464814.0, ans=0.0 2023-06-19 12:37:21,753 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 3.011e+02 3.468e+02 3.986e+02 7.351e+02, threshold=6.935e+02, percent-clipped=1.0 2023-06-19 12:37:38,575 INFO [train.py:996] (0/4) Epoch 3, batch 16500, loss[loss=0.24, simple_loss=0.3139, pruned_loss=0.08307, over 21788.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3419, pruned_loss=0.1066, over 4269044.79 frames. ], batch size: 332, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:37:57,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=464994.0, ans=0.1 2023-06-19 12:38:07,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=464994.0, ans=0.2 2023-06-19 12:38:25,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=465054.0, ans=0.125 2023-06-19 12:38:47,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=465114.0, ans=0.0 2023-06-19 12:38:51,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=10.0 2023-06-19 12:39:15,969 INFO [train.py:996] (0/4) Epoch 3, batch 16550, loss[loss=0.2662, simple_loss=0.3575, pruned_loss=0.08748, over 20890.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3367, pruned_loss=0.1026, over 4270699.75 frames. ], batch size: 608, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:39:46,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=465294.0, ans=0.09899494936611666 2023-06-19 12:40:07,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=465354.0, ans=0.125 2023-06-19 12:40:42,242 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 3.124e+02 3.787e+02 4.304e+02 9.133e+02, threshold=7.574e+02, percent-clipped=3.0 2023-06-19 12:41:09,209 INFO [train.py:996] (0/4) Epoch 3, batch 16600, loss[loss=0.2351, simple_loss=0.3397, pruned_loss=0.0652, over 20782.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3462, pruned_loss=0.1063, over 4273656.87 frames. 
], batch size: 608, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:41:56,324 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:42:15,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=465714.0, ans=0.125 2023-06-19 12:42:37,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=465774.0, ans=0.125 2023-06-19 12:42:59,031 INFO [train.py:996] (0/4) Epoch 3, batch 16650, loss[loss=0.3158, simple_loss=0.3845, pruned_loss=0.1235, over 21546.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.358, pruned_loss=0.1095, over 4271431.98 frames. ], batch size: 230, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:43:08,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=465834.0, ans=0.1 2023-06-19 12:43:30,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=465894.0, ans=0.2 2023-06-19 12:43:35,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=465894.0, ans=0.0 2023-06-19 12:43:37,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=465954.0, ans=0.125 2023-06-19 12:43:58,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=466014.0, ans=0.125 2023-06-19 12:44:07,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=466014.0, ans=0.5 2023-06-19 12:44:07,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=466014.0, ans=0.125 2023-06-19 12:44:27,306 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.247e+02 3.781e+02 4.657e+02 6.369e+02, threshold=7.563e+02, percent-clipped=0.0 2023-06-19 12:44:27,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=466074.0, ans=0.1 2023-06-19 12:44:49,107 INFO [train.py:996] (0/4) Epoch 3, batch 16700, loss[loss=0.2077, simple_loss=0.2634, pruned_loss=0.07605, over 21879.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3572, pruned_loss=0.1096, over 4267284.93 frames. ], batch size: 98, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:45:13,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=466194.0, ans=0.0 2023-06-19 12:45:55,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=466314.0, ans=0.125 2023-06-19 12:46:35,201 INFO [train.py:996] (0/4) Epoch 3, batch 16750, loss[loss=0.2251, simple_loss=0.2714, pruned_loss=0.08943, over 21683.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3597, pruned_loss=0.1126, over 4271046.16 frames. 
], batch size: 112, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:46:54,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=466434.0, ans=0.125 2023-06-19 12:47:12,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=466494.0, ans=0.125 2023-06-19 12:48:01,823 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.898e+02 3.370e+02 4.211e+02 9.702e+02, threshold=6.740e+02, percent-clipped=1.0 2023-06-19 12:48:22,835 INFO [train.py:996] (0/4) Epoch 3, batch 16800, loss[loss=0.2681, simple_loss=0.321, pruned_loss=0.1076, over 21303.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3653, pruned_loss=0.1131, over 4263874.78 frames. ], batch size: 176, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:49:01,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=466794.0, ans=0.125 2023-06-19 12:49:43,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=466974.0, ans=0.05 2023-06-19 12:49:47,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-19 12:50:04,944 INFO [train.py:996] (0/4) Epoch 3, batch 16850, loss[loss=0.2735, simple_loss=0.329, pruned_loss=0.109, over 21894.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3616, pruned_loss=0.1131, over 4265826.74 frames. ], batch size: 371, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:50:23,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=467034.0, ans=0.2 2023-06-19 12:50:24,136 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2023-06-19 12:50:25,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=467094.0, ans=0.2 2023-06-19 12:51:02,003 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=22.5 2023-06-19 12:51:06,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=467154.0, ans=0.125 2023-06-19 12:51:25,342 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.898e+02 3.370e+02 4.330e+02 9.168e+02, threshold=6.739e+02, percent-clipped=5.0 2023-06-19 12:51:41,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=467274.0, ans=10.0 2023-06-19 12:51:45,966 INFO [train.py:996] (0/4) Epoch 3, batch 16900, loss[loss=0.2714, simple_loss=0.3268, pruned_loss=0.108, over 21628.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3563, pruned_loss=0.1119, over 4272297.87 frames. ], batch size: 414, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:52:58,271 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.69 vs. limit=10.0 2023-06-19 12:53:26,677 INFO [train.py:996] (0/4) Epoch 3, batch 16950, loss[loss=0.271, simple_loss=0.3224, pruned_loss=0.1098, over 21161.00 frames. 
], tot_loss[loss=0.2833, simple_loss=0.3473, pruned_loss=0.1097, over 4269025.37 frames. ], batch size: 608, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:53:46,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=467694.0, ans=0.125 2023-06-19 12:53:55,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.05 vs. limit=10.0 2023-06-19 12:54:44,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=467814.0, ans=0.0 2023-06-19 12:54:46,699 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.834e+02 3.306e+02 3.951e+02 5.809e+02, threshold=6.612e+02, percent-clipped=0.0 2023-06-19 12:54:48,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=467874.0, ans=0.125 2023-06-19 12:55:08,168 INFO [train.py:996] (0/4) Epoch 3, batch 17000, loss[loss=0.2654, simple_loss=0.3243, pruned_loss=0.1032, over 21712.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3446, pruned_loss=0.1104, over 4278772.10 frames. ], batch size: 230, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:55:08,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=467934.0, ans=0.1 2023-06-19 12:55:08,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=467934.0, ans=0.1 2023-06-19 12:55:12,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=467934.0, ans=0.125 2023-06-19 12:56:03,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=468054.0, ans=0.0 2023-06-19 12:56:11,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=468114.0, ans=0.1 2023-06-19 12:56:14,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=468114.0, ans=0.1 2023-06-19 12:56:23,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=468114.0, ans=0.1 2023-06-19 12:56:40,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=468174.0, ans=0.125 2023-06-19 12:56:40,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=468174.0, ans=0.125 2023-06-19 12:56:42,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-06-19 12:56:49,433 INFO [train.py:996] (0/4) Epoch 3, batch 17050, loss[loss=0.2875, simple_loss=0.3566, pruned_loss=0.1091, over 21169.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3533, pruned_loss=0.1141, over 4288103.52 frames. 
], batch size: 159, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:57:52,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=468414.0, ans=0.125 2023-06-19 12:58:14,526 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.418e+02 3.024e+02 3.438e+02 4.032e+02 7.555e+02, threshold=6.877e+02, percent-clipped=1.0 2023-06-19 12:58:30,302 INFO [train.py:996] (0/4) Epoch 3, batch 17100, loss[loss=0.2725, simple_loss=0.3314, pruned_loss=0.1068, over 21338.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3521, pruned_loss=0.115, over 4280543.90 frames. ], batch size: 159, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:59:33,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=468714.0, ans=0.0 2023-06-19 12:59:38,365 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-19 13:00:11,713 INFO [train.py:996] (0/4) Epoch 3, batch 17150, loss[loss=0.2704, simple_loss=0.32, pruned_loss=0.1105, over 21576.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.347, pruned_loss=0.1142, over 4291917.51 frames. ], batch size: 548, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:00:58,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=468894.0, ans=0.125 2023-06-19 13:01:38,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 2.874e+02 3.285e+02 3.849e+02 6.375e+02, threshold=6.570e+02, percent-clipped=0.0 2023-06-19 13:02:09,774 INFO [train.py:996] (0/4) Epoch 3, batch 17200, loss[loss=0.3198, simple_loss=0.3802, pruned_loss=0.1297, over 19981.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3477, pruned_loss=0.1141, over 4287782.64 frames. ], batch size: 702, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:02:43,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=469254.0, ans=0.2 2023-06-19 13:02:46,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=469254.0, ans=0.125 2023-06-19 13:03:20,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=469314.0, ans=0.125 2023-06-19 13:03:53,540 INFO [train.py:996] (0/4) Epoch 3, batch 17250, loss[loss=0.3222, simple_loss=0.3861, pruned_loss=0.1292, over 21661.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3511, pruned_loss=0.1159, over 4280566.95 frames. 
], batch size: 263, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:03:54,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=469434.0, ans=0.04949747468305833 2023-06-19 13:04:41,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=469554.0, ans=0.1 2023-06-19 13:05:06,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=469614.0, ans=0.125 2023-06-19 13:05:20,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.292e+02 3.993e+02 5.117e+02 9.442e+02, threshold=7.987e+02, percent-clipped=7.0 2023-06-19 13:05:22,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=469674.0, ans=0.125 2023-06-19 13:05:37,161 INFO [train.py:996] (0/4) Epoch 3, batch 17300, loss[loss=0.3111, simple_loss=0.3674, pruned_loss=0.1274, over 21816.00 frames. ], tot_loss[loss=0.3005, simple_loss=0.3604, pruned_loss=0.1203, over 4281062.71 frames. ], batch size: 124, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:05:37,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=469734.0, ans=0.0 2023-06-19 13:05:41,663 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=15.0 2023-06-19 13:05:43,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-19 13:05:48,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=469734.0, ans=0.0 2023-06-19 13:06:28,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=469854.0, ans=0.125 2023-06-19 13:07:15,556 INFO [train.py:996] (0/4) Epoch 3, batch 17350, loss[loss=0.2381, simple_loss=0.3141, pruned_loss=0.08109, over 21724.00 frames. ], tot_loss[loss=0.3029, simple_loss=0.3634, pruned_loss=0.1212, over 4286036.78 frames. ], batch size: 247, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:07:22,726 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:07:25,068 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.18 vs. limit=15.0 2023-06-19 13:08:10,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=470154.0, ans=0.125 2023-06-19 13:08:26,826 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. 
limit=15.0 2023-06-19 13:08:27,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=470214.0, ans=0.1 2023-06-19 13:08:29,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=470214.0, ans=0.07 2023-06-19 13:08:42,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.889e+02 3.414e+02 4.320e+02 8.908e+02, threshold=6.829e+02, percent-clipped=3.0 2023-06-19 13:08:44,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5 2023-06-19 13:08:46,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=470274.0, ans=0.1 2023-06-19 13:08:58,919 INFO [train.py:996] (0/4) Epoch 3, batch 17400, loss[loss=0.3351, simple_loss=0.4072, pruned_loss=0.1314, over 21456.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3594, pruned_loss=0.1166, over 4280500.51 frames. ], batch size: 471, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:09:17,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=470334.0, ans=0.0 2023-06-19 13:09:50,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=470394.0, ans=0.125 2023-06-19 13:09:58,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=470454.0, ans=0.125 2023-06-19 13:10:01,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=470454.0, ans=0.125 2023-06-19 13:10:28,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=470574.0, ans=0.125 2023-06-19 13:10:47,916 INFO [train.py:996] (0/4) Epoch 3, batch 17450, loss[loss=0.2405, simple_loss=0.306, pruned_loss=0.08747, over 21248.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3523, pruned_loss=0.1115, over 4276566.21 frames. ], batch size: 159, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:10:53,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=470634.0, ans=0.2 2023-06-19 13:11:26,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=15.0 2023-06-19 13:11:32,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=470694.0, ans=0.125 2023-06-19 13:11:33,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=470754.0, ans=0.125 2023-06-19 13:12:06,942 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.874e+02 3.534e+02 4.725e+02 8.315e+02, threshold=7.067e+02, percent-clipped=5.0 2023-06-19 13:12:27,753 INFO [train.py:996] (0/4) Epoch 3, batch 17500, loss[loss=0.2751, simple_loss=0.3331, pruned_loss=0.1086, over 21853.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3458, pruned_loss=0.1075, over 4275401.74 frames. 
], batch size: 414, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:13:01,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-19 13:13:26,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=471054.0, ans=0.125 2023-06-19 13:13:45,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=471174.0, ans=0.0 2023-06-19 13:14:07,334 INFO [train.py:996] (0/4) Epoch 3, batch 17550, loss[loss=0.2552, simple_loss=0.3332, pruned_loss=0.08857, over 21759.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3443, pruned_loss=0.1053, over 4258023.03 frames. ], batch size: 112, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:14:09,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=471234.0, ans=0.125 2023-06-19 13:15:25,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471474.0, ans=0.1 2023-06-19 13:15:26,293 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.757e+02 3.626e+02 4.370e+02 8.420e+02, threshold=7.252e+02, percent-clipped=1.0 2023-06-19 13:15:48,156 INFO [train.py:996] (0/4) Epoch 3, batch 17600, loss[loss=0.3198, simple_loss=0.3827, pruned_loss=0.1284, over 21607.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3478, pruned_loss=0.1066, over 4267805.90 frames. ], batch size: 389, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:16:09,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=471534.0, ans=0.1 2023-06-19 13:16:14,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=471594.0, ans=0.2 2023-06-19 13:16:35,865 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-19 13:17:35,488 INFO [train.py:996] (0/4) Epoch 3, batch 17650, loss[loss=0.2893, simple_loss=0.3556, pruned_loss=0.1115, over 21514.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3469, pruned_loss=0.1071, over 4259349.02 frames. ], batch size: 509, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:18:55,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=472074.0, ans=0.04949747468305833 2023-06-19 13:18:56,007 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.888e+02 3.326e+02 4.505e+02 7.697e+02, threshold=6.651e+02, percent-clipped=2.0 2023-06-19 13:19:06,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=472074.0, ans=0.125 2023-06-19 13:19:17,609 INFO [train.py:996] (0/4) Epoch 3, batch 17700, loss[loss=0.3384, simple_loss=0.3983, pruned_loss=0.1392, over 21571.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3401, pruned_loss=0.1035, over 4256678.14 frames. 
], batch size: 414, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:19:29,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=472134.0, ans=0.5 2023-06-19 13:19:40,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472134.0, ans=0.1 2023-06-19 13:19:43,105 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:20:10,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=472254.0, ans=0.2 2023-06-19 13:20:13,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=472254.0, ans=0.2 2023-06-19 13:21:10,375 INFO [train.py:996] (0/4) Epoch 3, batch 17750, loss[loss=0.3154, simple_loss=0.3828, pruned_loss=0.124, over 21983.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3483, pruned_loss=0.108, over 4264115.92 frames. ], batch size: 317, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:22:04,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472614.0, ans=0.1 2023-06-19 13:22:32,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.975e+02 3.463e+02 4.383e+02 8.374e+02, threshold=6.927e+02, percent-clipped=5.0 2023-06-19 13:22:45,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=472674.0, ans=0.2 2023-06-19 13:22:54,301 INFO [train.py:996] (0/4) Epoch 3, batch 17800, loss[loss=0.2676, simple_loss=0.3406, pruned_loss=0.09733, over 21723.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3465, pruned_loss=0.1064, over 4251934.38 frames. ], batch size: 332, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:23:27,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=472794.0, ans=0.125 2023-06-19 13:23:55,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=472914.0, ans=0.125 2023-06-19 13:24:26,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=22.5 2023-06-19 13:24:36,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=473034.0, ans=0.0 2023-06-19 13:24:36,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=473034.0, ans=0.0 2023-06-19 13:24:37,173 INFO [train.py:996] (0/4) Epoch 3, batch 17850, loss[loss=0.2987, simple_loss=0.3699, pruned_loss=0.1137, over 20710.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3495, pruned_loss=0.1085, over 4258866.68 frames. 
], batch size: 607, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:24:59,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=473094.0, ans=0.125 2023-06-19 13:25:00,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=473094.0, ans=0.0 2023-06-19 13:25:16,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=473094.0, ans=0.0 2023-06-19 13:25:25,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=473154.0, ans=0.125 2023-06-19 13:25:51,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=473214.0, ans=0.125 2023-06-19 13:25:54,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=473214.0, ans=0.0 2023-06-19 13:25:59,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=473214.0, ans=0.125 2023-06-19 13:25:59,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=473214.0, ans=0.125 2023-06-19 13:26:02,591 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 3.042e+02 3.981e+02 5.013e+02 8.666e+02, threshold=7.962e+02, percent-clipped=5.0 2023-06-19 13:26:18,594 INFO [train.py:996] (0/4) Epoch 3, batch 17900, loss[loss=0.2707, simple_loss=0.3369, pruned_loss=0.1022, over 21362.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3557, pruned_loss=0.1116, over 4261412.77 frames. ], batch size: 131, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:26:54,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=473394.0, ans=0.0 2023-06-19 13:28:02,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=473574.0, ans=0.0 2023-06-19 13:28:06,427 INFO [train.py:996] (0/4) Epoch 3, batch 17950, loss[loss=0.2603, simple_loss=0.3482, pruned_loss=0.08621, over 21634.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3535, pruned_loss=0.1065, over 4258644.14 frames. ], batch size: 389, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:28:26,214 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.68 vs. limit=22.5 2023-06-19 13:28:26,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=473694.0, ans=0.125 2023-06-19 13:29:03,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-19 13:29:14,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.66 vs. 
limit=15.0 2023-06-19 13:29:20,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=473814.0, ans=0.125 2023-06-19 13:29:22,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=473814.0, ans=0.125 2023-06-19 13:29:26,498 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.695e+02 3.486e+02 4.539e+02 1.017e+03, threshold=6.972e+02, percent-clipped=4.0 2023-06-19 13:29:40,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=473874.0, ans=0.125 2023-06-19 13:29:47,386 INFO [train.py:996] (0/4) Epoch 3, batch 18000, loss[loss=0.2468, simple_loss=0.3291, pruned_loss=0.08222, over 20775.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3467, pruned_loss=0.1047, over 4260086.67 frames. ], batch size: 607, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:29:47,387 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 13:30:01,106 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.5110, 4.3295, 2.2291, 2.6715], device='cuda:0') 2023-06-19 13:30:08,401 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2748, simple_loss=0.3795, pruned_loss=0.08502, over 1796401.00 frames. 2023-06-19 13:30:08,401 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-19 13:30:33,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=15.0 2023-06-19 13:31:47,974 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-19 13:31:49,943 INFO [train.py:996] (0/4) Epoch 3, batch 18050, loss[loss=0.2404, simple_loss=0.2979, pruned_loss=0.09145, over 21540.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3401, pruned_loss=0.1025, over 4257710.24 frames. ], batch size: 230, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:32:24,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=474294.0, ans=0.125 2023-06-19 13:33:10,554 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 3.274e+02 3.707e+02 4.391e+02 9.006e+02, threshold=7.414e+02, percent-clipped=2.0 2023-06-19 13:33:32,305 INFO [train.py:996] (0/4) Epoch 3, batch 18100, loss[loss=0.3083, simple_loss=0.3864, pruned_loss=0.1151, over 21825.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3446, pruned_loss=0.1048, over 4263500.42 frames. 
], batch size: 372, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:33:54,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=474534.0, ans=0.125 2023-06-19 13:34:01,008 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:34:45,541 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:35:00,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=474774.0, ans=0.125 2023-06-19 13:35:18,759 INFO [train.py:996] (0/4) Epoch 3, batch 18150, loss[loss=0.2554, simple_loss=0.3498, pruned_loss=0.0805, over 20706.00 frames. ], tot_loss[loss=0.2786, simple_loss=0.347, pruned_loss=0.1051, over 4257074.71 frames. ], batch size: 607, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:35:27,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=474834.0, ans=0.125 2023-06-19 13:35:42,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=474894.0, ans=0.0 2023-06-19 13:36:11,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=475014.0, ans=0.125 2023-06-19 13:36:31,890 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.411e+02 3.099e+02 3.760e+02 4.824e+02 9.400e+02, threshold=7.520e+02, percent-clipped=8.0 2023-06-19 13:36:52,796 INFO [train.py:996] (0/4) Epoch 3, batch 18200, loss[loss=0.2699, simple_loss=0.3174, pruned_loss=0.1112, over 21767.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3412, pruned_loss=0.1048, over 4253496.36 frames. ], batch size: 300, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:37:06,769 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-19 13:37:56,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=475314.0, ans=0.125 2023-06-19 13:37:56,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=475314.0, ans=0.1 2023-06-19 13:38:17,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=475374.0, ans=0.125 2023-06-19 13:38:32,016 INFO [train.py:996] (0/4) Epoch 3, batch 18250, loss[loss=0.3492, simple_loss=0.409, pruned_loss=0.1447, over 19982.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3324, pruned_loss=0.1013, over 4252389.75 frames. ], batch size: 702, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:39:38,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=475614.0, ans=0.125 2023-06-19 13:39:45,976 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.479e+02 3.020e+02 3.989e+02 8.042e+02, threshold=6.040e+02, percent-clipped=2.0 2023-06-19 13:39:48,298 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:40:06,835 INFO [train.py:996] (0/4) Epoch 3, batch 18300, loss[loss=0.2515, simple_loss=0.3112, pruned_loss=0.0959, over 21848.00 frames. 
], tot_loss[loss=0.2693, simple_loss=0.3338, pruned_loss=0.1024, over 4262407.87 frames. ], batch size: 118, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:40:07,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=475734.0, ans=0.125 2023-06-19 13:41:43,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2023-06-19 13:41:46,935 INFO [train.py:996] (0/4) Epoch 3, batch 18350, loss[loss=0.3078, simple_loss=0.368, pruned_loss=0.1238, over 21597.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3383, pruned_loss=0.1023, over 4247939.42 frames. ], batch size: 414, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:43:08,848 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.933e+02 3.430e+02 4.228e+02 7.523e+02, threshold=6.860e+02, percent-clipped=6.0 2023-06-19 13:43:14,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=476274.0, ans=0.125 2023-06-19 13:43:28,057 INFO [train.py:996] (0/4) Epoch 3, batch 18400, loss[loss=0.2274, simple_loss=0.2986, pruned_loss=0.07808, over 21285.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3333, pruned_loss=0.1012, over 4252694.72 frames. ], batch size: 176, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:43:37,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=476334.0, ans=0.0 2023-06-19 13:44:00,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=22.5 2023-06-19 13:44:06,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=476394.0, ans=0.0 2023-06-19 13:45:02,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=476574.0, ans=0.5 2023-06-19 13:45:08,781 INFO [train.py:996] (0/4) Epoch 3, batch 18450, loss[loss=0.2209, simple_loss=0.3033, pruned_loss=0.06927, over 21896.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3276, pruned_loss=0.096, over 4245659.71 frames. ], batch size: 373, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:45:12,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=476634.0, ans=0.125 2023-06-19 13:46:17,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=476814.0, ans=0.0 2023-06-19 13:46:31,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.812e+02 3.346e+02 4.382e+02 1.092e+03, threshold=6.692e+02, percent-clipped=3.0 2023-06-19 13:46:41,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0 2023-06-19 13:46:44,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=476874.0, ans=0.0 2023-06-19 13:46:49,838 INFO [train.py:996] (0/4) Epoch 3, batch 18500, loss[loss=0.2276, simple_loss=0.2811, pruned_loss=0.08699, over 21398.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3222, pruned_loss=0.09448, over 4242155.19 frames. 
], batch size: 144, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:46:54,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=22.5 2023-06-19 13:46:55,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=476934.0, ans=10.0 2023-06-19 13:47:30,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=476994.0, ans=0.125 2023-06-19 13:47:37,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=22.5 2023-06-19 13:47:46,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=477054.0, ans=0.125 2023-06-19 13:47:59,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=477114.0, ans=0.125 2023-06-19 13:48:08,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=477174.0, ans=0.125 2023-06-19 13:48:23,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=477174.0, ans=0.07 2023-06-19 13:48:30,920 INFO [train.py:996] (0/4) Epoch 3, batch 18550, loss[loss=0.2473, simple_loss=0.2956, pruned_loss=0.0995, over 21313.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3212, pruned_loss=0.09385, over 4242396.00 frames. ], batch size: 160, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:49:06,430 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:49:18,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=477354.0, ans=0.125 2023-06-19 13:49:27,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=477354.0, ans=0.0 2023-06-19 13:49:59,813 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 3.099e+02 3.527e+02 4.215e+02 7.049e+02, threshold=7.053e+02, percent-clipped=1.0 2023-06-19 13:50:13,156 INFO [train.py:996] (0/4) Epoch 3, batch 18600, loss[loss=0.2327, simple_loss=0.2915, pruned_loss=0.0869, over 21812.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3217, pruned_loss=0.09558, over 4240938.08 frames. ], batch size: 118, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:50:22,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-19 13:51:16,327 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:51:59,738 INFO [train.py:996] (0/4) Epoch 3, batch 18650, loss[loss=0.2439, simple_loss=0.2916, pruned_loss=0.09811, over 20337.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3211, pruned_loss=0.09565, over 4251927.96 frames. 
], batch size: 703, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:53:20,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=478074.0, ans=0.0 2023-06-19 13:53:21,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.016e+02 3.610e+02 4.241e+02 7.263e+02, threshold=7.220e+02, percent-clipped=2.0 2023-06-19 13:53:23,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=478074.0, ans=0.1 2023-06-19 13:53:33,776 INFO [train.py:996] (0/4) Epoch 3, batch 18700, loss[loss=0.2554, simple_loss=0.3143, pruned_loss=0.09829, over 21803.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3191, pruned_loss=0.09796, over 4250535.72 frames. ], batch size: 298, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:53:49,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.53 vs. limit=15.0 2023-06-19 13:55:15,239 INFO [train.py:996] (0/4) Epoch 3, batch 18750, loss[loss=0.3346, simple_loss=0.3976, pruned_loss=0.1358, over 21643.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3223, pruned_loss=0.1011, over 4265233.70 frames. ], batch size: 414, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:55:26,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=478434.0, ans=0.125 2023-06-19 13:55:30,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=478434.0, ans=0.125 2023-06-19 13:56:25,610 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-19 13:56:43,456 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.016e+02 3.473e+02 4.351e+02 6.634e+02, threshold=6.946e+02, percent-clipped=0.0 2023-06-19 13:56:47,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=478674.0, ans=0.125 2023-06-19 13:56:56,596 INFO [train.py:996] (0/4) Epoch 3, batch 18800, loss[loss=0.2048, simple_loss=0.2874, pruned_loss=0.06107, over 21581.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3293, pruned_loss=0.1032, over 4263367.21 frames. ], batch size: 230, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:57:36,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=478794.0, ans=0.2 2023-06-19 13:57:50,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=15.0 2023-06-19 13:58:19,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=478914.0, ans=0.125 2023-06-19 13:58:36,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=478974.0, ans=0.0 2023-06-19 13:58:44,204 INFO [train.py:996] (0/4) Epoch 3, batch 18850, loss[loss=0.2368, simple_loss=0.3023, pruned_loss=0.08561, over 21743.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.325, pruned_loss=0.0969, over 4264174.99 frames. 
], batch size: 316, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:59:02,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=479034.0, ans=0.125 2023-06-19 13:59:40,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=479154.0, ans=0.125 2023-06-19 13:59:43,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=479214.0, ans=0.125 2023-06-19 13:59:54,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=479214.0, ans=0.125 2023-06-19 14:00:08,741 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.738e+02 3.222e+02 4.135e+02 8.390e+02, threshold=6.445e+02, percent-clipped=2.0 2023-06-19 14:00:12,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=479274.0, ans=0.0 2023-06-19 14:00:24,834 INFO [train.py:996] (0/4) Epoch 3, batch 18900, loss[loss=0.265, simple_loss=0.3083, pruned_loss=0.1108, over 21566.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3229, pruned_loss=0.09787, over 4266733.03 frames. ], batch size: 230, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 14:00:28,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=479334.0, ans=0.1 2023-06-19 14:02:01,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=479574.0, ans=0.125 2023-06-19 14:02:07,452 INFO [train.py:996] (0/4) Epoch 3, batch 18950, loss[loss=0.2603, simple_loss=0.314, pruned_loss=0.1032, over 21689.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3239, pruned_loss=0.1003, over 4272884.58 frames. ], batch size: 230, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 14:02:22,877 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.64 vs. limit=10.0 2023-06-19 14:02:42,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=479694.0, ans=0.0 2023-06-19 14:02:57,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=479754.0, ans=0.125 2023-06-19 14:03:07,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=479754.0, ans=0.2 2023-06-19 14:03:12,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=479814.0, ans=0.125 2023-06-19 14:03:25,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=479814.0, ans=0.125 2023-06-19 14:03:40,005 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.897e+02 3.488e+02 4.402e+02 6.601e+02, threshold=6.976e+02, percent-clipped=2.0 2023-06-19 14:03:56,946 INFO [train.py:996] (0/4) Epoch 3, batch 19000, loss[loss=0.2999, simple_loss=0.36, pruned_loss=0.1199, over 21775.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3332, pruned_loss=0.1013, over 4272311.56 frames. 
], batch size: 298, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 14:04:18,136 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-80000.pt 2023-06-19 14:05:05,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=480114.0, ans=0.125 2023-06-19 14:05:26,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=480174.0, ans=0.125 2023-06-19 14:05:39,670 INFO [train.py:996] (0/4) Epoch 3, batch 19050, loss[loss=0.2775, simple_loss=0.3335, pruned_loss=0.1107, over 21959.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3396, pruned_loss=0.1065, over 4278451.21 frames. ], batch size: 113, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:05:46,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=480234.0, ans=0.125 2023-06-19 14:06:41,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=480414.0, ans=0.1 2023-06-19 14:06:46,859 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-19 14:07:04,471 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.242e+02 3.668e+02 4.263e+02 6.635e+02, threshold=7.336e+02, percent-clipped=0.0 2023-06-19 14:07:19,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=480474.0, ans=0.07 2023-06-19 14:07:21,798 INFO [train.py:996] (0/4) Epoch 3, batch 19100, loss[loss=0.2502, simple_loss=0.2973, pruned_loss=0.1015, over 21345.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3385, pruned_loss=0.1081, over 4278927.88 frames. ], batch size: 211, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:07:27,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=480534.0, ans=0.0 2023-06-19 14:07:27,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=480534.0, ans=0.125 2023-06-19 14:07:49,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=480594.0, ans=0.125 2023-06-19 14:08:20,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=480714.0, ans=0.0 2023-06-19 14:08:25,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=480714.0, ans=0.0 2023-06-19 14:08:56,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.41 vs. limit=6.0 2023-06-19 14:08:59,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=480774.0, ans=0.125 2023-06-19 14:08:59,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=480774.0, ans=0.1 2023-06-19 14:09:11,242 INFO [train.py:996] (0/4) Epoch 3, batch 19150, loss[loss=0.2473, simple_loss=0.3362, pruned_loss=0.07923, over 21426.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3403, pruned_loss=0.1092, over 4280278.53 frames. 
], batch size: 211, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:09:13,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=480834.0, ans=0.0 2023-06-19 14:09:19,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=480834.0, ans=0.0 2023-06-19 14:09:19,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=480834.0, ans=0.125 2023-06-19 14:09:29,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=480894.0, ans=0.0 2023-06-19 14:09:44,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=480894.0, ans=0.015 2023-06-19 14:10:08,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=480954.0, ans=0.125 2023-06-19 14:10:43,718 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 3.009e+02 3.597e+02 4.510e+02 7.028e+02, threshold=7.194e+02, percent-clipped=0.0 2023-06-19 14:10:49,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=481074.0, ans=0.0 2023-06-19 14:10:55,139 INFO [train.py:996] (0/4) Epoch 3, batch 19200, loss[loss=0.3629, simple_loss=0.4448, pruned_loss=0.1405, over 21626.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3492, pruned_loss=0.1089, over 4276783.53 frames. ], batch size: 441, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:11:35,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-19 14:12:35,859 INFO [train.py:996] (0/4) Epoch 3, batch 19250, loss[loss=0.2361, simple_loss=0.3151, pruned_loss=0.07852, over 21641.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3456, pruned_loss=0.1022, over 4265130.91 frames. ], batch size: 230, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:12:44,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=481434.0, ans=0.125 2023-06-19 14:13:30,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=481614.0, ans=0.2 2023-06-19 14:13:54,157 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 2.436e+02 3.016e+02 3.592e+02 9.679e+02, threshold=6.032e+02, percent-clipped=2.0 2023-06-19 14:14:10,932 INFO [train.py:996] (0/4) Epoch 3, batch 19300, loss[loss=0.2412, simple_loss=0.2994, pruned_loss=0.0915, over 21408.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3426, pruned_loss=0.1026, over 4273795.75 frames. ], batch size: 131, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:14:13,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=481734.0, ans=0.2 2023-06-19 14:14:43,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=481794.0, ans=0.125 2023-06-19 14:15:00,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. 
limit=12.0 2023-06-19 14:15:25,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.58 vs. limit=10.0 2023-06-19 14:15:46,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=481974.0, ans=0.0 2023-06-19 14:15:54,303 INFO [train.py:996] (0/4) Epoch 3, batch 19350, loss[loss=0.1999, simple_loss=0.2765, pruned_loss=0.06167, over 21579.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3362, pruned_loss=0.09746, over 4268882.89 frames. ], batch size: 195, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:15:56,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=482034.0, ans=0.125 2023-06-19 14:16:57,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=482214.0, ans=0.0 2023-06-19 14:17:13,372 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.752e+02 3.460e+02 4.444e+02 7.574e+02, threshold=6.920e+02, percent-clipped=6.0 2023-06-19 14:17:24,667 INFO [train.py:996] (0/4) Epoch 3, batch 19400, loss[loss=0.2791, simple_loss=0.3472, pruned_loss=0.1055, over 21846.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3316, pruned_loss=0.09512, over 4272647.90 frames. ], batch size: 118, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:17:25,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=482334.0, ans=0.1 2023-06-19 14:17:49,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=482394.0, ans=0.125 2023-06-19 14:18:39,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=482514.0, ans=0.2 2023-06-19 14:19:05,677 INFO [train.py:996] (0/4) Epoch 3, batch 19450, loss[loss=0.2388, simple_loss=0.2907, pruned_loss=0.09345, over 21498.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3312, pruned_loss=0.09853, over 4280756.83 frames. ], batch size: 212, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:19:14,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=482634.0, ans=0.0 2023-06-19 14:19:50,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=482754.0, ans=0.125 2023-06-19 14:20:37,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.381e+02 3.021e+02 3.528e+02 4.324e+02 6.786e+02, threshold=7.055e+02, percent-clipped=0.0 2023-06-19 14:20:52,487 INFO [train.py:996] (0/4) Epoch 3, batch 19500, loss[loss=0.2529, simple_loss=0.3243, pruned_loss=0.09077, over 21808.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3271, pruned_loss=0.1002, over 4284643.14 frames. 
], batch size: 372, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:20:56,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=482934.0, ans=0.125 2023-06-19 14:21:07,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=482994.0, ans=10.0 2023-06-19 14:21:54,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=483114.0, ans=0.125 2023-06-19 14:22:18,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-19 14:22:34,952 INFO [train.py:996] (0/4) Epoch 3, batch 19550, loss[loss=0.2336, simple_loss=0.3169, pruned_loss=0.0752, over 21838.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3239, pruned_loss=0.09818, over 4282254.99 frames. ], batch size: 371, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:22:46,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=483234.0, ans=0.125 2023-06-19 14:23:08,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=483354.0, ans=0.0 2023-06-19 14:23:22,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-19 14:23:49,793 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.25 vs. limit=15.0 2023-06-19 14:24:06,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.887e+02 3.715e+02 4.750e+02 9.269e+02, threshold=7.430e+02, percent-clipped=4.0 2023-06-19 14:24:16,418 INFO [train.py:996] (0/4) Epoch 3, batch 19600, loss[loss=0.3233, simple_loss=0.3681, pruned_loss=0.1392, over 21764.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.328, pruned_loss=0.1008, over 4287566.43 frames. ], batch size: 389, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:24:23,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=483534.0, ans=0.1 2023-06-19 14:24:33,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=483594.0, ans=0.125 2023-06-19 14:25:25,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=483714.0, ans=0.2 2023-06-19 14:25:34,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=483714.0, ans=0.1 2023-06-19 14:25:54,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=483774.0, ans=0.125 2023-06-19 14:25:55,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=483774.0, ans=0.125 2023-06-19 14:25:58,814 INFO [train.py:996] (0/4) Epoch 3, batch 19650, loss[loss=0.2786, simple_loss=0.3419, pruned_loss=0.1077, over 21691.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3337, pruned_loss=0.1051, over 4287282.51 frames. 
], batch size: 389, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:26:11,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=483834.0, ans=0.1 2023-06-19 14:27:34,300 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.382e+02 2.992e+02 3.430e+02 3.953e+02 7.302e+02, threshold=6.859e+02, percent-clipped=0.0 2023-06-19 14:27:44,524 INFO [train.py:996] (0/4) Epoch 3, batch 19700, loss[loss=0.2439, simple_loss=0.2951, pruned_loss=0.09638, over 21217.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3393, pruned_loss=0.1074, over 4288422.29 frames. ], batch size: 143, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:27:55,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=484134.0, ans=0.1 2023-06-19 14:28:09,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=484134.0, ans=0.125 2023-06-19 14:28:11,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=484194.0, ans=0.125 2023-06-19 14:29:31,150 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-19 14:29:33,043 INFO [train.py:996] (0/4) Epoch 3, batch 19750, loss[loss=0.2822, simple_loss=0.3592, pruned_loss=0.1026, over 21590.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3484, pruned_loss=0.1093, over 4281613.27 frames. ], batch size: 230, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:29:57,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=484494.0, ans=0.1 2023-06-19 14:29:58,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=12.0 2023-06-19 14:31:05,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.206e+02 3.832e+02 4.660e+02 9.927e+02, threshold=7.664e+02, percent-clipped=2.0 2023-06-19 14:31:15,134 INFO [train.py:996] (0/4) Epoch 3, batch 19800, loss[loss=0.1825, simple_loss=0.2186, pruned_loss=0.07327, over 16581.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3487, pruned_loss=0.1095, over 4282771.29 frames. ], batch size: 60, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:31:39,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=484734.0, ans=0.1 2023-06-19 14:31:45,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=484794.0, ans=0.125 2023-06-19 14:31:47,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=484794.0, ans=0.2 2023-06-19 14:32:56,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=484974.0, ans=0.125 2023-06-19 14:33:03,144 INFO [train.py:996] (0/4) Epoch 3, batch 19850, loss[loss=0.2194, simple_loss=0.3033, pruned_loss=0.06775, over 21616.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3411, pruned_loss=0.1039, over 4283686.31 frames. 
], batch size: 263, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:33:18,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=485034.0, ans=0.125 2023-06-19 14:33:38,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=485094.0, ans=0.5 2023-06-19 14:33:43,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=485094.0, ans=0.125 2023-06-19 14:33:47,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=485154.0, ans=0.2 2023-06-19 14:33:49,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=485154.0, ans=0.0 2023-06-19 14:34:29,612 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.663e+02 3.192e+02 3.932e+02 5.931e+02, threshold=6.384e+02, percent-clipped=0.0 2023-06-19 14:34:45,250 INFO [train.py:996] (0/4) Epoch 3, batch 19900, loss[loss=0.1798, simple_loss=0.2482, pruned_loss=0.0557, over 16401.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3381, pruned_loss=0.09922, over 4270936.93 frames. ], batch size: 60, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:35:00,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=485334.0, ans=0.0 2023-06-19 14:35:14,819 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:35:17,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=485394.0, ans=0.1 2023-06-19 14:36:05,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=485574.0, ans=0.0 2023-06-19 14:36:33,306 INFO [train.py:996] (0/4) Epoch 3, batch 19950, loss[loss=0.2685, simple_loss=0.3195, pruned_loss=0.1087, over 21681.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3319, pruned_loss=0.09848, over 4260449.10 frames. ], batch size: 333, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:36:56,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=485694.0, ans=0.2 2023-06-19 14:37:00,604 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.95 vs. 
limit=5.0 2023-06-19 14:37:20,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=485754.0, ans=0.0 2023-06-19 14:37:35,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=485814.0, ans=0.125 2023-06-19 14:37:40,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=485814.0, ans=0.0 2023-06-19 14:37:59,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.893e+02 3.575e+02 4.384e+02 6.859e+02, threshold=7.149e+02, percent-clipped=1.0 2023-06-19 14:38:00,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=485874.0, ans=0.125 2023-06-19 14:38:14,240 INFO [train.py:996] (0/4) Epoch 3, batch 20000, loss[loss=0.2887, simple_loss=0.3561, pruned_loss=0.1107, over 21806.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3327, pruned_loss=0.0993, over 4261832.92 frames. ], batch size: 414, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:38:25,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=485934.0, ans=0.2 2023-06-19 14:38:27,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485934.0, ans=0.1 2023-06-19 14:38:35,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=485994.0, ans=0.04949747468305833 2023-06-19 14:38:52,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=486054.0, ans=0.125 2023-06-19 14:39:24,182 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:39:54,959 INFO [train.py:996] (0/4) Epoch 3, batch 20050, loss[loss=0.2498, simple_loss=0.314, pruned_loss=0.0928, over 21639.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.335, pruned_loss=0.1023, over 4273964.17 frames. ], batch size: 230, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:40:27,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=486294.0, ans=0.2 2023-06-19 14:40:33,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-19 14:41:03,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=486414.0, ans=0.09899494936611666 2023-06-19 14:41:28,462 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.834e+02 3.316e+02 3.890e+02 7.458e+02, threshold=6.631e+02, percent-clipped=1.0 2023-06-19 14:41:30,976 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2023-06-19 14:41:31,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=486474.0, ans=0.1 2023-06-19 14:41:38,318 INFO [train.py:996] (0/4) Epoch 3, batch 20100, loss[loss=0.2821, simple_loss=0.3661, pruned_loss=0.09902, over 21075.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3389, pruned_loss=0.1057, over 4279960.30 frames. 
], batch size: 607, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:41:38,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=486534.0, ans=0.125 2023-06-19 14:43:27,791 INFO [train.py:996] (0/4) Epoch 3, batch 20150, loss[loss=0.3359, simple_loss=0.3879, pruned_loss=0.1419, over 21466.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3496, pruned_loss=0.1105, over 4276392.58 frames. ], batch size: 194, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:43:40,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=486834.0, ans=0.2 2023-06-19 14:43:41,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=486834.0, ans=0.125 2023-06-19 14:44:07,317 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-19 14:44:38,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.27 vs. limit=6.0 2023-06-19 14:45:05,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.534e+02 3.281e+02 3.907e+02 5.074e+02 8.084e+02, threshold=7.814e+02, percent-clipped=7.0 2023-06-19 14:45:13,862 INFO [train.py:996] (0/4) Epoch 3, batch 20200, loss[loss=0.4232, simple_loss=0.4827, pruned_loss=0.1819, over 21442.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3564, pruned_loss=0.1141, over 4274490.39 frames. ], batch size: 507, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:45:45,565 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-19 14:45:56,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=487254.0, ans=0.125 2023-06-19 14:46:21,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.11 vs. limit=10.0 2023-06-19 14:47:01,508 INFO [train.py:996] (0/4) Epoch 3, batch 20250, loss[loss=0.2723, simple_loss=0.3441, pruned_loss=0.1002, over 21844.00 frames. ], tot_loss[loss=0.2883, simple_loss=0.3554, pruned_loss=0.1106, over 4281210.36 frames. ], batch size: 316, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:47:14,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.11 vs. limit=22.5 2023-06-19 14:47:21,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=487494.0, ans=0.1 2023-06-19 14:47:54,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=487554.0, ans=0.2 2023-06-19 14:48:29,769 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.865e+02 3.489e+02 4.461e+02 6.612e+02, threshold=6.978e+02, percent-clipped=0.0 2023-06-19 14:48:43,479 INFO [train.py:996] (0/4) Epoch 3, batch 20300, loss[loss=0.2601, simple_loss=0.331, pruned_loss=0.09463, over 21552.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.352, pruned_loss=0.1072, over 4275179.70 frames. 
], batch size: 212, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:48:44,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-19 14:48:45,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=487734.0, ans=0.2 2023-06-19 14:48:51,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=487734.0, ans=0.125 2023-06-19 14:49:23,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=487794.0, ans=0.09899494936611666 2023-06-19 14:49:24,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=487854.0, ans=0.125 2023-06-19 14:49:34,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=487854.0, ans=0.125 2023-06-19 14:49:54,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=487914.0, ans=0.125 2023-06-19 14:49:54,831 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.90 vs. limit=22.5 2023-06-19 14:50:04,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-19 14:50:24,330 INFO [train.py:996] (0/4) Epoch 3, batch 20350, loss[loss=0.2725, simple_loss=0.3326, pruned_loss=0.1062, over 21899.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.3514, pruned_loss=0.1069, over 4268050.30 frames. ], batch size: 118, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:51:39,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=488214.0, ans=0.125 2023-06-19 14:51:44,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=488274.0, ans=0.0 2023-06-19 14:51:51,835 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.942e+02 3.646e+02 4.954e+02 9.108e+02, threshold=7.293e+02, percent-clipped=8.0 2023-06-19 14:52:05,479 INFO [train.py:996] (0/4) Epoch 3, batch 20400, loss[loss=0.3202, simple_loss=0.3844, pruned_loss=0.128, over 20782.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3543, pruned_loss=0.1104, over 4264264.67 frames. ], batch size: 608, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:53:03,529 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-19 14:53:09,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=488514.0, ans=0.125 2023-06-19 14:53:18,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=488514.0, ans=10.0 2023-06-19 14:53:42,598 INFO [train.py:996] (0/4) Epoch 3, batch 20450, loss[loss=0.2379, simple_loss=0.275, pruned_loss=0.1004, over 20099.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3568, pruned_loss=0.1143, over 4263735.25 frames. 
], batch size: 703, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:54:20,894 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=15.0 2023-06-19 14:55:15,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.996e+02 3.435e+02 4.162e+02 7.102e+02, threshold=6.869e+02, percent-clipped=0.0 2023-06-19 14:55:21,918 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-19 14:55:22,285 INFO [train.py:996] (0/4) Epoch 3, batch 20500, loss[loss=0.2813, simple_loss=0.3313, pruned_loss=0.1157, over 21869.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.3522, pruned_loss=0.1144, over 4257195.81 frames. ], batch size: 371, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:55:34,288 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.26 vs. limit=10.0 2023-06-19 14:55:45,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=488994.0, ans=0.0 2023-06-19 14:56:36,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=489114.0, ans=0.2 2023-06-19 14:57:01,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=489174.0, ans=0.04949747468305833 2023-06-19 14:57:05,178 INFO [train.py:996] (0/4) Epoch 3, batch 20550, loss[loss=0.2563, simple_loss=0.3099, pruned_loss=0.1013, over 21867.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3443, pruned_loss=0.1125, over 4262990.07 frames. ], batch size: 107, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 14:58:39,779 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 2.727e+02 3.175e+02 3.881e+02 7.747e+02, threshold=6.350e+02, percent-clipped=1.0 2023-06-19 14:58:45,952 INFO [train.py:996] (0/4) Epoch 3, batch 20600, loss[loss=0.2659, simple_loss=0.3284, pruned_loss=0.1017, over 21898.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3445, pruned_loss=0.1097, over 4260977.94 frames. ], batch size: 316, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 14:59:34,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.16 vs. limit=12.0 2023-06-19 14:59:53,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=489714.0, ans=0.0 2023-06-19 15:00:08,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=489714.0, ans=0.125 2023-06-19 15:00:25,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-19 15:00:27,245 INFO [train.py:996] (0/4) Epoch 3, batch 20650, loss[loss=0.2475, simple_loss=0.297, pruned_loss=0.09895, over 21174.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.342, pruned_loss=0.1107, over 4266493.63 frames. 
], batch size: 159, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:00:46,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=489894.0, ans=0.2 2023-06-19 15:01:21,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.48 vs. limit=15.0 2023-06-19 15:01:34,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=490014.0, ans=0.0 2023-06-19 15:02:03,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.806e+02 3.239e+02 3.703e+02 6.671e+02, threshold=6.478e+02, percent-clipped=1.0 2023-06-19 15:02:10,375 INFO [train.py:996] (0/4) Epoch 3, batch 20700, loss[loss=0.2651, simple_loss=0.3289, pruned_loss=0.1007, over 21694.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3323, pruned_loss=0.1055, over 4250868.59 frames. ], batch size: 298, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:02:41,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=490194.0, ans=0.125 2023-06-19 15:02:59,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=490254.0, ans=0.2 2023-06-19 15:03:26,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=490314.0, ans=0.125 2023-06-19 15:03:46,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=490374.0, ans=0.2 2023-06-19 15:03:50,844 INFO [train.py:996] (0/4) Epoch 3, batch 20750, loss[loss=0.2877, simple_loss=0.3806, pruned_loss=0.09737, over 21231.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3356, pruned_loss=0.1042, over 4255494.53 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:04:06,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=490434.0, ans=0.1 2023-06-19 15:04:13,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=490434.0, ans=0.125 2023-06-19 15:04:21,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=490494.0, ans=0.0 2023-06-19 15:04:52,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=490554.0, ans=0.1 2023-06-19 15:05:03,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=490614.0, ans=0.0 2023-06-19 15:05:16,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=490674.0, ans=0.0 2023-06-19 15:05:27,646 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.202e+02 3.808e+02 5.093e+02 1.097e+03, threshold=7.616e+02, percent-clipped=4.0 2023-06-19 15:05:33,794 INFO [train.py:996] (0/4) Epoch 3, batch 20800, loss[loss=0.2518, simple_loss=0.2991, pruned_loss=0.1022, over 21402.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.34, pruned_loss=0.1053, over 4257696.33 frames. 
], batch size: 211, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:05:49,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=490734.0, ans=0.125 2023-06-19 15:06:04,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=490794.0, ans=0.125 2023-06-19 15:06:24,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=490854.0, ans=0.0 2023-06-19 15:06:38,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=490914.0, ans=0.0 2023-06-19 15:07:10,054 INFO [train.py:996] (0/4) Epoch 3, batch 20850, loss[loss=0.2194, simple_loss=0.2814, pruned_loss=0.07866, over 21629.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3306, pruned_loss=0.102, over 4266227.96 frames. ], batch size: 230, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:08:07,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=491154.0, ans=0.125 2023-06-19 15:08:09,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=491154.0, ans=0.125 2023-06-19 15:08:20,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=491214.0, ans=0.125 2023-06-19 15:08:37,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=491274.0, ans=0.125 2023-06-19 15:08:45,564 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 3.010e+02 3.845e+02 4.738e+02 1.149e+03, threshold=7.690e+02, percent-clipped=6.0 2023-06-19 15:08:56,668 INFO [train.py:996] (0/4) Epoch 3, batch 20900, loss[loss=0.2719, simple_loss=0.3388, pruned_loss=0.1024, over 21820.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3325, pruned_loss=0.1035, over 4265965.39 frames. ], batch size: 124, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:09:06,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=491334.0, ans=0.1 2023-06-19 15:09:07,497 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=22.5 2023-06-19 15:09:16,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=491394.0, ans=0.125 2023-06-19 15:09:38,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=491454.0, ans=0.125 2023-06-19 15:10:31,175 INFO [train.py:996] (0/4) Epoch 3, batch 20950, loss[loss=0.177, simple_loss=0.2425, pruned_loss=0.05577, over 17056.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3275, pruned_loss=0.09891, over 4261624.98 frames. ], batch size: 64, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:11:46,865 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.07 vs. 
limit=15.0 2023-06-19 15:11:50,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=491814.0, ans=0.2 2023-06-19 15:11:50,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=491814.0, ans=0.0 2023-06-19 15:12:04,999 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.584e+02 3.033e+02 4.070e+02 6.900e+02, threshold=6.066e+02, percent-clipped=0.0 2023-06-19 15:12:06,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.34 vs. limit=15.0 2023-06-19 15:12:11,018 INFO [train.py:996] (0/4) Epoch 3, batch 21000, loss[loss=0.2642, simple_loss=0.3196, pruned_loss=0.1044, over 21576.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3259, pruned_loss=0.09962, over 4252658.95 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:12:11,019 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 15:12:21,577 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6496, 3.4020, 2.0809, 1.6219], device='cuda:0') 2023-06-19 15:12:29,327 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2787, simple_loss=0.3805, pruned_loss=0.08847, over 1796401.00 frames. 2023-06-19 15:12:29,328 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-19 15:12:42,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=491934.0, ans=0.1 2023-06-19 15:13:35,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=492114.0, ans=0.125 2023-06-19 15:13:42,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=492114.0, ans=0.125 2023-06-19 15:14:05,462 INFO [train.py:996] (0/4) Epoch 3, batch 21050, loss[loss=0.3045, simple_loss=0.3434, pruned_loss=0.1328, over 21732.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3238, pruned_loss=0.1001, over 4258514.66 frames. ], batch size: 351, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:14:35,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-19 15:15:13,069 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:15:27,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=492414.0, ans=0.0 2023-06-19 15:15:29,146 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.19 vs. limit=12.0 2023-06-19 15:15:29,261 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=22.5 2023-06-19 15:15:34,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.85 vs. 
limit=22.5 2023-06-19 15:15:39,129 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.805e+02 3.251e+02 3.947e+02 6.448e+02, threshold=6.502e+02, percent-clipped=2.0 2023-06-19 15:15:45,697 INFO [train.py:996] (0/4) Epoch 3, batch 21100, loss[loss=0.2618, simple_loss=0.3061, pruned_loss=0.1087, over 21513.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.32, pruned_loss=0.09924, over 4256275.37 frames. ], batch size: 196, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:15:57,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=492534.0, ans=0.125 2023-06-19 15:16:24,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=492594.0, ans=12.0 2023-06-19 15:16:35,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492654.0, ans=0.1 2023-06-19 15:17:00,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=492714.0, ans=0.025 2023-06-19 15:17:27,206 INFO [train.py:996] (0/4) Epoch 3, batch 21150, loss[loss=0.2722, simple_loss=0.3106, pruned_loss=0.1169, over 21329.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3179, pruned_loss=0.1006, over 4256352.13 frames. ], batch size: 473, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:18:09,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=492894.0, ans=0.125 2023-06-19 15:18:14,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=492954.0, ans=0.2 2023-06-19 15:18:43,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-19 15:18:50,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-19 15:19:03,370 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.674e+02 3.023e+02 3.632e+02 5.729e+02, threshold=6.045e+02, percent-clipped=0.0 2023-06-19 15:19:13,128 INFO [train.py:996] (0/4) Epoch 3, batch 21200, loss[loss=0.206, simple_loss=0.2772, pruned_loss=0.06743, over 21425.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3141, pruned_loss=0.1, over 4256401.25 frames. ], batch size: 211, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:19:24,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=493134.0, ans=0.0 2023-06-19 15:19:24,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.02 vs. limit=15.0 2023-06-19 15:19:26,226 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. 
limit=12.0 2023-06-19 15:20:01,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=493254.0, ans=0.0 2023-06-19 15:20:08,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=493254.0, ans=0.035 2023-06-19 15:20:21,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=493314.0, ans=0.0 2023-06-19 15:20:33,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-19 15:20:38,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=493374.0, ans=0.125 2023-06-19 15:20:44,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=493374.0, ans=0.125 2023-06-19 15:20:49,058 INFO [train.py:996] (0/4) Epoch 3, batch 21250, loss[loss=0.2408, simple_loss=0.3083, pruned_loss=0.08668, over 21669.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3119, pruned_loss=0.09901, over 4265658.83 frames. ], batch size: 298, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:20:49,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=493434.0, ans=0.125 2023-06-19 15:21:07,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493434.0, ans=0.1 2023-06-19 15:21:37,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=493554.0, ans=0.125 2023-06-19 15:22:17,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=493674.0, ans=0.125 2023-06-19 15:22:24,397 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 3.377e+02 3.991e+02 5.475e+02 9.358e+02, threshold=7.981e+02, percent-clipped=20.0 2023-06-19 15:22:29,378 INFO [train.py:996] (0/4) Epoch 3, batch 21300, loss[loss=0.2569, simple_loss=0.328, pruned_loss=0.09294, over 21824.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3179, pruned_loss=0.1007, over 4257881.64 frames. ], batch size: 282, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:23:01,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-19 15:24:17,169 INFO [train.py:996] (0/4) Epoch 3, batch 21350, loss[loss=0.3069, simple_loss=0.3757, pruned_loss=0.1191, over 21540.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3232, pruned_loss=0.1026, over 4262061.69 frames. ], batch size: 471, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:24:20,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=494034.0, ans=0.07 2023-06-19 15:24:47,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.16 vs. 
limit=22.5 2023-06-19 15:24:52,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=494094.0, ans=0.0 2023-06-19 15:25:32,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=494214.0, ans=0.1 2023-06-19 15:25:54,676 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.815e+02 3.178e+02 3.883e+02 6.278e+02, threshold=6.357e+02, percent-clipped=0.0 2023-06-19 15:26:10,371 INFO [train.py:996] (0/4) Epoch 3, batch 21400, loss[loss=0.3348, simple_loss=0.3907, pruned_loss=0.1394, over 21323.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3301, pruned_loss=0.104, over 4269688.52 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:26:14,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-06-19 15:26:25,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=494394.0, ans=0.09899494936611666 2023-06-19 15:26:28,887 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:26:46,413 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:26:52,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=494454.0, ans=0.125 2023-06-19 15:27:05,981 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=15.0 2023-06-19 15:27:25,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=494574.0, ans=0.2 2023-06-19 15:27:45,379 INFO [train.py:996] (0/4) Epoch 3, batch 21450, loss[loss=0.2678, simple_loss=0.3264, pruned_loss=0.1046, over 21328.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3323, pruned_loss=0.1044, over 4277163.57 frames. ], batch size: 159, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:27:54,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=494634.0, ans=0.125 2023-06-19 15:28:57,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-19 15:29:05,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=494874.0, ans=0.09899494936611666 2023-06-19 15:29:21,106 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.895e+02 3.297e+02 3.892e+02 6.030e+02, threshold=6.593e+02, percent-clipped=0.0 2023-06-19 15:29:31,481 INFO [train.py:996] (0/4) Epoch 3, batch 21500, loss[loss=0.2391, simple_loss=0.2953, pruned_loss=0.09144, over 21225.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3299, pruned_loss=0.1059, over 4281854.87 frames. 
], batch size: 159, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:30:33,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=495114.0, ans=0.125 2023-06-19 15:30:40,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=495174.0, ans=0.125 2023-06-19 15:31:00,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=495174.0, ans=0.125 2023-06-19 15:31:06,808 INFO [train.py:996] (0/4) Epoch 3, batch 21550, loss[loss=0.2359, simple_loss=0.2959, pruned_loss=0.08795, over 21693.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3225, pruned_loss=0.1029, over 4277932.42 frames. ], batch size: 282, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:31:26,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=495234.0, ans=0.2 2023-06-19 15:31:28,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=495234.0, ans=0.125 2023-06-19 15:31:52,811 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:32:08,690 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=22.5 2023-06-19 15:32:09,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=495414.0, ans=0.125 2023-06-19 15:32:48,692 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 3.022e+02 3.623e+02 4.651e+02 8.178e+02, threshold=7.247e+02, percent-clipped=4.0 2023-06-19 15:32:57,595 INFO [train.py:996] (0/4) Epoch 3, batch 21600, loss[loss=0.2659, simple_loss=0.3132, pruned_loss=0.1093, over 21578.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3201, pruned_loss=0.1013, over 4275462.20 frames. ], batch size: 415, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:33:05,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=495534.0, ans=0.125 2023-06-19 15:34:26,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-19 15:34:39,716 INFO [train.py:996] (0/4) Epoch 3, batch 21650, loss[loss=0.3711, simple_loss=0.4363, pruned_loss=0.153, over 21479.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3244, pruned_loss=0.09871, over 4269454.40 frames. 
], batch size: 507, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:35:02,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=495894.0, ans=0.125 2023-06-19 15:35:12,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=495894.0, ans=0.0 2023-06-19 15:35:52,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=496074.0, ans=0.125 2023-06-19 15:36:07,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=496074.0, ans=0.2 2023-06-19 15:36:18,634 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 3.153e+02 4.282e+02 5.472e+02 1.270e+03, threshold=8.565e+02, percent-clipped=5.0 2023-06-19 15:36:20,155 INFO [train.py:996] (0/4) Epoch 3, batch 21700, loss[loss=0.2579, simple_loss=0.3045, pruned_loss=0.1056, over 20225.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3234, pruned_loss=0.09655, over 4274852.76 frames. ], batch size: 703, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:36:20,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=496134.0, ans=0.04949747468305833 2023-06-19 15:36:20,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=496134.0, ans=0.125 2023-06-19 15:36:39,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=496194.0, ans=0.05 2023-06-19 15:36:46,312 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.537e-03 2023-06-19 15:36:49,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=496194.0, ans=0.125 2023-06-19 15:36:54,110 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-19 15:37:43,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=496374.0, ans=0.0 2023-06-19 15:37:53,999 INFO [train.py:996] (0/4) Epoch 3, batch 21750, loss[loss=0.2266, simple_loss=0.2794, pruned_loss=0.08685, over 21236.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3191, pruned_loss=0.09635, over 4256622.97 frames. ], batch size: 144, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:38:02,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=496434.0, ans=0.125 2023-06-19 15:38:15,996 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. 
limit=15.0 2023-06-19 15:38:31,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=496494.0, ans=0.0 2023-06-19 15:38:42,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=496554.0, ans=0.0 2023-06-19 15:39:16,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=496674.0, ans=0.1 2023-06-19 15:39:34,243 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.772e+02 3.220e+02 4.013e+02 6.187e+02, threshold=6.439e+02, percent-clipped=0.0 2023-06-19 15:39:41,173 INFO [train.py:996] (0/4) Epoch 3, batch 21800, loss[loss=0.2373, simple_loss=0.2889, pruned_loss=0.09282, over 21509.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3168, pruned_loss=0.09772, over 4238668.07 frames. ], batch size: 263, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:40:06,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=496794.0, ans=0.1 2023-06-19 15:40:23,087 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-19 15:41:18,269 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=12.0 2023-06-19 15:41:23,678 INFO [train.py:996] (0/4) Epoch 3, batch 21850, loss[loss=0.3017, simple_loss=0.3505, pruned_loss=0.1264, over 21873.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3242, pruned_loss=0.09865, over 4250289.26 frames. ], batch size: 107, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:41:25,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=497034.0, ans=0.125 2023-06-19 15:41:48,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=497094.0, ans=0.125 2023-06-19 15:42:17,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.42 vs. limit=6.0 2023-06-19 15:42:54,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=497274.0, ans=0.0 2023-06-19 15:43:01,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=497274.0, ans=0.125 2023-06-19 15:43:01,776 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-06-19 15:43:07,048 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.253e+02 3.918e+02 5.054e+02 8.247e+02, threshold=7.836e+02, percent-clipped=6.0 2023-06-19 15:43:08,888 INFO [train.py:996] (0/4) Epoch 3, batch 21900, loss[loss=0.2389, simple_loss=0.3025, pruned_loss=0.08763, over 21792.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3252, pruned_loss=0.1, over 4264592.72 frames. 
], batch size: 112, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:43:38,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=497454.0, ans=0.0 2023-06-19 15:44:49,962 INFO [train.py:996] (0/4) Epoch 3, batch 21950, loss[loss=0.1969, simple_loss=0.2658, pruned_loss=0.06403, over 21493.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3177, pruned_loss=0.09713, over 4271819.79 frames. ], batch size: 212, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:45:06,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=497694.0, ans=0.125 2023-06-19 15:45:18,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-19 15:45:46,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=497814.0, ans=0.0 2023-06-19 15:46:31,006 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.810e+02 3.300e+02 3.802e+02 6.470e+02, threshold=6.601e+02, percent-clipped=0.0 2023-06-19 15:46:32,709 INFO [train.py:996] (0/4) Epoch 3, batch 22000, loss[loss=0.2389, simple_loss=0.3108, pruned_loss=0.08347, over 21484.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3138, pruned_loss=0.09553, over 4272977.09 frames. ], batch size: 473, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:47:23,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=498054.0, ans=0.1 2023-06-19 15:47:36,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=498114.0, ans=0.125 2023-06-19 15:47:37,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.05 vs. limit=5.0 2023-06-19 15:47:56,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=498174.0, ans=0.125 2023-06-19 15:48:16,232 INFO [train.py:996] (0/4) Epoch 3, batch 22050, loss[loss=0.2725, simple_loss=0.3476, pruned_loss=0.0987, over 21720.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3207, pruned_loss=0.09762, over 4277229.19 frames. ], batch size: 247, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:48:39,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=498294.0, ans=0.125 2023-06-19 15:49:42,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=498474.0, ans=0.1 2023-06-19 15:49:42,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=498474.0, ans=0.125 2023-06-19 15:49:47,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=498474.0, ans=0.125 2023-06-19 15:49:58,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.520e+02 4.276e+02 5.874e+02 8.679e+02, threshold=8.552e+02, percent-clipped=13.0 2023-06-19 15:49:58,821 INFO [train.py:996] (0/4) Epoch 3, batch 22100, loss[loss=0.2701, simple_loss=0.3282, pruned_loss=0.106, over 21924.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3317, pruned_loss=0.1035, over 4263944.61 frames. 
], batch size: 316, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:50:00,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=498534.0, ans=0.125 2023-06-19 15:50:12,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=498534.0, ans=0.125 2023-06-19 15:50:22,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=498594.0, ans=0.5 2023-06-19 15:50:35,423 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.08 vs. limit=10.0 2023-06-19 15:50:56,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=498714.0, ans=0.0 2023-06-19 15:51:05,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=498714.0, ans=0.125 2023-06-19 15:51:33,025 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=16.99 vs. limit=15.0 2023-06-19 15:51:38,281 INFO [train.py:996] (0/4) Epoch 3, batch 22150, loss[loss=0.2638, simple_loss=0.3442, pruned_loss=0.09172, over 21475.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3346, pruned_loss=0.1058, over 4272839.99 frames. ], batch size: 211, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:51:54,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=498894.0, ans=10.0 2023-06-19 15:51:58,335 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:52:11,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=498954.0, ans=0.0 2023-06-19 15:52:46,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=15.0 2023-06-19 15:53:12,339 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-06-19 15:53:18,949 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.814e+02 3.451e+02 4.432e+02 8.221e+02, threshold=6.902e+02, percent-clipped=0.0 2023-06-19 15:53:18,975 INFO [train.py:996] (0/4) Epoch 3, batch 22200, loss[loss=0.3543, simple_loss=0.3917, pruned_loss=0.1585, over 21760.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3375, pruned_loss=0.1067, over 4279672.80 frames. ], batch size: 508, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:53:38,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=499194.0, ans=0.015 2023-06-19 15:54:03,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=499254.0, ans=0.125 2023-06-19 15:54:50,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=499374.0, ans=0.125 2023-06-19 15:55:01,253 INFO [train.py:996] (0/4) Epoch 3, batch 22250, loss[loss=0.292, simple_loss=0.3482, pruned_loss=0.118, over 21772.00 frames. 
], tot_loss[loss=0.2814, simple_loss=0.3445, pruned_loss=0.1092, over 4286701.85 frames. ], batch size: 247, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:55:01,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=499434.0, ans=0.2 2023-06-19 15:55:08,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=499434.0, ans=0.2 2023-06-19 15:55:58,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.45 vs. limit=10.0 2023-06-19 15:56:38,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=499674.0, ans=0.125 2023-06-19 15:56:41,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.175e+02 3.826e+02 4.848e+02 6.426e+02, threshold=7.653e+02, percent-clipped=0.0 2023-06-19 15:56:41,239 INFO [train.py:996] (0/4) Epoch 3, batch 22300, loss[loss=0.2566, simple_loss=0.3136, pruned_loss=0.09979, over 21389.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3469, pruned_loss=0.1113, over 4289615.53 frames. ], batch size: 211, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:56:43,514 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-19 15:57:50,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=499914.0, ans=0.0 2023-06-19 15:57:51,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=499914.0, ans=0.0 2023-06-19 15:58:21,550 INFO [train.py:996] (0/4) Epoch 3, batch 22350, loss[loss=0.2705, simple_loss=0.3359, pruned_loss=0.1026, over 21917.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.343, pruned_loss=0.1107, over 4296424.19 frames. ], batch size: 333, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:58:30,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=500034.0, ans=0.125 2023-06-19 15:59:08,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=500154.0, ans=0.125 2023-06-19 15:59:20,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0 2023-06-19 15:59:37,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=500214.0, ans=0.0 2023-06-19 15:59:49,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=500274.0, ans=0.125 2023-06-19 16:00:02,199 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.767e+02 3.274e+02 4.023e+02 7.731e+02, threshold=6.547e+02, percent-clipped=1.0 2023-06-19 16:00:02,230 INFO [train.py:996] (0/4) Epoch 3, batch 22400, loss[loss=0.2338, simple_loss=0.3147, pruned_loss=0.07638, over 21629.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3391, pruned_loss=0.1064, over 4291167.33 frames. 
], batch size: 247, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:00:14,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=500334.0, ans=0.2 2023-06-19 16:00:17,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=500394.0, ans=0.125 2023-06-19 16:00:45,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=500454.0, ans=0.1 2023-06-19 16:00:48,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=500454.0, ans=0.025 2023-06-19 16:01:17,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=500514.0, ans=0.025 2023-06-19 16:01:42,924 INFO [train.py:996] (0/4) Epoch 3, batch 22450, loss[loss=0.2263, simple_loss=0.2894, pruned_loss=0.08157, over 21177.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.332, pruned_loss=0.1052, over 4289470.80 frames. ], batch size: 549, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:01:51,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=500634.0, ans=0.125 2023-06-19 16:02:06,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=500694.0, ans=0.125 2023-06-19 16:02:08,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=500694.0, ans=0.5 2023-06-19 16:02:36,314 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-19 16:03:22,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=500874.0, ans=0.125 2023-06-19 16:03:27,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 3.119e+02 3.860e+02 5.027e+02 1.347e+03, threshold=7.719e+02, percent-clipped=7.0 2023-06-19 16:03:27,194 INFO [train.py:996] (0/4) Epoch 3, batch 22500, loss[loss=0.2795, simple_loss=0.3268, pruned_loss=0.1161, over 20091.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3286, pruned_loss=0.1056, over 4287065.69 frames. ], batch size: 702, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:03:28,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=500934.0, ans=0.0 2023-06-19 16:04:14,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501054.0, ans=0.1 2023-06-19 16:04:14,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=501054.0, ans=0.04949747468305833 2023-06-19 16:05:08,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=501234.0, ans=0.125 2023-06-19 16:05:09,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=501234.0, ans=0.0 2023-06-19 16:05:10,042 INFO [train.py:996] (0/4) Epoch 3, batch 22550, loss[loss=0.2383, simple_loss=0.3073, pruned_loss=0.08464, over 21923.00 frames. 
], tot_loss[loss=0.2717, simple_loss=0.332, pruned_loss=0.1057, over 4289723.32 frames. ], batch size: 299, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:05:27,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=501234.0, ans=0.0 2023-06-19 16:06:39,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=501474.0, ans=0.035 2023-06-19 16:07:06,031 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 3.038e+02 3.713e+02 4.829e+02 9.473e+02, threshold=7.425e+02, percent-clipped=2.0 2023-06-19 16:07:06,064 INFO [train.py:996] (0/4) Epoch 3, batch 22600, loss[loss=0.2483, simple_loss=0.3025, pruned_loss=0.09701, over 21800.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3343, pruned_loss=0.1064, over 4287130.72 frames. ], batch size: 112, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:07:06,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501534.0, ans=0.1 2023-06-19 16:07:25,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-19 16:07:43,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=501594.0, ans=0.2 2023-06-19 16:08:04,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=501714.0, ans=0.125 2023-06-19 16:08:23,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=501774.0, ans=0.125 2023-06-19 16:08:40,619 INFO [train.py:996] (0/4) Epoch 3, batch 22650, loss[loss=0.2616, simple_loss=0.312, pruned_loss=0.1056, over 21475.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3315, pruned_loss=0.1051, over 4286772.00 frames. ], batch size: 441, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:09:16,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=501894.0, ans=0.2 2023-06-19 16:09:28,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501954.0, ans=0.1 2023-06-19 16:10:06,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=502074.0, ans=0.125 2023-06-19 16:10:23,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.918e+02 3.422e+02 4.341e+02 8.662e+02, threshold=6.843e+02, percent-clipped=1.0 2023-06-19 16:10:23,402 INFO [train.py:996] (0/4) Epoch 3, batch 22700, loss[loss=0.2428, simple_loss=0.2985, pruned_loss=0.09354, over 21317.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3253, pruned_loss=0.1049, over 4275591.32 frames. ], batch size: 131, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:11:59,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502374.0, ans=0.1 2023-06-19 16:12:02,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502374.0, ans=0.1 2023-06-19 16:12:10,400 INFO [train.py:996] (0/4) Epoch 3, batch 22750, loss[loss=0.2972, simple_loss=0.3506, pruned_loss=0.1219, over 21200.00 frames. 
], tot_loss[loss=0.2692, simple_loss=0.3257, pruned_loss=0.1064, over 4272799.65 frames. ], batch size: 143, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:12:13,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=22.5 2023-06-19 16:12:16,304 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=22.5 2023-06-19 16:12:20,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=502434.0, ans=10.0 2023-06-19 16:12:29,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=502434.0, ans=0.125 2023-06-19 16:13:34,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=502674.0, ans=0.2 2023-06-19 16:13:51,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 3.384e+02 3.991e+02 5.040e+02 7.219e+02, threshold=7.983e+02, percent-clipped=3.0 2023-06-19 16:13:51,365 INFO [train.py:996] (0/4) Epoch 3, batch 22800, loss[loss=0.2788, simple_loss=0.3374, pruned_loss=0.1101, over 21337.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3317, pruned_loss=0.1099, over 4271791.85 frames. ], batch size: 159, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:14:05,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=502734.0, ans=0.0 2023-06-19 16:14:05,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=502734.0, ans=0.0 2023-06-19 16:14:07,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=502734.0, ans=0.0 2023-06-19 16:14:50,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=502914.0, ans=0.04949747468305833 2023-06-19 16:15:14,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=502974.0, ans=0.0 2023-06-19 16:15:32,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-19 16:15:32,983 INFO [train.py:996] (0/4) Epoch 3, batch 22850, loss[loss=0.263, simple_loss=0.3192, pruned_loss=0.1034, over 21236.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3283, pruned_loss=0.1085, over 4268239.73 frames. ], batch size: 548, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:15:47,342 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-19 16:15:56,633 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-19 16:17:16,751 INFO [train.py:996] (0/4) Epoch 3, batch 22900, loss[loss=0.2248, simple_loss=0.2783, pruned_loss=0.08562, over 21758.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3318, pruned_loss=0.1079, over 4271139.43 frames. 
], batch size: 112, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:17:18,569 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.187e+02 3.862e+02 4.458e+02 8.142e+02, threshold=7.724e+02, percent-clipped=1.0 2023-06-19 16:17:24,735 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.72 vs. limit=15.0 2023-06-19 16:17:41,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=503334.0, ans=0.125 2023-06-19 16:17:43,570 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-19 16:19:04,718 INFO [train.py:996] (0/4) Epoch 3, batch 22950, loss[loss=0.2432, simple_loss=0.3362, pruned_loss=0.07509, over 21229.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3448, pruned_loss=0.106, over 4271028.57 frames. ], batch size: 143, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:19:27,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=503694.0, ans=0.0 2023-06-19 16:19:33,218 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-06-19 16:19:35,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=503694.0, ans=0.0 2023-06-19 16:19:51,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=503754.0, ans=0.125 2023-06-19 16:20:45,483 INFO [train.py:996] (0/4) Epoch 3, batch 23000, loss[loss=0.236, simple_loss=0.302, pruned_loss=0.08506, over 21798.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3435, pruned_loss=0.1037, over 4275296.42 frames. ], batch size: 282, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:20:51,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.906e+02 3.294e+02 4.043e+02 6.729e+02, threshold=6.588e+02, percent-clipped=0.0 2023-06-19 16:21:07,265 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-84000.pt 2023-06-19 16:21:10,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=503994.0, ans=0.95 2023-06-19 16:21:33,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=504054.0, ans=0.125 2023-06-19 16:22:33,414 INFO [train.py:996] (0/4) Epoch 3, batch 23050, loss[loss=0.3467, simple_loss=0.3933, pruned_loss=0.15, over 21700.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.3459, pruned_loss=0.1068, over 4281447.82 frames. ], batch size: 351, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:23:34,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=504414.0, ans=0.125 2023-06-19 16:24:14,601 INFO [train.py:996] (0/4) Epoch 3, batch 23100, loss[loss=0.2176, simple_loss=0.2754, pruned_loss=0.07994, over 19981.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.341, pruned_loss=0.1078, over 4267721.58 frames. 
], batch size: 703, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:24:16,318 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.949e+02 3.465e+02 4.322e+02 6.088e+02, threshold=6.930e+02, percent-clipped=0.0 2023-06-19 16:24:20,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=504534.0, ans=0.125 2023-06-19 16:24:31,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=504594.0, ans=0.0 2023-06-19 16:25:10,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=504654.0, ans=0.2 2023-06-19 16:25:49,788 INFO [train.py:996] (0/4) Epoch 3, batch 23150, loss[loss=0.2904, simple_loss=0.3408, pruned_loss=0.12, over 21806.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3343, pruned_loss=0.1066, over 4276826.16 frames. ], batch size: 441, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:26:46,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=505014.0, ans=0.0 2023-06-19 16:26:57,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=505014.0, ans=0.125 2023-06-19 16:27:29,915 INFO [train.py:996] (0/4) Epoch 3, batch 23200, loss[loss=0.2957, simple_loss=0.3487, pruned_loss=0.1214, over 21898.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3336, pruned_loss=0.1075, over 4289950.89 frames. ], batch size: 391, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:27:31,412 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.188e+02 3.765e+02 4.583e+02 7.279e+02, threshold=7.530e+02, percent-clipped=1.0 2023-06-19 16:28:08,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=505254.0, ans=0.125 2023-06-19 16:28:18,759 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:29:11,847 INFO [train.py:996] (0/4) Epoch 3, batch 23250, loss[loss=0.286, simple_loss=0.3393, pruned_loss=0.1163, over 21514.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3351, pruned_loss=0.1091, over 4291467.88 frames. ], batch size: 548, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:29:22,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=505434.0, ans=0.2 2023-06-19 16:29:24,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=505434.0, ans=0.0 2023-06-19 16:29:51,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. 
limit=15.0 2023-06-19 16:29:56,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=505554.0, ans=0.0 2023-06-19 16:30:28,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=505614.0, ans=0.05 2023-06-19 16:30:37,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=505674.0, ans=0.125 2023-06-19 16:30:44,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=505674.0, ans=0.0 2023-06-19 16:30:48,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=505674.0, ans=0.0 2023-06-19 16:30:55,066 INFO [train.py:996] (0/4) Epoch 3, batch 23300, loss[loss=0.3484, simple_loss=0.4029, pruned_loss=0.1469, over 21711.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3436, pruned_loss=0.1109, over 4290928.70 frames. ], batch size: 441, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:30:56,672 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.151e+02 3.585e+02 4.227e+02 7.319e+02, threshold=7.169e+02, percent-clipped=0.0 2023-06-19 16:30:58,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=505734.0, ans=0.0 2023-06-19 16:31:29,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=505794.0, ans=0.2 2023-06-19 16:32:02,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=505914.0, ans=0.0 2023-06-19 16:32:15,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=505914.0, ans=0.0 2023-06-19 16:32:17,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=505914.0, ans=0.0 2023-06-19 16:32:25,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=505974.0, ans=0.0 2023-06-19 16:32:29,708 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0 2023-06-19 16:32:38,663 INFO [train.py:996] (0/4) Epoch 3, batch 23350, loss[loss=0.3033, simple_loss=0.3569, pruned_loss=0.1248, over 19980.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3491, pruned_loss=0.1107, over 4289888.13 frames. ], batch size: 702, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:33:01,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=22.5 2023-06-19 16:33:39,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=506154.0, ans=0.125 2023-06-19 16:33:58,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.94 vs. 
limit=12.0 2023-06-19 16:34:08,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=506274.0, ans=0.125 2023-06-19 16:34:10,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=506274.0, ans=0.125 2023-06-19 16:34:21,310 INFO [train.py:996] (0/4) Epoch 3, batch 23400, loss[loss=0.2664, simple_loss=0.3256, pruned_loss=0.1036, over 21783.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3428, pruned_loss=0.1065, over 4286295.16 frames. ], batch size: 247, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:34:22,875 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 2.587e+02 3.042e+02 3.768e+02 6.854e+02, threshold=6.085e+02, percent-clipped=0.0 2023-06-19 16:35:07,037 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.57 vs. limit=15.0 2023-06-19 16:35:42,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=506514.0, ans=0.2 2023-06-19 16:36:07,916 INFO [train.py:996] (0/4) Epoch 3, batch 23450, loss[loss=0.2843, simple_loss=0.3394, pruned_loss=0.1146, over 21323.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3439, pruned_loss=0.1096, over 4284656.65 frames. ], batch size: 176, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:36:29,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.23 vs. limit=12.0 2023-06-19 16:36:37,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506694.0, ans=0.1 2023-06-19 16:36:46,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=506754.0, ans=0.125 2023-06-19 16:37:49,350 INFO [train.py:996] (0/4) Epoch 3, batch 23500, loss[loss=0.2988, simple_loss=0.345, pruned_loss=0.1263, over 21250.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.344, pruned_loss=0.1118, over 4290215.95 frames. ], batch size: 143, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:37:50,957 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.271e+02 4.126e+02 5.318e+02 8.868e+02, threshold=8.252e+02, percent-clipped=14.0 2023-06-19 16:39:31,003 INFO [train.py:996] (0/4) Epoch 3, batch 23550, loss[loss=0.2782, simple_loss=0.3254, pruned_loss=0.1155, over 21528.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3374, pruned_loss=0.1101, over 4287351.92 frames. ], batch size: 441, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:39:49,517 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.89 vs. limit=22.5 2023-06-19 16:39:53,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=507294.0, ans=0.125 2023-06-19 16:40:09,301 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. 
limit=15.0 2023-06-19 16:40:39,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=507414.0, ans=0.1 2023-06-19 16:40:46,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=507414.0, ans=0.1 2023-06-19 16:41:17,625 INFO [train.py:996] (0/4) Epoch 3, batch 23600, loss[loss=0.2699, simple_loss=0.3355, pruned_loss=0.1022, over 21327.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3363, pruned_loss=0.1093, over 4281148.41 frames. ], batch size: 159, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:41:19,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 3.135e+02 3.693e+02 4.651e+02 9.053e+02, threshold=7.385e+02, percent-clipped=1.0 2023-06-19 16:41:19,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=507534.0, ans=0.035 2023-06-19 16:41:42,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=507594.0, ans=10.0 2023-06-19 16:42:52,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=507774.0, ans=0.025 2023-06-19 16:42:52,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=507774.0, ans=0.125 2023-06-19 16:43:00,385 INFO [train.py:996] (0/4) Epoch 3, batch 23650, loss[loss=0.3002, simple_loss=0.372, pruned_loss=0.1142, over 20734.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3359, pruned_loss=0.1069, over 4281454.02 frames. ], batch size: 607, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:43:06,684 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-06-19 16:44:48,419 INFO [train.py:996] (0/4) Epoch 3, batch 23700, loss[loss=0.3025, simple_loss=0.3633, pruned_loss=0.1208, over 21736.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3394, pruned_loss=0.1068, over 4285783.83 frames. ], batch size: 441, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:44:49,950 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.801e+02 3.226e+02 4.051e+02 6.982e+02, threshold=6.453e+02, percent-clipped=0.0 2023-06-19 16:45:03,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=508134.0, ans=0.0 2023-06-19 16:45:07,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=508134.0, ans=22.5 2023-06-19 16:45:11,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=508194.0, ans=0.125 2023-06-19 16:45:16,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=508194.0, ans=0.125 2023-06-19 16:45:24,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=508254.0, ans=0.125 2023-06-19 16:45:30,463 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. 
limit=15.0 2023-06-19 16:45:31,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=508254.0, ans=0.125 2023-06-19 16:46:01,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-19 16:46:03,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=508314.0, ans=0.0 2023-06-19 16:46:36,740 INFO [train.py:996] (0/4) Epoch 3, batch 23750, loss[loss=0.2546, simple_loss=0.3327, pruned_loss=0.08827, over 21260.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3418, pruned_loss=0.1074, over 4283613.01 frames. ], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:46:38,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=508434.0, ans=0.125 2023-06-19 16:46:47,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=508434.0, ans=0.125 2023-06-19 16:46:47,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=508434.0, ans=0.09899494936611666 2023-06-19 16:46:50,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=508434.0, ans=0.125 2023-06-19 16:46:52,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=508494.0, ans=0.125 2023-06-19 16:46:55,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=508494.0, ans=0.0 2023-06-19 16:47:07,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=508494.0, ans=0.125 2023-06-19 16:47:08,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=508554.0, ans=0.125 2023-06-19 16:48:21,143 INFO [train.py:996] (0/4) Epoch 3, batch 23800, loss[loss=0.2863, simple_loss=0.3738, pruned_loss=0.09942, over 21806.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3391, pruned_loss=0.104, over 4281594.26 frames. ], batch size: 316, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:48:22,761 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.690e+02 3.256e+02 4.075e+02 6.648e+02, threshold=6.511e+02, percent-clipped=1.0 2023-06-19 16:48:34,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0 2023-06-19 16:50:04,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=509034.0, ans=0.2 2023-06-19 16:50:05,859 INFO [train.py:996] (0/4) Epoch 3, batch 23850, loss[loss=0.2852, simple_loss=0.3525, pruned_loss=0.109, over 21497.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3499, pruned_loss=0.1071, over 4281730.08 frames. 
], batch size: 211, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:51:01,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=509154.0, ans=0.125 2023-06-19 16:51:10,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=509154.0, ans=0.125 2023-06-19 16:51:16,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509214.0, ans=0.1 2023-06-19 16:51:27,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=509214.0, ans=0.0 2023-06-19 16:51:36,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-19 16:51:43,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=509274.0, ans=0.2 2023-06-19 16:51:48,167 INFO [train.py:996] (0/4) Epoch 3, batch 23900, loss[loss=0.2851, simple_loss=0.3393, pruned_loss=0.1155, over 21815.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3565, pruned_loss=0.1098, over 4279510.31 frames. ], batch size: 107, lr: 1.03e-02, grad_scale: 16.0 2023-06-19 16:51:51,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 3.185e+02 4.046e+02 5.288e+02 1.128e+03, threshold=8.092e+02, percent-clipped=13.0 2023-06-19 16:53:03,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=509514.0, ans=0.0 2023-06-19 16:53:28,852 INFO [train.py:996] (0/4) Epoch 3, batch 23950, loss[loss=0.2496, simple_loss=0.3104, pruned_loss=0.09441, over 21817.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3489, pruned_loss=0.1093, over 4265716.38 frames. ], batch size: 98, lr: 1.03e-02, grad_scale: 16.0 2023-06-19 16:53:48,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-19 16:55:01,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=509874.0, ans=0.125 2023-06-19 16:55:15,657 INFO [train.py:996] (0/4) Epoch 3, batch 24000, loss[loss=0.3409, simple_loss=0.3914, pruned_loss=0.1452, over 21392.00 frames. ], tot_loss[loss=0.2883, simple_loss=0.3509, pruned_loss=0.1129, over 4263989.00 frames. ], batch size: 549, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:55:15,659 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 16:55:31,887 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2855, simple_loss=0.3833, pruned_loss=0.09389, over 1796401.00 frames. 
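Note on the recurring per-batch entries: each train.py:996 line above follows a fixed layout (epoch, batch index, per-utterance loss[...], running tot_loss[...], batch size, lr, grad_scale). As a minimal sketch of how a loss/lr curve could be pulled out of this raw log, the Python snippet below extracts (epoch, batch, tot_loss, lr) tuples; the regex, function name, and sample string are illustrative assumptions based only on the line format shown here, not part of icefall or k2.

    import re

    # Matches the "Epoch N, batch M, ... tot_loss[loss=..., ...], ..., lr: ..." layout
    # of the train.py:996 entries in this log. Field names are illustrative only.
    PATTERN = re.compile(
        r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+),.*?"
        r"tot_loss\[loss=(?P<loss>[\d.]+),.*?"
        r"lr: (?P<lr>[\d.e+-]+)",
        re.DOTALL,
    )

    def parse_train_entries(text: str):
        """Yield (epoch, batch, tot_loss, lr) tuples from raw log text."""
        for m in PATTERN.finditer(text):
            yield (int(m["epoch"]), int(m["batch"]),
                   float(m["loss"]), float(m["lr"]))

    if __name__ == "__main__":
        # Excerpt copied from the batch-24000 entry above.
        sample = ("Epoch 3, batch 24000, loss[loss=0.3409, simple_loss=0.3914, "
                  "pruned_loss=0.1452, over 21392.00 frames. ], "
                  "tot_loss[loss=0.2883, simple_loss=0.3509, pruned_loss=0.1129, "
                  "over 4263989.00 frames. ], batch size: 549, lr: 1.03e-02, "
                  "grad_scale: 32.0")
        print(list(parse_train_entries(sample)))

Run against the full log, this yields one tuple per train.py:996 entry, e.g. (3, 24000, 0.2883, 0.0103) for the batch-24000 line above; the scaling.py and optim.py diagnostics are not touched by this pattern.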
2023-06-19 16:55:31,888 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-19 16:55:35,244 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.384e+02 3.049e+02 3.553e+02 4.728e+02 8.625e+02, threshold=7.107e+02, percent-clipped=2.0 2023-06-19 16:56:03,535 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:56:04,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=509994.0, ans=0.125 2023-06-19 16:56:11,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=509994.0, ans=0.2 2023-06-19 16:57:10,316 INFO [train.py:996] (0/4) Epoch 3, batch 24050, loss[loss=0.2472, simple_loss=0.3262, pruned_loss=0.0841, over 21493.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3522, pruned_loss=0.1133, over 4264359.11 frames. ], batch size: 194, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:57:12,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=510234.0, ans=0.0 2023-06-19 16:58:06,312 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.08 vs. limit=15.0 2023-06-19 16:58:09,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.68 vs. limit=10.0 2023-06-19 16:58:42,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=510474.0, ans=0.0 2023-06-19 16:58:53,209 INFO [train.py:996] (0/4) Epoch 3, batch 24100, loss[loss=0.3058, simple_loss=0.368, pruned_loss=0.1218, over 21863.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3534, pruned_loss=0.1115, over 4271227.31 frames. ], batch size: 316, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:58:56,275 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.988e+02 3.709e+02 5.089e+02 1.009e+03, threshold=7.417e+02, percent-clipped=9.0 2023-06-19 17:00:30,549 INFO [train.py:996] (0/4) Epoch 3, batch 24150, loss[loss=0.2695, simple_loss=0.3194, pruned_loss=0.1098, over 21192.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3533, pruned_loss=0.114, over 4276975.75 frames. ], batch size: 608, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:00:42,873 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.22 vs. limit=10.0 2023-06-19 17:01:02,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=510894.0, ans=0.0 2023-06-19 17:01:10,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=510954.0, ans=0.2 2023-06-19 17:02:14,047 INFO [train.py:996] (0/4) Epoch 3, batch 24200, loss[loss=0.3084, simple_loss=0.3856, pruned_loss=0.1156, over 21784.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3552, pruned_loss=0.1153, over 4279712.42 frames. 
], batch size: 371, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:02:17,193 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.540e+02 3.206e+02 3.739e+02 4.662e+02 8.285e+02, threshold=7.479e+02, percent-clipped=1.0 2023-06-19 17:02:32,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=511194.0, ans=0.025 2023-06-19 17:03:14,557 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=22.5 2023-06-19 17:03:52,833 INFO [train.py:996] (0/4) Epoch 3, batch 24250, loss[loss=0.226, simple_loss=0.3219, pruned_loss=0.06502, over 21646.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3493, pruned_loss=0.1069, over 4274487.63 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:04:10,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=511494.0, ans=0.04949747468305833 2023-06-19 17:04:42,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=511554.0, ans=0.0 2023-06-19 17:04:44,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=511614.0, ans=0.07 2023-06-19 17:05:25,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-19 17:05:28,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=511674.0, ans=0.0 2023-06-19 17:05:33,902 INFO [train.py:996] (0/4) Epoch 3, batch 24300, loss[loss=0.1974, simple_loss=0.269, pruned_loss=0.06287, over 21454.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3389, pruned_loss=0.09863, over 4269791.37 frames. ], batch size: 194, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:05:37,133 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.288e+02 2.786e+02 3.535e+02 7.213e+02, threshold=5.572e+02, percent-clipped=0.0 2023-06-19 17:05:40,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=511734.0, ans=0.2 2023-06-19 17:05:58,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=511794.0, ans=0.2 2023-06-19 17:06:19,203 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.91 vs. limit=6.0 2023-06-19 17:06:37,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=511914.0, ans=0.025 2023-06-19 17:06:57,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=511914.0, ans=0.125 2023-06-19 17:07:16,696 INFO [train.py:996] (0/4) Epoch 3, batch 24350, loss[loss=0.2632, simple_loss=0.3226, pruned_loss=0.1019, over 21674.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3368, pruned_loss=0.09963, over 4273081.30 frames. 
], batch size: 263, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:08:18,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=512154.0, ans=0.5 2023-06-19 17:09:06,347 INFO [train.py:996] (0/4) Epoch 3, batch 24400, loss[loss=0.3046, simple_loss=0.3565, pruned_loss=0.1264, over 21355.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3431, pruned_loss=0.1046, over 4272602.03 frames. ], batch size: 471, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:09:09,718 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 3.270e+02 4.131e+02 5.260e+02 7.879e+02, threshold=8.262e+02, percent-clipped=18.0 2023-06-19 17:10:12,737 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-19 17:10:23,466 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:10:48,984 INFO [train.py:996] (0/4) Epoch 3, batch 24450, loss[loss=0.2466, simple_loss=0.34, pruned_loss=0.07664, over 21736.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3466, pruned_loss=0.1063, over 4272858.97 frames. ], batch size: 332, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:10:56,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=512634.0, ans=0.05 2023-06-19 17:11:56,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=512814.0, ans=0.125 2023-06-19 17:12:30,308 INFO [train.py:996] (0/4) Epoch 3, batch 24500, loss[loss=0.332, simple_loss=0.3715, pruned_loss=0.1462, over 21762.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3455, pruned_loss=0.1059, over 4277894.95 frames. ], batch size: 508, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:12:33,622 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.872e+02 3.383e+02 4.151e+02 6.413e+02, threshold=6.766e+02, percent-clipped=0.0 2023-06-19 17:12:34,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=512934.0, ans=0.125 2023-06-19 17:12:37,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=512934.0, ans=0.0 2023-06-19 17:13:11,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=512994.0, ans=0.125 2023-06-19 17:13:20,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=513054.0, ans=0.1 2023-06-19 17:13:26,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=513054.0, ans=0.125 2023-06-19 17:13:37,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=513114.0, ans=0.125 2023-06-19 17:13:42,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=513114.0, ans=0.0 2023-06-19 17:14:12,269 INFO [train.py:996] (0/4) Epoch 3, batch 24550, loss[loss=0.3362, simple_loss=0.3837, pruned_loss=0.1444, over 21228.00 frames. 
], tot_loss[loss=0.283, simple_loss=0.3482, pruned_loss=0.1089, over 4281147.16 frames. ], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:15:22,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-19 17:15:23,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=513414.0, ans=0.125 2023-06-19 17:15:25,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=513414.0, ans=0.0 2023-06-19 17:15:41,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=513474.0, ans=10.0 2023-06-19 17:15:54,315 INFO [train.py:996] (0/4) Epoch 3, batch 24600, loss[loss=0.2433, simple_loss=0.3055, pruned_loss=0.09051, over 21582.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3443, pruned_loss=0.11, over 4275823.17 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:15:57,342 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.888e+02 3.572e+02 4.375e+02 7.058e+02, threshold=7.144e+02, percent-clipped=1.0 2023-06-19 17:15:57,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=513534.0, ans=0.125 2023-06-19 17:16:11,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=513534.0, ans=0.0 2023-06-19 17:16:56,397 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=15.0 2023-06-19 17:17:13,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=513714.0, ans=0.09899494936611666 2023-06-19 17:17:20,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=513774.0, ans=0.1 2023-06-19 17:17:35,837 INFO [train.py:996] (0/4) Epoch 3, batch 24650, loss[loss=0.2542, simple_loss=0.3081, pruned_loss=0.1001, over 21763.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3356, pruned_loss=0.1079, over 4275811.16 frames. ], batch size: 118, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:17:55,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=513834.0, ans=0.125 2023-06-19 17:17:58,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=513894.0, ans=0.0 2023-06-19 17:18:48,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=514014.0, ans=0.0 2023-06-19 17:18:49,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=514014.0, ans=0.125 2023-06-19 17:19:10,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=514074.0, ans=0.125 2023-06-19 17:19:13,024 INFO [train.py:996] (0/4) Epoch 3, batch 24700, loss[loss=0.2294, simple_loss=0.2943, pruned_loss=0.08228, over 21204.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3343, pruned_loss=0.1055, over 4272033.41 frames. 
], batch size: 176, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:19:16,030 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.157e+02 3.618e+02 4.336e+02 6.867e+02, threshold=7.236e+02, percent-clipped=0.0 2023-06-19 17:19:34,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=514134.0, ans=0.125 2023-06-19 17:20:27,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=514314.0, ans=0.125 2023-06-19 17:20:55,375 INFO [train.py:996] (0/4) Epoch 3, batch 24750, loss[loss=0.2483, simple_loss=0.2995, pruned_loss=0.09855, over 21618.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3276, pruned_loss=0.1017, over 4267922.59 frames. ], batch size: 415, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:21:20,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-06-19 17:22:13,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=514614.0, ans=0.125 2023-06-19 17:22:15,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=514614.0, ans=0.125 2023-06-19 17:22:24,647 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.93 vs. limit=12.0 2023-06-19 17:22:29,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=514674.0, ans=0.125 2023-06-19 17:22:34,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-06-19 17:22:36,164 INFO [train.py:996] (0/4) Epoch 3, batch 24800, loss[loss=0.2367, simple_loss=0.2782, pruned_loss=0.09757, over 21025.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3211, pruned_loss=0.1013, over 4273068.22 frames. ], batch size: 608, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:22:39,056 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.753e+02 3.144e+02 3.669e+02 5.851e+02, threshold=6.289e+02, percent-clipped=0.0 2023-06-19 17:23:57,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=514914.0, ans=0.125 2023-06-19 17:24:18,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=15.0 2023-06-19 17:24:19,452 INFO [train.py:996] (0/4) Epoch 3, batch 24850, loss[loss=0.2413, simple_loss=0.2906, pruned_loss=0.096, over 21314.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.324, pruned_loss=0.1039, over 4278375.07 frames. 
], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:24:32,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=515034.0, ans=0.0 2023-06-19 17:25:18,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=515154.0, ans=0.125 2023-06-19 17:25:35,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=515214.0, ans=0.0 2023-06-19 17:26:02,433 INFO [train.py:996] (0/4) Epoch 3, batch 24900, loss[loss=0.3344, simple_loss=0.3856, pruned_loss=0.1416, over 21483.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3282, pruned_loss=0.1054, over 4279404.82 frames. ], batch size: 194, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:26:11,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.873e+02 3.627e+02 4.468e+02 7.935e+02, threshold=7.253e+02, percent-clipped=5.0 2023-06-19 17:26:46,203 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.47 vs. limit=15.0 2023-06-19 17:27:00,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=515454.0, ans=0.0 2023-06-19 17:27:20,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=515514.0, ans=0.125 2023-06-19 17:27:32,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=515574.0, ans=0.125 2023-06-19 17:27:40,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=515574.0, ans=0.125 2023-06-19 17:27:57,504 INFO [train.py:996] (0/4) Epoch 3, batch 24950, loss[loss=0.2819, simple_loss=0.3527, pruned_loss=0.1056, over 20643.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.3376, pruned_loss=0.111, over 4281021.41 frames. ], batch size: 607, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:28:21,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=515694.0, ans=0.0 2023-06-19 17:29:46,124 INFO [train.py:996] (0/4) Epoch 3, batch 25000, loss[loss=0.2537, simple_loss=0.3142, pruned_loss=0.09656, over 21279.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3456, pruned_loss=0.1137, over 4278487.41 frames. ], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:29:46,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=515934.0, ans=0.125 2023-06-19 17:29:49,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 2.951e+02 3.694e+02 4.326e+02 9.045e+02, threshold=7.388e+02, percent-clipped=1.0 2023-06-19 17:30:07,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=515994.0, ans=0.2 2023-06-19 17:31:24,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=516174.0, ans=0.015 2023-06-19 17:31:28,983 INFO [train.py:996] (0/4) Epoch 3, batch 25050, loss[loss=0.2249, simple_loss=0.2777, pruned_loss=0.08604, over 21665.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3374, pruned_loss=0.1117, over 4267034.00 frames. 
], batch size: 248, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:32:38,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=516414.0, ans=0.0 2023-06-19 17:32:38,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=516414.0, ans=0.125 2023-06-19 17:32:55,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=516474.0, ans=0.0 2023-06-19 17:33:11,008 INFO [train.py:996] (0/4) Epoch 3, batch 25100, loss[loss=0.3036, simple_loss=0.3319, pruned_loss=0.1377, over 21331.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.331, pruned_loss=0.1098, over 4263875.52 frames. ], batch size: 473, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:33:13,886 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.071e+02 3.524e+02 4.196e+02 8.233e+02, threshold=7.049e+02, percent-clipped=3.0 2023-06-19 17:33:39,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=516594.0, ans=0.035 2023-06-19 17:33:56,661 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:34:04,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=516714.0, ans=0.0 2023-06-19 17:34:37,695 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.15 vs. limit=15.0 2023-06-19 17:34:47,478 INFO [train.py:996] (0/4) Epoch 3, batch 25150, loss[loss=0.3147, simple_loss=0.3922, pruned_loss=0.1186, over 21437.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.335, pruned_loss=0.1066, over 4271161.36 frames. 
], batch size: 471, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:35:02,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=516834.0, ans=0.1 2023-06-19 17:35:16,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=516894.0, ans=0.125 2023-06-19 17:35:24,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=516954.0, ans=0.1 2023-06-19 17:35:31,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=516954.0, ans=0.125 2023-06-19 17:35:39,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=516954.0, ans=0.2 2023-06-19 17:35:47,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=517014.0, ans=10.0 2023-06-19 17:35:47,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=517014.0, ans=0.0 2023-06-19 17:36:13,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=517074.0, ans=0.125 2023-06-19 17:36:23,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=517074.0, ans=0.125 2023-06-19 17:36:29,183 INFO [train.py:996] (0/4) Epoch 3, batch 25200, loss[loss=0.2693, simple_loss=0.3386, pruned_loss=0.1, over 21697.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3346, pruned_loss=0.1037, over 4272553.56 frames. ], batch size: 298, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:36:32,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.662e+02 3.153e+02 4.538e+02 8.599e+02, threshold=6.306e+02, percent-clipped=6.0 2023-06-19 17:36:46,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-19 17:37:51,739 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.89 vs. limit=10.0 2023-06-19 17:38:10,297 INFO [train.py:996] (0/4) Epoch 3, batch 25250, loss[loss=0.2221, simple_loss=0.3031, pruned_loss=0.07053, over 16713.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3304, pruned_loss=0.1008, over 4260376.00 frames. ], batch size: 62, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:39:56,965 INFO [train.py:996] (0/4) Epoch 3, batch 25300, loss[loss=0.2953, simple_loss=0.3704, pruned_loss=0.1101, over 21445.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3274, pruned_loss=0.09999, over 4253034.59 frames. 
], batch size: 131, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:39:59,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=517734.0, ans=0.2 2023-06-19 17:40:00,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.138e+02 3.706e+02 4.437e+02 8.805e+02, threshold=7.413e+02, percent-clipped=6.0 2023-06-19 17:40:10,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=517734.0, ans=0.0 2023-06-19 17:40:14,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=517794.0, ans=0.1 2023-06-19 17:40:26,521 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:40:27,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.89 vs. limit=10.0 2023-06-19 17:40:49,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=517914.0, ans=0.125 2023-06-19 17:41:02,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=517914.0, ans=0.04949747468305833 2023-06-19 17:41:17,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=517974.0, ans=0.5 2023-06-19 17:41:37,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=517974.0, ans=0.1 2023-06-19 17:41:40,355 INFO [train.py:996] (0/4) Epoch 3, batch 25350, loss[loss=0.1978, simple_loss=0.2726, pruned_loss=0.06152, over 21381.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.329, pruned_loss=0.09902, over 4243257.04 frames. ], batch size: 194, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:41:52,632 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:43:21,817 INFO [train.py:996] (0/4) Epoch 3, batch 25400, loss[loss=0.2215, simple_loss=0.278, pruned_loss=0.08256, over 21473.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3241, pruned_loss=0.09787, over 4246173.72 frames. ], batch size: 212, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:43:24,811 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.818e+02 3.409e+02 4.580e+02 8.063e+02, threshold=6.817e+02, percent-clipped=2.0 2023-06-19 17:43:53,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=518394.0, ans=0.125 2023-06-19 17:43:56,324 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:44:58,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=518574.0, ans=0.1 2023-06-19 17:45:02,528 INFO [train.py:996] (0/4) Epoch 3, batch 25450, loss[loss=0.2598, simple_loss=0.3278, pruned_loss=0.09592, over 21810.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3254, pruned_loss=0.1001, over 4255854.64 frames. 
], batch size: 118, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:45:06,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=518634.0, ans=0.125 2023-06-19 17:45:22,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=518694.0, ans=0.1 2023-06-19 17:45:22,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=518694.0, ans=0.0 2023-06-19 17:45:48,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=518754.0, ans=0.0 2023-06-19 17:45:55,833 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=15.0 2023-06-19 17:46:02,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=518814.0, ans=0.0 2023-06-19 17:46:03,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=518814.0, ans=0.95 2023-06-19 17:46:38,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=518874.0, ans=0.125 2023-06-19 17:46:46,187 INFO [train.py:996] (0/4) Epoch 3, batch 25500, loss[loss=0.2819, simple_loss=0.3483, pruned_loss=0.1077, over 21260.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3259, pruned_loss=0.0954, over 4258164.68 frames. ], batch size: 159, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:46:49,362 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.719e+02 2.642e+02 3.063e+02 3.580e+02 7.751e+02, threshold=6.127e+02, percent-clipped=1.0 2023-06-19 17:46:49,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=518934.0, ans=0.125 2023-06-19 17:47:04,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=518994.0, ans=0.2 2023-06-19 17:48:26,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=519174.0, ans=0.0 2023-06-19 17:48:31,160 INFO [train.py:996] (0/4) Epoch 3, batch 25550, loss[loss=0.2648, simple_loss=0.3304, pruned_loss=0.09959, over 19997.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3323, pruned_loss=0.09598, over 4248987.04 frames. ], batch size: 702, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:49:12,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=519354.0, ans=0.0 2023-06-19 17:49:47,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=519414.0, ans=0.0 2023-06-19 17:49:55,582 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-19 17:49:57,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.76 vs. limit=6.0 2023-06-19 17:50:01,536 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. 
limit=15.0 2023-06-19 17:50:20,664 INFO [train.py:996] (0/4) Epoch 3, batch 25600, loss[loss=0.3109, simple_loss=0.3661, pruned_loss=0.1279, over 20777.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3359, pruned_loss=0.09688, over 4255332.28 frames. ], batch size: 608, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:50:21,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=519534.0, ans=0.2 2023-06-19 17:50:23,744 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.720e+02 3.211e+02 3.853e+02 6.629e+02, threshold=6.421e+02, percent-clipped=1.0 2023-06-19 17:50:47,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=519594.0, ans=0.125 2023-06-19 17:51:34,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-19 17:51:57,695 INFO [train.py:996] (0/4) Epoch 3, batch 25650, loss[loss=0.2779, simple_loss=0.322, pruned_loss=0.1169, over 21238.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.338, pruned_loss=0.1013, over 4243417.21 frames. ], batch size: 471, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:52:31,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=519894.0, ans=0.0 2023-06-19 17:52:59,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519954.0, ans=0.1 2023-06-19 17:53:39,114 INFO [train.py:996] (0/4) Epoch 3, batch 25700, loss[loss=0.2709, simple_loss=0.3367, pruned_loss=0.1026, over 21371.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3353, pruned_loss=0.1028, over 4257776.25 frames. ], batch size: 131, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:53:46,989 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.093e+02 3.779e+02 4.609e+02 9.934e+02, threshold=7.559e+02, percent-clipped=6.0 2023-06-19 17:54:02,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=520194.0, ans=0.125 2023-06-19 17:54:05,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=520194.0, ans=0.125 2023-06-19 17:55:28,424 INFO [train.py:996] (0/4) Epoch 3, batch 25750, loss[loss=0.4326, simple_loss=0.4854, pruned_loss=0.1899, over 21739.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3429, pruned_loss=0.1068, over 4255839.82 frames. ], batch size: 441, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:56:02,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=520494.0, ans=0.0 2023-06-19 17:56:17,715 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=15.0 2023-06-19 17:56:37,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=520614.0, ans=0.125 2023-06-19 17:57:15,548 INFO [train.py:996] (0/4) Epoch 3, batch 25800, loss[loss=0.3536, simple_loss=0.4061, pruned_loss=0.1506, over 21445.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3573, pruned_loss=0.1121, over 4258644.15 frames. 
], batch size: 159, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:57:25,032 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.576e+02 3.639e+02 4.483e+02 6.036e+02 1.254e+03, threshold=8.967e+02, percent-clipped=11.0 2023-06-19 17:57:27,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-19 17:58:05,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=520854.0, ans=0.125 2023-06-19 17:58:19,498 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-19 17:58:32,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=520914.0, ans=0.125 2023-06-19 17:58:33,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=520914.0, ans=0.125 2023-06-19 17:59:02,383 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-19 17:59:05,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=521034.0, ans=0.1 2023-06-19 17:59:06,186 INFO [train.py:996] (0/4) Epoch 3, batch 25850, loss[loss=0.2723, simple_loss=0.343, pruned_loss=0.1008, over 21746.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.358, pruned_loss=0.1116, over 4262227.52 frames. ], batch size: 389, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:59:33,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-19 18:00:56,504 INFO [train.py:996] (0/4) Epoch 3, batch 25900, loss[loss=0.3465, simple_loss=0.4262, pruned_loss=0.1335, over 21700.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3605, pruned_loss=0.1129, over 4262532.56 frames. ], batch size: 414, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:01:01,408 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.484e+02 3.101e+02 3.467e+02 4.368e+02 8.294e+02, threshold=6.933e+02, percent-clipped=0.0 2023-06-19 18:02:00,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=521514.0, ans=0.5 2023-06-19 18:02:39,537 INFO [train.py:996] (0/4) Epoch 3, batch 25950, loss[loss=0.264, simple_loss=0.3251, pruned_loss=0.1015, over 21196.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3642, pruned_loss=0.115, over 4267032.02 frames. ], batch size: 607, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:02:50,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.47 vs. 
limit=15.0 2023-06-19 18:03:05,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=521694.0, ans=0.0 2023-06-19 18:03:08,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=521694.0, ans=0.2 2023-06-19 18:03:35,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=521754.0, ans=0.0 2023-06-19 18:04:16,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=521874.0, ans=0.125 2023-06-19 18:04:24,064 INFO [train.py:996] (0/4) Epoch 3, batch 26000, loss[loss=0.3101, simple_loss=0.3671, pruned_loss=0.1265, over 20695.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3654, pruned_loss=0.114, over 4266629.06 frames. ], batch size: 607, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:04:35,632 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.072e+02 3.699e+02 4.692e+02 7.013e+02, threshold=7.398e+02, percent-clipped=1.0 2023-06-19 18:05:05,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=522054.0, ans=0.1 2023-06-19 18:05:44,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=522114.0, ans=0.2 2023-06-19 18:05:56,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=522174.0, ans=0.125 2023-06-19 18:06:06,945 INFO [train.py:996] (0/4) Epoch 3, batch 26050, loss[loss=0.2687, simple_loss=0.3237, pruned_loss=0.1068, over 21949.00 frames. ], tot_loss[loss=0.295, simple_loss=0.364, pruned_loss=0.113, over 4265671.33 frames. ], batch size: 283, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:06:43,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=522354.0, ans=0.125 2023-06-19 18:07:15,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=522414.0, ans=0.125 2023-06-19 18:07:49,582 INFO [train.py:996] (0/4) Epoch 3, batch 26100, loss[loss=0.2684, simple_loss=0.3264, pruned_loss=0.1052, over 21883.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3586, pruned_loss=0.1136, over 4266674.67 frames. ], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:08:00,428 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-19 18:08:01,018 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.956e+02 3.379e+02 4.537e+02 7.018e+02, threshold=6.758e+02, percent-clipped=0.0 2023-06-19 18:09:06,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=522714.0, ans=0.5 2023-06-19 18:09:38,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=522834.0, ans=0.125 2023-06-19 18:09:39,650 INFO [train.py:996] (0/4) Epoch 3, batch 26150, loss[loss=0.2729, simple_loss=0.334, pruned_loss=0.1059, over 19996.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3536, pruned_loss=0.1134, over 4269848.04 frames. 
], batch size: 702, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:10:31,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=522954.0, ans=0.1 2023-06-19 18:10:58,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=523014.0, ans=0.125 2023-06-19 18:11:23,944 INFO [train.py:996] (0/4) Epoch 3, batch 26200, loss[loss=0.2614, simple_loss=0.3131, pruned_loss=0.1049, over 20032.00 frames. ], tot_loss[loss=0.2884, simple_loss=0.3541, pruned_loss=0.1114, over 4274132.01 frames. ], batch size: 703, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:11:27,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=523134.0, ans=0.125 2023-06-19 18:11:30,761 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 3.095e+02 3.569e+02 4.232e+02 6.752e+02, threshold=7.138e+02, percent-clipped=0.0 2023-06-19 18:11:47,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=523194.0, ans=0.0 2023-06-19 18:12:29,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=523314.0, ans=0.125 2023-06-19 18:12:32,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=523314.0, ans=0.125 2023-06-19 18:13:06,901 INFO [train.py:996] (0/4) Epoch 3, batch 26250, loss[loss=0.2806, simple_loss=0.3453, pruned_loss=0.1079, over 21553.00 frames. ], tot_loss[loss=0.2909, simple_loss=0.3592, pruned_loss=0.1113, over 4275488.98 frames. ], batch size: 548, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:13:15,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=523434.0, ans=0.0 2023-06-19 18:14:07,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=523614.0, ans=0.2 2023-06-19 18:14:44,573 INFO [train.py:996] (0/4) Epoch 3, batch 26300, loss[loss=0.3175, simple_loss=0.368, pruned_loss=0.1334, over 21883.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3551, pruned_loss=0.1118, over 4282342.92 frames. ], batch size: 124, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:14:51,298 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 3.134e+02 3.781e+02 4.659e+02 7.680e+02, threshold=7.563e+02, percent-clipped=3.0 2023-06-19 18:14:52,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2023-06-19 18:15:00,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=523734.0, ans=0.0 2023-06-19 18:16:01,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=523914.0, ans=10.0 2023-06-19 18:16:39,681 INFO [train.py:996] (0/4) Epoch 3, batch 26350, loss[loss=0.3272, simple_loss=0.375, pruned_loss=0.1397, over 21875.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.3544, pruned_loss=0.1132, over 4284502.91 frames. 
], batch size: 316, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:16:41,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=524034.0, ans=0.0 2023-06-19 18:16:43,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=524034.0, ans=0.1 2023-06-19 18:17:03,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.43 vs. limit=15.0 2023-06-19 18:17:35,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=524154.0, ans=0.125 2023-06-19 18:17:37,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=524214.0, ans=0.125 2023-06-19 18:18:16,305 INFO [train.py:996] (0/4) Epoch 3, batch 26400, loss[loss=0.2493, simple_loss=0.2999, pruned_loss=0.09932, over 21263.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3487, pruned_loss=0.1136, over 4276926.04 frames. ], batch size: 160, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:18:28,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.857e+02 3.384e+02 4.347e+02 8.285e+02, threshold=6.769e+02, percent-clipped=0.0 2023-06-19 18:18:33,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.61 vs. limit=22.5 2023-06-19 18:20:12,396 INFO [train.py:996] (0/4) Epoch 3, batch 26450, loss[loss=0.3269, simple_loss=0.415, pruned_loss=0.1194, over 21727.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3477, pruned_loss=0.1124, over 4271791.96 frames. ], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:20:28,851 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-19 18:21:13,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=524814.0, ans=0.2 2023-06-19 18:21:53,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=524874.0, ans=0.125 2023-06-19 18:21:56,778 INFO [train.py:996] (0/4) Epoch 3, batch 26500, loss[loss=0.2555, simple_loss=0.317, pruned_loss=0.09706, over 21631.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3496, pruned_loss=0.1107, over 4275710.22 frames. ], batch size: 230, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:22:01,076 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.73 vs. 
limit=22.5 2023-06-19 18:22:04,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.354e+02 4.139e+02 5.566e+02 7.518e+02, threshold=8.277e+02, percent-clipped=7.0 2023-06-19 18:22:22,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=524994.0, ans=0.1 2023-06-19 18:22:26,081 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:22:40,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=525054.0, ans=0.125 2023-06-19 18:23:12,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525114.0, ans=0.1 2023-06-19 18:23:42,741 INFO [train.py:996] (0/4) Epoch 3, batch 26550, loss[loss=0.2788, simple_loss=0.3762, pruned_loss=0.09066, over 19796.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3467, pruned_loss=0.1073, over 4263909.19 frames. ], batch size: 703, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:23:59,334 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-19 18:24:05,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=525294.0, ans=0.05 2023-06-19 18:24:18,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=525294.0, ans=0.125 2023-06-19 18:24:26,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=525354.0, ans=0.015 2023-06-19 18:24:40,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=525354.0, ans=0.0 2023-06-19 18:25:01,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=525414.0, ans=0.2 2023-06-19 18:25:03,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=525414.0, ans=0.2 2023-06-19 18:25:30,095 INFO [train.py:996] (0/4) Epoch 3, batch 26600, loss[loss=0.3132, simple_loss=0.3574, pruned_loss=0.1345, over 19983.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3438, pruned_loss=0.1029, over 4264587.92 frames. ], batch size: 703, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:25:38,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 3.044e+02 3.646e+02 4.264e+02 8.431e+02, threshold=7.292e+02, percent-clipped=1.0 2023-06-19 18:25:45,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=525594.0, ans=0.0 2023-06-19 18:26:27,693 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:26:54,728 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=12.0 2023-06-19 18:27:02,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=525774.0, ans=0.0 2023-06-19 18:27:13,248 INFO [train.py:996] (0/4) Epoch 3, batch 26650, loss[loss=0.2594, simple_loss=0.3176, pruned_loss=0.1006, over 21889.00 frames. 
], tot_loss[loss=0.2699, simple_loss=0.3367, pruned_loss=0.1016, over 4264818.27 frames. ], batch size: 107, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:27:15,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525834.0, ans=0.1 2023-06-19 18:27:52,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=525894.0, ans=0.125 2023-06-19 18:28:07,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=525954.0, ans=0.125 2023-06-19 18:28:11,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=525954.0, ans=0.125 2023-06-19 18:28:30,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=526014.0, ans=0.1 2023-06-19 18:28:55,364 INFO [train.py:996] (0/4) Epoch 3, batch 26700, loss[loss=0.3528, simple_loss=0.3786, pruned_loss=0.1634, over 21784.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3298, pruned_loss=0.09844, over 4272266.45 frames. ], batch size: 508, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:29:03,458 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 2.681e+02 3.249e+02 4.280e+02 9.861e+02, threshold=6.499e+02, percent-clipped=1.0 2023-06-19 18:29:39,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=526194.0, ans=0.125 2023-06-19 18:29:47,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=526254.0, ans=0.1 2023-06-19 18:29:52,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=526254.0, ans=0.125 2023-06-19 18:29:52,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=526254.0, ans=0.05 2023-06-19 18:30:17,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=526314.0, ans=0.125 2023-06-19 18:30:26,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=526374.0, ans=0.125 2023-06-19 18:30:38,048 INFO [train.py:996] (0/4) Epoch 3, batch 26750, loss[loss=0.2947, simple_loss=0.3699, pruned_loss=0.1098, over 20718.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3282, pruned_loss=0.09669, over 4274887.37 frames. ], batch size: 607, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:30:38,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=526434.0, ans=0.0 2023-06-19 18:30:53,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. 
limit=15.0 2023-06-19 18:31:12,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=526494.0, ans=0.125 2023-06-19 18:31:14,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=526494.0, ans=0.0 2023-06-19 18:31:28,349 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-19 18:32:07,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=526674.0, ans=0.025 2023-06-19 18:32:35,851 INFO [train.py:996] (0/4) Epoch 3, batch 26800, loss[loss=0.307, simple_loss=0.3732, pruned_loss=0.1204, over 21226.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3385, pruned_loss=0.1031, over 4276543.34 frames. ], batch size: 143, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:32:49,018 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 3.066e+02 3.643e+02 4.361e+02 8.068e+02, threshold=7.286e+02, percent-clipped=5.0 2023-06-19 18:32:54,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=526734.0, ans=0.0 2023-06-19 18:32:54,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=526734.0, ans=0.125 2023-06-19 18:34:23,793 INFO [train.py:996] (0/4) Epoch 3, batch 26850, loss[loss=0.2468, simple_loss=0.3088, pruned_loss=0.09241, over 21792.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3405, pruned_loss=0.1063, over 4263115.69 frames. ], batch size: 118, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:34:39,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=527094.0, ans=0.0 2023-06-19 18:34:48,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=527094.0, ans=0.125 2023-06-19 18:34:53,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=527094.0, ans=0.0 2023-06-19 18:35:08,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=527154.0, ans=0.1 2023-06-19 18:35:29,419 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:35:51,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=527274.0, ans=0.125 2023-06-19 18:36:05,755 INFO [train.py:996] (0/4) Epoch 3, batch 26900, loss[loss=0.2168, simple_loss=0.2735, pruned_loss=0.08004, over 21353.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3305, pruned_loss=0.1041, over 4256090.82 frames. 
], batch size: 160, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:36:08,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=527334.0, ans=0.0 2023-06-19 18:36:14,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.946e+02 3.321e+02 4.106e+02 6.345e+02, threshold=6.642e+02, percent-clipped=0.0 2023-06-19 18:36:18,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=527334.0, ans=0.0 2023-06-19 18:36:22,422 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-19 18:36:40,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=15.0 2023-06-19 18:37:20,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=527514.0, ans=0.125 2023-06-19 18:37:45,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=527574.0, ans=0.0 2023-06-19 18:37:45,720 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.73 vs. limit=15.0 2023-06-19 18:37:49,043 INFO [train.py:996] (0/4) Epoch 3, batch 26950, loss[loss=0.3153, simple_loss=0.3879, pruned_loss=0.1214, over 21712.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3304, pruned_loss=0.104, over 4238290.90 frames. ], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:38:09,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=527694.0, ans=0.125 2023-06-19 18:38:28,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.24 vs. limit=15.0 2023-06-19 18:38:32,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-19 18:39:16,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=527874.0, ans=0.1 2023-06-19 18:39:32,590 INFO [train.py:996] (0/4) Epoch 3, batch 27000, loss[loss=0.336, simple_loss=0.3956, pruned_loss=0.1382, over 21458.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3309, pruned_loss=0.1015, over 4250378.46 frames. ], batch size: 508, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:39:32,591 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 18:39:49,116 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.2602, simple_loss=0.3579, pruned_loss=0.0813, over 1796401.00 frames. 
2023-06-19 18:39:49,117 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-19 18:39:57,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=527934.0, ans=0.0 2023-06-19 18:39:59,039 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.939e+02 3.560e+02 4.603e+02 8.017e+02, threshold=7.120e+02, percent-clipped=5.0 2023-06-19 18:40:10,739 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-88000.pt 2023-06-19 18:40:32,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=528054.0, ans=0.125 2023-06-19 18:40:44,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=528054.0, ans=0.05 2023-06-19 18:41:19,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=528174.0, ans=0.125 2023-06-19 18:41:33,643 INFO [train.py:996] (0/4) Epoch 3, batch 27050, loss[loss=0.2446, simple_loss=0.3475, pruned_loss=0.07089, over 20763.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3322, pruned_loss=0.09707, over 4255954.51 frames. ], batch size: 607, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:41:57,896 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=22.5 2023-06-19 18:43:03,902 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:43:05,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=528474.0, ans=0.05 2023-06-19 18:43:16,451 INFO [train.py:996] (0/4) Epoch 3, batch 27100, loss[loss=0.2338, simple_loss=0.3244, pruned_loss=0.07162, over 20987.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.333, pruned_loss=0.09783, over 4267997.95 frames. ], batch size: 607, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:43:30,997 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.773e+02 3.198e+02 4.013e+02 8.418e+02, threshold=6.395e+02, percent-clipped=2.0 2023-06-19 18:43:50,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=528594.0, ans=0.0 2023-06-19 18:44:15,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=528654.0, ans=0.2 2023-06-19 18:44:15,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.69 vs. limit=15.0 2023-06-19 18:44:47,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=528774.0, ans=0.125 2023-06-19 18:44:47,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=528774.0, ans=0.1 2023-06-19 18:44:49,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=528774.0, ans=0.2 2023-06-19 18:45:00,955 INFO [train.py:996] (0/4) Epoch 3, batch 27150, loss[loss=0.2852, simple_loss=0.3707, pruned_loss=0.09983, over 21744.00 frames. 
], tot_loss[loss=0.2729, simple_loss=0.344, pruned_loss=0.1009, over 4268264.54 frames. ], batch size: 298, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:45:38,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=528894.0, ans=0.0 2023-06-19 18:46:13,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=529014.0, ans=0.0 2023-06-19 18:46:26,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=529074.0, ans=0.125 2023-06-19 18:46:49,408 INFO [train.py:996] (0/4) Epoch 3, batch 27200, loss[loss=0.3291, simple_loss=0.4075, pruned_loss=0.1254, over 21664.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3541, pruned_loss=0.1051, over 4277690.09 frames. ], batch size: 441, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:46:56,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=529134.0, ans=0.0 2023-06-19 18:46:59,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.363e+02 3.936e+02 4.684e+02 8.685e+02, threshold=7.872e+02, percent-clipped=10.0 2023-06-19 18:47:40,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=529254.0, ans=0.1 2023-06-19 18:47:56,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=529314.0, ans=0.1 2023-06-19 18:48:05,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=529314.0, ans=0.125 2023-06-19 18:48:33,220 INFO [train.py:996] (0/4) Epoch 3, batch 27250, loss[loss=0.3469, simple_loss=0.4, pruned_loss=0.1469, over 21856.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3571, pruned_loss=0.1099, over 4272612.54 frames. ], batch size: 118, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:48:56,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=529434.0, ans=0.125 2023-06-19 18:48:58,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.39 vs. 
limit=15.0 2023-06-19 18:49:23,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=529554.0, ans=0.0 2023-06-19 18:49:28,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=529554.0, ans=10.0 2023-06-19 18:49:30,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=529554.0, ans=0.2 2023-06-19 18:49:37,126 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:49:38,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=529614.0, ans=0.125 2023-06-19 18:49:50,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=529614.0, ans=0.0 2023-06-19 18:50:15,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=529674.0, ans=0.07 2023-06-19 18:50:27,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=529734.0, ans=0.125 2023-06-19 18:50:28,591 INFO [train.py:996] (0/4) Epoch 3, batch 27300, loss[loss=0.3223, simple_loss=0.3933, pruned_loss=0.1256, over 21710.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3598, pruned_loss=0.1116, over 4272877.08 frames. ], batch size: 351, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:50:43,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.140e+02 3.530e+02 4.339e+02 7.752e+02, threshold=7.060e+02, percent-clipped=0.0 2023-06-19 18:50:50,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=529794.0, ans=0.0 2023-06-19 18:50:50,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=529794.0, ans=0.125 2023-06-19 18:51:41,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=529914.0, ans=0.0 2023-06-19 18:52:17,073 INFO [train.py:996] (0/4) Epoch 3, batch 27350, loss[loss=0.2596, simple_loss=0.3444, pruned_loss=0.08744, over 21599.00 frames. ], tot_loss[loss=0.2941, simple_loss=0.3621, pruned_loss=0.113, over 4275915.12 frames. ], batch size: 230, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:53:11,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=530214.0, ans=0.2 2023-06-19 18:53:11,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.39 vs. limit=22.5 2023-06-19 18:53:31,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=530274.0, ans=0.125 2023-06-19 18:53:43,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=530274.0, ans=0.0 2023-06-19 18:53:58,610 INFO [train.py:996] (0/4) Epoch 3, batch 27400, loss[loss=0.258, simple_loss=0.3107, pruned_loss=0.1027, over 21619.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3574, pruned_loss=0.1123, over 4281080.39 frames. 
], batch size: 263, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:54:08,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=530334.0, ans=0.0 2023-06-19 18:54:09,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.21 vs. limit=5.0 2023-06-19 18:54:09,813 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.048e+02 3.444e+02 4.008e+02 7.916e+02, threshold=6.888e+02, percent-clipped=1.0 2023-06-19 18:54:10,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530334.0, ans=0.1 2023-06-19 18:55:03,307 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=12.0 2023-06-19 18:55:39,664 INFO [train.py:996] (0/4) Epoch 3, batch 27450, loss[loss=0.2573, simple_loss=0.3445, pruned_loss=0.08502, over 21712.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3503, pruned_loss=0.1103, over 4280168.19 frames. ], batch size: 332, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:55:44,182 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-19 18:55:49,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=12.0 2023-06-19 18:56:11,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=530754.0, ans=0.0 2023-06-19 18:57:06,253 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.94 vs. limit=12.0 2023-06-19 18:57:21,080 INFO [train.py:996] (0/4) Epoch 3, batch 27500, loss[loss=0.3006, simple_loss=0.3527, pruned_loss=0.1242, over 21768.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3485, pruned_loss=0.11, over 4284890.14 frames. ], batch size: 441, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:57:32,525 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.096e+02 3.719e+02 4.715e+02 7.955e+02, threshold=7.439e+02, percent-clipped=2.0 2023-06-19 18:58:28,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=531114.0, ans=0.2 2023-06-19 18:58:35,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=531114.0, ans=0.0 2023-06-19 18:59:01,872 INFO [train.py:996] (0/4) Epoch 3, batch 27550, loss[loss=0.2206, simple_loss=0.2923, pruned_loss=0.07448, over 21371.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3412, pruned_loss=0.1058, over 4280845.31 frames. ], batch size: 211, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:59:35,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.12 vs. limit=15.0 2023-06-19 18:59:55,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-19 19:00:09,017 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. 
limit=15.0 2023-06-19 19:00:42,108 INFO [train.py:996] (0/4) Epoch 3, batch 27600, loss[loss=0.2521, simple_loss=0.3116, pruned_loss=0.09627, over 21573.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3342, pruned_loss=0.1044, over 4284102.62 frames. ], batch size: 391, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:00:52,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=531534.0, ans=0.125 2023-06-19 19:00:53,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.652e+02 3.386e+02 4.273e+02 7.001e+02, threshold=6.773e+02, percent-clipped=0.0 2023-06-19 19:00:55,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=531534.0, ans=0.2 2023-06-19 19:00:55,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=531534.0, ans=0.5 2023-06-19 19:00:58,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=531594.0, ans=0.2 2023-06-19 19:01:22,443 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.47 vs. limit=22.5 2023-06-19 19:02:23,379 INFO [train.py:996] (0/4) Epoch 3, batch 27650, loss[loss=0.2797, simple_loss=0.3433, pruned_loss=0.1081, over 21339.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3278, pruned_loss=0.1036, over 4276871.03 frames. ], batch size: 159, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:02:40,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=531894.0, ans=0.0 2023-06-19 19:02:49,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=531894.0, ans=0.125 2023-06-19 19:02:58,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=531954.0, ans=0.2 2023-06-19 19:03:20,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-19 19:03:55,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=22.5 2023-06-19 19:04:03,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=532074.0, ans=0.125 2023-06-19 19:04:05,901 INFO [train.py:996] (0/4) Epoch 3, batch 27700, loss[loss=0.3451, simple_loss=0.4181, pruned_loss=0.1361, over 20883.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3287, pruned_loss=0.1013, over 4279424.67 frames. ], batch size: 608, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:04:16,954 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.29 vs. limit=22.5 2023-06-19 19:04:17,187 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.929e+02 3.310e+02 4.348e+02 7.080e+02, threshold=6.619e+02, percent-clipped=1.0 2023-06-19 19:05:47,587 INFO [train.py:996] (0/4) Epoch 3, batch 27750, loss[loss=0.248, simple_loss=0.3411, pruned_loss=0.07748, over 21281.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3331, pruned_loss=0.1012, over 4279238.46 frames. 
], batch size: 548, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:06:12,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=532494.0, ans=0.0 2023-06-19 19:06:36,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=532554.0, ans=0.0 2023-06-19 19:06:39,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=15.0 2023-06-19 19:07:03,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=532614.0, ans=0.0 2023-06-19 19:07:29,264 INFO [train.py:996] (0/4) Epoch 3, batch 27800, loss[loss=0.2676, simple_loss=0.3274, pruned_loss=0.1039, over 21877.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.332, pruned_loss=0.1019, over 4284241.12 frames. ], batch size: 351, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:07:40,254 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.741e+02 3.231e+02 4.040e+02 7.271e+02, threshold=6.461e+02, percent-clipped=1.0 2023-06-19 19:08:32,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=532914.0, ans=0.1 2023-06-19 19:09:12,029 INFO [train.py:996] (0/4) Epoch 3, batch 27850, loss[loss=0.338, simple_loss=0.4058, pruned_loss=0.1351, over 21573.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3322, pruned_loss=0.1039, over 4292939.80 frames. ], batch size: 471, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:09:29,640 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-19 19:09:35,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=533094.0, ans=10.0 2023-06-19 19:10:06,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=533154.0, ans=0.05 2023-06-19 19:10:20,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=533214.0, ans=0.2 2023-06-19 19:10:45,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=533274.0, ans=0.125 2023-06-19 19:10:58,478 INFO [train.py:996] (0/4) Epoch 3, batch 27900, loss[loss=0.2432, simple_loss=0.3212, pruned_loss=0.08264, over 21416.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3407, pruned_loss=0.1051, over 4288238.76 frames. ], batch size: 194, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:11:12,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=533334.0, ans=0.125 2023-06-19 19:11:16,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 3.014e+02 3.653e+02 4.966e+02 8.433e+02, threshold=7.306e+02, percent-clipped=7.0 2023-06-19 19:12:22,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=533514.0, ans=0.125 2023-06-19 19:12:47,278 INFO [train.py:996] (0/4) Epoch 3, batch 27950, loss[loss=0.2515, simple_loss=0.3385, pruned_loss=0.08225, over 21730.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3399, pruned_loss=0.1006, over 4279488.02 frames. 
], batch size: 247, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:13:08,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=533694.0, ans=0.125 2023-06-19 19:13:15,575 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.55 vs. limit=10.0 2023-06-19 19:13:23,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=533694.0, ans=0.125 2023-06-19 19:13:44,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.73 vs. limit=6.0 2023-06-19 19:13:56,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=533814.0, ans=0.125 2023-06-19 19:14:04,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=533814.0, ans=0.2 2023-06-19 19:14:19,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=533874.0, ans=0.125 2023-06-19 19:14:24,770 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:14:28,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=533874.0, ans=0.125 2023-06-19 19:14:29,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=533934.0, ans=0.0 2023-06-19 19:14:30,715 INFO [train.py:996] (0/4) Epoch 3, batch 28000, loss[loss=0.2807, simple_loss=0.3429, pruned_loss=0.1092, over 21863.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3376, pruned_loss=0.09851, over 4283313.01 frames. ], batch size: 351, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:14:48,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.713e+02 3.305e+02 4.071e+02 8.310e+02, threshold=6.609e+02, percent-clipped=2.0 2023-06-19 19:14:49,669 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:14:54,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=533994.0, ans=0.125 2023-06-19 19:15:11,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=533994.0, ans=0.2 2023-06-19 19:15:23,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=534054.0, ans=0.1 2023-06-19 19:15:23,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=534054.0, ans=0.0 2023-06-19 19:15:32,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.72 vs. limit=15.0 2023-06-19 19:15:48,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=534114.0, ans=0.125 2023-06-19 19:16:20,858 INFO [train.py:996] (0/4) Epoch 3, batch 28050, loss[loss=0.2226, simple_loss=0.2745, pruned_loss=0.0853, over 21172.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3349, pruned_loss=0.09956, over 4284792.53 frames. 
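The scaling.py:182 entries ("ScheduledFloat: name=..., batch_count=..., ans=...") record module hyperparameters such as dropout probabilities, skip rates, bypass scale minima and balancer limits whose values are functions of the global batch count rather than constants, with the printed ans apparently being the value in effect at that batch_count. A sketch of a piecewise-linear schedule that produces this kind of behaviour, with breakpoints chosen purely for illustration (the real ScheduledFloat in scaling.py has additional machinery):

    import bisect

    class PiecewiseLinearSchedule:
        """A float hyperparameter given by (batch_count, value) breakpoints,
        linearly interpolated between them and clamped at both ends."""

        def __init__(self, *points):
            self.xs = [float(x) for x, _ in points]
            self.ys = [float(y) for _, y in points]

        def __call__(self, batch_count):
            if batch_count <= self.xs[0]:
                return self.ys[0]
            if batch_count >= self.xs[-1]:
                return self.ys[-1]
            i = bisect.bisect_right(self.xs, batch_count)
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

    # Hypothetical schedule: a dropout probability that decays early in training and
    # then stays flat, so late in training it reads 0.1 like the
    # feed_forward1.out_proj.dropout_p entries above.
    dropout_p = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1))
    print(dropout_p(532914.0))  # -> 0.1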
], batch size: 143, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:16:21,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=534234.0, ans=0.125 2023-06-19 19:16:43,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.41 vs. limit=6.0 2023-06-19 19:17:24,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=534414.0, ans=0.1 2023-06-19 19:17:43,998 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:18:02,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=534534.0, ans=0.1 2023-06-19 19:18:03,152 INFO [train.py:996] (0/4) Epoch 3, batch 28100, loss[loss=0.2475, simple_loss=0.293, pruned_loss=0.101, over 21180.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3343, pruned_loss=0.1012, over 4277747.54 frames. ], batch size: 176, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:18:21,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.997e+02 3.623e+02 4.325e+02 7.130e+02, threshold=7.246e+02, percent-clipped=1.0 2023-06-19 19:18:49,950 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:19:06,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=534714.0, ans=0.125 2023-06-19 19:19:44,136 INFO [train.py:996] (0/4) Epoch 3, batch 28150, loss[loss=0.2713, simple_loss=0.3032, pruned_loss=0.1197, over 21553.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3283, pruned_loss=0.1013, over 4282632.14 frames. ], batch size: 512, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:20:10,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=534894.0, ans=0.2 2023-06-19 19:21:27,400 INFO [train.py:996] (0/4) Epoch 3, batch 28200, loss[loss=0.2914, simple_loss=0.3462, pruned_loss=0.1183, over 21713.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3271, pruned_loss=0.1036, over 4276340.43 frames. ], batch size: 282, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:21:50,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.171e+02 3.933e+02 4.825e+02 1.002e+03, threshold=7.866e+02, percent-clipped=3.0 2023-06-19 19:21:52,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=535194.0, ans=0.2 2023-06-19 19:22:08,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=535254.0, ans=0.125 2023-06-19 19:23:19,294 INFO [train.py:996] (0/4) Epoch 3, batch 28250, loss[loss=0.3163, simple_loss=0.3396, pruned_loss=0.1465, over 21288.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3336, pruned_loss=0.1077, over 4268924.41 frames. ], batch size: 507, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:23:42,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.58 vs. 
limit=15.0 2023-06-19 19:24:00,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=22.5 2023-06-19 19:24:59,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=535734.0, ans=0.125 2023-06-19 19:25:00,432 INFO [train.py:996] (0/4) Epoch 3, batch 28300, loss[loss=0.2401, simple_loss=0.3228, pruned_loss=0.07873, over 21191.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3306, pruned_loss=0.1049, over 4260426.41 frames. ], batch size: 548, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:25:13,880 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.857e+02 3.337e+02 4.140e+02 8.167e+02, threshold=6.674e+02, percent-clipped=3.0 2023-06-19 19:25:16,081 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:25:26,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=535794.0, ans=0.0 2023-06-19 19:26:43,628 INFO [train.py:996] (0/4) Epoch 3, batch 28350, loss[loss=0.2436, simple_loss=0.3082, pruned_loss=0.08944, over 21665.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3248, pruned_loss=0.09757, over 4266869.82 frames. ], batch size: 332, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:26:44,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-19 19:26:57,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=536034.0, ans=0.0 2023-06-19 19:27:11,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=536094.0, ans=0.125 2023-06-19 19:27:21,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=536154.0, ans=0.125 2023-06-19 19:27:42,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=536214.0, ans=0.0 2023-06-19 19:28:25,759 INFO [train.py:996] (0/4) Epoch 3, batch 28400, loss[loss=0.2805, simple_loss=0.3334, pruned_loss=0.1138, over 21657.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3216, pruned_loss=0.09717, over 4264678.20 frames. ], batch size: 332, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:28:35,306 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-19 19:28:44,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.716e+02 3.452e+02 4.220e+02 6.740e+02, threshold=6.905e+02, percent-clipped=2.0 2023-06-19 19:28:46,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=536394.0, ans=0.1 2023-06-19 19:28:54,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=536394.0, ans=0.1 2023-06-19 19:29:18,853 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.92 vs. 
limit=6.0 2023-06-19 19:29:19,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=536454.0, ans=0.2 2023-06-19 19:29:25,167 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-06-19 19:30:03,713 INFO [train.py:996] (0/4) Epoch 3, batch 28450, loss[loss=0.3206, simple_loss=0.3711, pruned_loss=0.1351, over 21886.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3287, pruned_loss=0.1028, over 4273998.35 frames. ], batch size: 351, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:30:10,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=536634.0, ans=0.125 2023-06-19 19:30:27,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=536694.0, ans=0.2 2023-06-19 19:30:29,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5 2023-06-19 19:30:33,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=536694.0, ans=0.125 2023-06-19 19:31:11,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=536814.0, ans=0.0 2023-06-19 19:31:17,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=536814.0, ans=0.0 2023-06-19 19:31:31,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-19 19:31:36,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=536874.0, ans=0.125 2023-06-19 19:31:42,106 INFO [train.py:996] (0/4) Epoch 3, batch 28500, loss[loss=0.3308, simple_loss=0.3927, pruned_loss=0.1345, over 21477.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3309, pruned_loss=0.105, over 4277781.38 frames. ], batch size: 131, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:32:01,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.952e+02 3.630e+02 4.610e+02 9.107e+02, threshold=7.260e+02, percent-clipped=2.0 2023-06-19 19:32:09,716 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-19 19:32:25,269 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. limit=10.0 2023-06-19 19:32:33,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=537054.0, ans=0.1 2023-06-19 19:32:58,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.81 vs. limit=15.0 2023-06-19 19:33:10,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=537174.0, ans=0.125 2023-06-19 19:33:30,096 INFO [train.py:996] (0/4) Epoch 3, batch 28550, loss[loss=0.272, simple_loss=0.3602, pruned_loss=0.0919, over 21415.00 frames. 
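The optim.py:471 entries ("Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=...") summarise gradient clipping: five values that appear to be the min, 25%, median, 75% and max of recent gradient norms, the clipping threshold in effect, and presumably the percentage of recent batches whose gradients exceeded it. In these entries the threshold is roughly twice the logged median (e.g. 7.260e+02 against a median of 3.630e+02), consistent with threshold = Clipping_scale x median. A sketch of that bookkeeping under simple assumptions (a fixed window of recent norms; the actual ScaledAdam optimizer may track this differently):

    import torch

    class GradNormClipper:
        def __init__(self, clipping_scale=2.0, window=128):
            self.clipping_scale = clipping_scale
            self.window = window
            self.norms = []        # recent total gradient norms
            self.num_clipped = 0
            self.num_batches = 0

        def clip_(self, parameters):
            params = [p for p in parameters if p.grad is not None]
            total_norm = torch.norm(
                torch.stack([p.grad.detach().norm(2) for p in params]), 2
            ).item()
            self.norms = (self.norms + [total_norm])[-self.window:]
            self.num_batches += 1
            # Threshold: clipping_scale times the median of the recent norms.
            threshold = self.clipping_scale * float(torch.tensor(self.norms).median())
            if total_norm > threshold:
                self.num_clipped += 1
                for p in params:
                    p.grad.mul_(threshold / total_norm)
            return total_norm, threshold

        def stats(self):
            qs = torch.quantile(torch.tensor(self.norms),
                                torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            pct = 100.0 * self.num_clipped / max(self.num_batches, 1)
            return qs.tolist(), pct  # quartiles and percent-clipped, as in the log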
], tot_loss[loss=0.2768, simple_loss=0.3386, pruned_loss=0.1075, over 4280197.43 frames. ], batch size: 194, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:33:30,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=537234.0, ans=0.0 2023-06-19 19:34:03,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=537294.0, ans=0.0 2023-06-19 19:35:08,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=537474.0, ans=0.125 2023-06-19 19:35:14,032 INFO [train.py:996] (0/4) Epoch 3, batch 28600, loss[loss=0.287, simple_loss=0.3572, pruned_loss=0.1084, over 21538.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3459, pruned_loss=0.11, over 4279073.12 frames. ], batch size: 112, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:35:24,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=537534.0, ans=0.0 2023-06-19 19:35:38,661 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.072e+02 3.686e+02 4.724e+02 8.342e+02, threshold=7.372e+02, percent-clipped=3.0 2023-06-19 19:35:39,870 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=15.0 2023-06-19 19:35:58,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=537594.0, ans=0.0 2023-06-19 19:36:21,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=537714.0, ans=0.1 2023-06-19 19:37:00,664 INFO [train.py:996] (0/4) Epoch 3, batch 28650, loss[loss=0.2743, simple_loss=0.3142, pruned_loss=0.1172, over 21887.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3406, pruned_loss=0.1091, over 4273554.08 frames. ], batch size: 107, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:37:49,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=537954.0, ans=0.0 2023-06-19 19:37:54,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=537954.0, ans=0.125 2023-06-19 19:38:20,211 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:38:37,697 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:38:42,155 INFO [train.py:996] (0/4) Epoch 3, batch 28700, loss[loss=0.2972, simple_loss=0.3579, pruned_loss=0.1182, over 21942.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3383, pruned_loss=0.1098, over 4271465.80 frames. 
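The scaling.py:962 "Whitening" entries report a metric against a limit for a named activation (e.g. metric=3.00 vs. limit=15.0, num_groups=1, num_channels=256); the metric measures how far the within-group channel covariance of that activation is from being white (all eigenvalues equal), and the limit is, by its naming, the ceiling the module tries to keep the metric under. One way to define such a metric, assumed here purely for illustration, is the ratio of the mean squared eigenvalue of the covariance to the square of its mean eigenvalue: it equals 1.0 for a perfectly white covariance and grows as a few directions start to dominate (the exact formula in scaling.py may differ):

    import torch

    def whitening_metric(x, num_groups=1):
        """x: (num_frames, num_channels). Returns E[lambda^2] / (E[lambda])^2 of the
        within-group channel covariance: 1.0 if the covariance is proportional to
        the identity, larger when the eigenvalue spectrum is uneven."""
        num_frames, num_channels = x.shape
        d = num_channels // num_groups
        x = x.reshape(num_frames, num_groups, d).transpose(0, 1)   # (groups, frames, d)
        x = x - x.mean(dim=1, keepdim=True)
        cov = torch.matmul(x.transpose(1, 2), x) / num_frames      # (groups, d, d)
        mean_eig = cov.diagonal(dim1=1, dim2=2).sum(dim=-1) / d    # trace(C) / d
        mean_sq_eig = (cov * cov).sum(dim=(1, 2)) / d              # trace(C @ C) / d, C symmetric
        return (mean_sq_eig / mean_eig.pow(2)).mean().item()

    x = torch.randn(2000, 256)                           # roughly white activations
    print(whitening_metric(x))                           # close to 1.0
    print(whitening_metric(x @ torch.randn(256, 256)))   # much larger: no longer white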
], batch size: 372, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:38:59,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=538134.0, ans=0.0 2023-06-19 19:39:01,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.971e+02 3.318e+02 4.254e+02 6.959e+02, threshold=6.637e+02, percent-clipped=0.0 2023-06-19 19:39:02,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=538194.0, ans=0.1 2023-06-19 19:39:41,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=538314.0, ans=0.05 2023-06-19 19:40:14,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=538374.0, ans=0.125 2023-06-19 19:40:24,361 INFO [train.py:996] (0/4) Epoch 3, batch 28750, loss[loss=0.2518, simple_loss=0.3402, pruned_loss=0.08174, over 21848.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3385, pruned_loss=0.1094, over 4277011.70 frames. ], batch size: 371, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:40:58,882 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.24 vs. limit=10.0 2023-06-19 19:41:08,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=538554.0, ans=0.125 2023-06-19 19:41:51,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=538674.0, ans=0.0 2023-06-19 19:42:11,960 INFO [train.py:996] (0/4) Epoch 3, batch 28800, loss[loss=0.329, simple_loss=0.3817, pruned_loss=0.1382, over 21857.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3418, pruned_loss=0.1096, over 4282521.74 frames. ], batch size: 282, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:42:14,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=538734.0, ans=0.125 2023-06-19 19:42:31,961 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.000e+02 3.700e+02 5.247e+02 1.056e+03, threshold=7.400e+02, percent-clipped=15.0 2023-06-19 19:42:40,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=538794.0, ans=0.2 2023-06-19 19:43:05,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=538854.0, ans=0.125 2023-06-19 19:43:12,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=538914.0, ans=0.125 2023-06-19 19:43:20,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=538914.0, ans=0.1 2023-06-19 19:43:59,889 INFO [train.py:996] (0/4) Epoch 3, batch 28850, loss[loss=0.3171, simple_loss=0.3694, pruned_loss=0.1324, over 21565.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3479, pruned_loss=0.1126, over 4276058.17 frames. ], batch size: 471, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:44:29,612 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. 
limit=15.0 2023-06-19 19:45:04,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-19 19:45:21,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=539274.0, ans=0.125 2023-06-19 19:45:43,233 INFO [train.py:996] (0/4) Epoch 3, batch 28900, loss[loss=0.305, simple_loss=0.3597, pruned_loss=0.1252, over 21914.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.353, pruned_loss=0.1152, over 4280152.18 frames. ], batch size: 316, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:45:55,413 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:45:58,272 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 3.259e+02 3.859e+02 4.928e+02 8.850e+02, threshold=7.718e+02, percent-clipped=4.0 2023-06-19 19:46:01,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=539394.0, ans=0.125 2023-06-19 19:46:32,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=539454.0, ans=0.0 2023-06-19 19:46:34,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=539454.0, ans=0.0 2023-06-19 19:47:07,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=539514.0, ans=0.125 2023-06-19 19:47:26,896 INFO [train.py:996] (0/4) Epoch 3, batch 28950, loss[loss=0.2568, simple_loss=0.3439, pruned_loss=0.08488, over 21820.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3505, pruned_loss=0.1139, over 4274056.58 frames. ], batch size: 316, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:47:49,593 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.56 vs. limit=12.0 2023-06-19 19:48:15,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=539754.0, ans=0.0 2023-06-19 19:48:42,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-19 19:49:13,108 INFO [train.py:996] (0/4) Epoch 3, batch 29000, loss[loss=0.2806, simple_loss=0.3725, pruned_loss=0.09437, over 20805.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3532, pruned_loss=0.1131, over 4270267.59 frames. ], batch size: 608, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:49:24,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=539934.0, ans=0.2 2023-06-19 19:49:27,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.902e+02 3.366e+02 4.190e+02 7.172e+02, threshold=6.731e+02, percent-clipped=0.0 2023-06-19 19:50:29,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-19 19:50:51,844 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.97 vs. 
limit=10.0 2023-06-19 19:50:55,868 INFO [train.py:996] (0/4) Epoch 3, batch 29050, loss[loss=0.3412, simple_loss=0.372, pruned_loss=0.1552, over 21782.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3537, pruned_loss=0.1141, over 4275605.52 frames. ], batch size: 508, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:51:03,681 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=15.0 2023-06-19 19:51:48,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=540354.0, ans=0.125 2023-06-19 19:52:04,310 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.38 vs. limit=15.0 2023-06-19 19:52:14,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=540414.0, ans=0.125 2023-06-19 19:52:38,133 INFO [train.py:996] (0/4) Epoch 3, batch 29100, loss[loss=0.2188, simple_loss=0.2849, pruned_loss=0.07637, over 21778.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3439, pruned_loss=0.1101, over 4267814.46 frames. ], batch size: 351, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:52:57,600 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.942e+02 3.636e+02 4.444e+02 9.761e+02, threshold=7.273e+02, percent-clipped=4.0 2023-06-19 19:53:22,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540654.0, ans=0.1 2023-06-19 19:53:35,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540654.0, ans=0.1 2023-06-19 19:54:11,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=540774.0, ans=0.0 2023-06-19 19:54:18,769 INFO [train.py:996] (0/4) Epoch 3, batch 29150, loss[loss=0.266, simple_loss=0.3374, pruned_loss=0.09728, over 21229.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3403, pruned_loss=0.1073, over 4269279.42 frames. ], batch size: 548, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:54:22,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=540834.0, ans=0.2 2023-06-19 19:55:09,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=540954.0, ans=0.1 2023-06-19 19:55:39,001 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.59 vs. limit=15.0 2023-06-19 19:55:57,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=541134.0, ans=0.125 2023-06-19 19:55:58,507 INFO [train.py:996] (0/4) Epoch 3, batch 29200, loss[loss=0.2474, simple_loss=0.2934, pruned_loss=0.1007, over 21455.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3352, pruned_loss=0.1063, over 4266695.89 frames. 
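Each batch entry also carries a grad_scale value, and across this section it occasionally drops (32.0 to 16.0) before returning to 32.0; this is the signature of a dynamic loss scaler for fp16 training, which shrinks the scale when gradients overflow and grows it back after a run of clean steps. A generic sketch of that loop using PyTorch's standard AMP utilities (standard torch.cuda.amp usage, not the exact code in train.py):

    import torch

    def fp16_train_step(model, optimizer, scaler, features, targets, device="cuda"):
        """One mixed-precision step; scaler.get_scale() is what a log would
        report as grad_scale."""
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=True):
            loss = model(features.to(device), targets.to(device))
        scaler.scale(loss).backward()   # backward pass on the scaled loss
        scaler.step(optimizer)          # skipped internally if any grad overflowed
        scaler.update()                 # halve the scale on overflow, grow it otherwise
        return loss.detach(), scaler.get_scale()

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)  # matches the grad_scale: 32.0 entries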
], batch size: 195, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:56:18,621 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.074e+02 3.815e+02 4.848e+02 9.248e+02, threshold=7.630e+02, percent-clipped=3.0 2023-06-19 19:56:29,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=541194.0, ans=0.125 2023-06-19 19:57:27,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=541374.0, ans=0.125 2023-06-19 19:57:34,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.52 vs. limit=5.0 2023-06-19 19:57:36,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=541374.0, ans=0.0 2023-06-19 19:57:41,203 INFO [train.py:996] (0/4) Epoch 3, batch 29250, loss[loss=0.2706, simple_loss=0.341, pruned_loss=0.1001, over 21427.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3313, pruned_loss=0.1027, over 4260749.81 frames. ], batch size: 195, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 19:57:45,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=541434.0, ans=0.0 2023-06-19 19:58:37,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=541554.0, ans=0.125 2023-06-19 19:59:01,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=541614.0, ans=0.125 2023-06-19 19:59:01,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=541614.0, ans=0.125 2023-06-19 19:59:28,978 INFO [train.py:996] (0/4) Epoch 3, batch 29300, loss[loss=0.2982, simple_loss=0.3657, pruned_loss=0.1154, over 21697.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3352, pruned_loss=0.1028, over 4263252.34 frames. ], batch size: 298, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 19:59:32,886 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-19 19:59:34,661 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0 2023-06-19 19:59:35,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=541734.0, ans=0.125 2023-06-19 19:59:49,550 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.065e+02 3.693e+02 4.587e+02 7.138e+02, threshold=7.387e+02, percent-clipped=0.0 2023-06-19 20:00:18,844 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.45 vs. limit=15.0 2023-06-19 20:01:11,344 INFO [train.py:996] (0/4) Epoch 3, batch 29350, loss[loss=0.2865, simple_loss=0.3632, pruned_loss=0.1049, over 21610.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3316, pruned_loss=0.1024, over 4263468.49 frames. 
], batch size: 442, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:01:46,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=542094.0, ans=0.0 2023-06-19 20:02:22,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=542214.0, ans=0.0 2023-06-19 20:02:25,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=542214.0, ans=0.125 2023-06-19 20:02:48,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=542274.0, ans=0.0 2023-06-19 20:02:51,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=542274.0, ans=0.125 2023-06-19 20:02:59,339 INFO [train.py:996] (0/4) Epoch 3, batch 29400, loss[loss=0.2397, simple_loss=0.3159, pruned_loss=0.08173, over 20013.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3314, pruned_loss=0.0999, over 4265935.62 frames. ], batch size: 703, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:03:00,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=542334.0, ans=0.125 2023-06-19 20:03:16,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=542334.0, ans=0.05 2023-06-19 20:03:20,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542394.0, ans=0.1 2023-06-19 20:03:21,679 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.918e+02 3.507e+02 4.489e+02 7.938e+02, threshold=7.015e+02, percent-clipped=2.0 2023-06-19 20:03:23,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=542394.0, ans=0.2 2023-06-19 20:04:43,536 INFO [train.py:996] (0/4) Epoch 3, batch 29450, loss[loss=0.241, simple_loss=0.3288, pruned_loss=0.07661, over 19944.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3295, pruned_loss=0.09876, over 4263997.08 frames. ], batch size: 703, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:04:55,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=542634.0, ans=0.125 2023-06-19 20:06:21,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=542874.0, ans=0.0 2023-06-19 20:06:29,871 INFO [train.py:996] (0/4) Epoch 3, batch 29500, loss[loss=0.2744, simple_loss=0.3301, pruned_loss=0.1093, over 21939.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3337, pruned_loss=0.1027, over 4267836.30 frames. ], batch size: 333, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:06:45,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 3.087e+02 3.959e+02 5.251e+02 8.059e+02, threshold=7.918e+02, percent-clipped=6.0 2023-06-19 20:06:47,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=542994.0, ans=0.125 2023-06-19 20:07:16,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.24 vs. 
limit=15.0 2023-06-19 20:07:24,048 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:07:27,891 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-19 20:07:48,453 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-06-19 20:08:10,035 INFO [train.py:996] (0/4) Epoch 3, batch 29550, loss[loss=0.2515, simple_loss=0.3144, pruned_loss=0.09434, over 21859.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3342, pruned_loss=0.1044, over 4273010.97 frames. ], batch size: 298, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:08:21,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=543234.0, ans=0.2 2023-06-19 20:09:54,131 INFO [train.py:996] (0/4) Epoch 3, batch 29600, loss[loss=0.2975, simple_loss=0.3687, pruned_loss=0.1131, over 21474.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3407, pruned_loss=0.1069, over 4273981.41 frames. ], batch size: 194, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:10:15,952 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.027e+02 3.599e+02 4.338e+02 7.072e+02, threshold=7.197e+02, percent-clipped=0.0 2023-06-19 20:10:18,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-19 20:10:19,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=543594.0, ans=0.125 2023-06-19 20:11:22,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=543774.0, ans=0.125 2023-06-19 20:11:34,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=543774.0, ans=0.125 2023-06-19 20:11:36,798 INFO [train.py:996] (0/4) Epoch 3, batch 29650, loss[loss=0.2133, simple_loss=0.2807, pruned_loss=0.07296, over 21586.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3367, pruned_loss=0.1025, over 4273779.65 frames. ], batch size: 230, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:12:15,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=543954.0, ans=0.1 2023-06-19 20:12:49,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=544014.0, ans=0.2 2023-06-19 20:13:08,371 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.25 vs. limit=15.0 2023-06-19 20:13:17,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=544074.0, ans=0.0 2023-06-19 20:13:20,512 INFO [train.py:996] (0/4) Epoch 3, batch 29700, loss[loss=0.2965, simple_loss=0.3745, pruned_loss=0.1093, over 21184.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3391, pruned_loss=0.103, over 4280173.71 frames. 
], batch size: 143, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:13:20,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544134.0, ans=0.1 2023-06-19 20:13:38,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=544134.0, ans=0.04949747468305833 2023-06-19 20:13:41,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.649e+02 2.987e+02 3.970e+02 7.304e+02, threshold=5.973e+02, percent-clipped=1.0 2023-06-19 20:13:45,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=544194.0, ans=0.2 2023-06-19 20:13:45,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=544194.0, ans=0.2 2023-06-19 20:14:33,582 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-19 20:14:57,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=544374.0, ans=0.035 2023-06-19 20:15:01,846 INFO [train.py:996] (0/4) Epoch 3, batch 29750, loss[loss=0.2488, simple_loss=0.3269, pruned_loss=0.08534, over 21327.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3434, pruned_loss=0.1031, over 4275704.74 frames. ], batch size: 159, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:15:40,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=544554.0, ans=0.0 2023-06-19 20:16:27,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=544674.0, ans=0.125 2023-06-19 20:16:38,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=544674.0, ans=0.125 2023-06-19 20:16:47,589 INFO [train.py:996] (0/4) Epoch 3, batch 29800, loss[loss=0.2746, simple_loss=0.3313, pruned_loss=0.109, over 21453.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3465, pruned_loss=0.1055, over 4284364.01 frames. 
], batch size: 194, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:16:53,108 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:16:57,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=544734.0, ans=0.0 2023-06-19 20:17:04,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=544794.0, ans=0.125 2023-06-19 20:17:05,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.342e+02 4.045e+02 4.978e+02 1.039e+03, threshold=8.090e+02, percent-clipped=10.0 2023-06-19 20:17:16,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=544794.0, ans=0.0 2023-06-19 20:17:23,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544854.0, ans=0.1 2023-06-19 20:17:28,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=544854.0, ans=0.125 2023-06-19 20:17:40,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=544914.0, ans=0.0 2023-06-19 20:17:42,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=544914.0, ans=0.2 2023-06-19 20:17:52,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=544914.0, ans=0.0 2023-06-19 20:18:15,588 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.52 vs. limit=15.0 2023-06-19 20:18:22,270 INFO [train.py:996] (0/4) Epoch 3, batch 29850, loss[loss=0.2457, simple_loss=0.3051, pruned_loss=0.09312, over 21752.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3425, pruned_loss=0.103, over 4287413.80 frames. ], batch size: 231, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:19:09,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545154.0, ans=0.1 2023-06-19 20:19:18,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545154.0, ans=0.1 2023-06-19 20:20:08,647 INFO [train.py:996] (0/4) Epoch 3, batch 29900, loss[loss=0.2176, simple_loss=0.2945, pruned_loss=0.07041, over 20826.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3391, pruned_loss=0.1034, over 4290575.34 frames. ], batch size: 608, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:20:26,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.681e+02 3.110e+02 3.688e+02 5.256e+02, threshold=6.220e+02, percent-clipped=0.0 2023-06-19 20:20:32,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=545394.0, ans=0.125 2023-06-19 20:21:09,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. 
limit=6.0 2023-06-19 20:21:37,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=545574.0, ans=0.125 2023-06-19 20:21:40,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=545574.0, ans=0.125 2023-06-19 20:21:46,641 INFO [train.py:996] (0/4) Epoch 3, batch 29950, loss[loss=0.3178, simple_loss=0.3696, pruned_loss=0.133, over 21519.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3448, pruned_loss=0.1094, over 4293105.44 frames. ], batch size: 194, lr: 9.99e-03, grad_scale: 16.0 2023-06-19 20:21:48,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=545634.0, ans=0.2 2023-06-19 20:22:22,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=545694.0, ans=0.1 2023-06-19 20:22:59,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-19 20:23:29,430 INFO [train.py:996] (0/4) Epoch 3, batch 30000, loss[loss=0.2404, simple_loss=0.3277, pruned_loss=0.07656, over 21796.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3476, pruned_loss=0.1096, over 4290161.63 frames. ], batch size: 282, lr: 9.99e-03, grad_scale: 32.0 2023-06-19 20:23:29,431 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 20:23:45,897 INFO [train.py:1028] (0/4) Epoch 3, validation: loss=0.254, simple_loss=0.3581, pruned_loss=0.075, over 1796401.00 frames. 2023-06-19 20:23:45,898 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-19 20:23:49,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=22.5 2023-06-19 20:24:08,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545994.0, ans=0.1 2023-06-19 20:24:15,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.901e+02 3.447e+02 4.272e+02 9.118e+02, threshold=6.893e+02, percent-clipped=6.0 2023-06-19 20:25:12,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=546114.0, ans=0.125 2023-06-19 20:25:43,269 INFO [train.py:996] (0/4) Epoch 3, batch 30050, loss[loss=0.2182, simple_loss=0.3497, pruned_loss=0.04338, over 19825.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3521, pruned_loss=0.1069, over 4285442.91 frames. ], batch size: 702, lr: 9.99e-03, grad_scale: 32.0 2023-06-19 20:25:52,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=546234.0, ans=0.125 2023-06-19 20:25:53,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=546234.0, ans=0.1 2023-06-19 20:26:22,056 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=15.0 2023-06-19 20:27:23,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=12.0 2023-06-19 20:27:23,649 INFO [train.py:996] (0/4) Epoch 3, batch 30100, loss[loss=0.2694, simple_loss=0.3164, pruned_loss=0.1112, over 21867.00 frames. 
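At batch 30000 the loop pauses to compute a validation loss ("Computing validation loss" / "validation: loss=0.254 ... over 1796401.00 frames") and reports the peak GPU memory allocated so far; the same fixed dev set (1796401 frames) is scored each time, so successive validation losses are directly comparable. A hedged sketch of that pattern (names are illustrative, not the actual helpers in train.py):

    import torch

    def maybe_validate(model, valid_dl, batch_idx, valid_interval, device="cuda:0"):
        """Every valid_interval batches: switch to eval mode, average the loss over
        the whole dev loader (frame-weighted), then resume training mode."""
        if batch_idx % valid_interval != 0:
            return None
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for features, targets, num_frames in valid_dl:
                loss = model(features.to(device), targets.to(device))
                tot_loss += loss.item() * num_frames
                tot_frames += num_frames
        model.train()
        peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        # e.g. "validation: loss=0.254 ... over 1796401.00 frames" and
        #      "Maximum memory allocated so far is 24341MB"
        return tot_loss / tot_frames, peak_mb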
], tot_loss[loss=0.2795, simple_loss=0.348, pruned_loss=0.1056, over 4275919.44 frames. ], batch size: 98, lr: 9.99e-03, grad_scale: 32.0 2023-06-19 20:27:34,607 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=15.0 2023-06-19 20:27:46,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.970e+02 3.475e+02 4.229e+02 7.609e+02, threshold=6.950e+02, percent-clipped=3.0 2023-06-19 20:28:02,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=546594.0, ans=0.125 2023-06-19 20:28:18,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=546654.0, ans=0.0 2023-06-19 20:29:11,582 INFO [train.py:996] (0/4) Epoch 3, batch 30150, loss[loss=0.311, simple_loss=0.3443, pruned_loss=0.1388, over 20136.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3438, pruned_loss=0.1075, over 4257684.72 frames. ], batch size: 702, lr: 9.98e-03, grad_scale: 32.0 2023-06-19 20:29:12,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=546834.0, ans=0.125 2023-06-19 20:31:01,137 INFO [train.py:996] (0/4) Epoch 3, batch 30200, loss[loss=0.2856, simple_loss=0.3538, pruned_loss=0.1087, over 21729.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.345, pruned_loss=0.1055, over 4261055.29 frames. ], batch size: 124, lr: 9.98e-03, grad_scale: 32.0 2023-06-19 20:31:06,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.83 vs. limit=15.0 2023-06-19 20:31:20,347 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.884e+02 3.477e+02 4.360e+02 6.992e+02, threshold=6.953e+02, percent-clipped=1.0 2023-06-19 20:32:40,596 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0 2023-06-19 20:32:45,976 INFO [train.py:996] (0/4) Epoch 3, batch 30250, loss[loss=0.3939, simple_loss=0.4584, pruned_loss=0.1647, over 21532.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3544, pruned_loss=0.1081, over 4264951.39 frames. ], batch size: 471, lr: 9.98e-03, grad_scale: 32.0 2023-06-19 20:32:49,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=547434.0, ans=0.1 2023-06-19 20:33:09,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=547494.0, ans=0.125 2023-06-19 20:33:15,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=547494.0, ans=0.0 2023-06-19 20:33:54,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=547614.0, ans=0.1 2023-06-19 20:34:17,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-19 20:34:29,328 INFO [train.py:996] (0/4) Epoch 3, batch 30300, loss[loss=0.2561, simple_loss=0.3111, pruned_loss=0.1005, over 21746.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3519, pruned_loss=0.1077, over 4265961.57 frames. 
], batch size: 317, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:34:36,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=547734.0, ans=0.025 2023-06-19 20:34:52,628 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 3.189e+02 3.746e+02 4.977e+02 8.102e+02, threshold=7.493e+02, percent-clipped=4.0 2023-06-19 20:35:03,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=547794.0, ans=0.0 2023-06-19 20:35:15,454 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:35:49,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=547914.0, ans=0.025 2023-06-19 20:36:13,963 INFO [train.py:996] (0/4) Epoch 3, batch 30350, loss[loss=0.3648, simple_loss=0.4281, pruned_loss=0.1508, over 21569.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.349, pruned_loss=0.1084, over 4263310.70 frames. ], batch size: 473, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:36:23,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=548034.0, ans=0.2 2023-06-19 20:36:37,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=548034.0, ans=0.0 2023-06-19 20:36:50,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=548094.0, ans=0.0 2023-06-19 20:36:53,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=548154.0, ans=0.1 2023-06-19 20:37:43,113 INFO [train.py:996] (0/4) Epoch 3, batch 30400, loss[loss=0.2681, simple_loss=0.3122, pruned_loss=0.112, over 20274.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3406, pruned_loss=0.1055, over 4250176.00 frames. ], batch size: 703, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:37:59,841 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 3.466e+02 4.166e+02 5.135e+02 9.055e+02, threshold=8.331e+02, percent-clipped=4.0 2023-06-19 20:39:04,641 INFO [train.py:996] (0/4) Epoch 3, batch 30450, loss[loss=0.3343, simple_loss=0.4155, pruned_loss=0.1266, over 19939.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3434, pruned_loss=0.1063, over 4193503.29 frames. 
], batch size: 702, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:39:06,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=548634.0, ans=0.125 2023-06-19 20:39:23,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=548694.0, ans=0.125 2023-06-19 20:39:26,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=548694.0, ans=22.5 2023-06-19 20:39:54,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=548814.0, ans=0.0 2023-06-19 20:39:55,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=548814.0, ans=0.1 2023-06-19 20:39:55,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=548814.0, ans=0.125 2023-06-19 20:40:02,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=548814.0, ans=0.04949747468305833 2023-06-19 20:40:07,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=548874.0, ans=0.05 2023-06-19 20:40:09,613 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-19 20:40:14,235 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/epoch-3.pt 2023-06-19 20:41:58,149 INFO [train.py:996] (0/4) Epoch 4, batch 0, loss[loss=0.3192, simple_loss=0.3629, pruned_loss=0.1377, over 21537.00 frames. ], tot_loss[loss=0.3192, simple_loss=0.3629, pruned_loss=0.1377, over 21537.00 frames. ], batch size: 391, lr: 8.60e-03, grad_scale: 32.0 2023-06-19 20:41:58,150 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 20:42:15,979 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2612, simple_loss=0.3698, pruned_loss=0.07632, over 1796401.00 frames. 2023-06-19 20:42:15,980 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-19 20:42:45,423 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.032e+02 5.518e+02 8.293e+02 1.240e+03 3.012e+03, threshold=1.659e+03, percent-clipped=49.0 2023-06-19 20:42:51,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-19 20:43:06,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.39 vs. limit=15.0 2023-06-19 20:43:20,403 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-06-19 20:43:48,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=549144.0, ans=0.5 2023-06-19 20:43:52,605 INFO [train.py:996] (0/4) Epoch 4, batch 50, loss[loss=0.2699, simple_loss=0.3443, pruned_loss=0.09775, over 21597.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.3498, pruned_loss=0.1077, over 963698.75 frames. 
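At the epoch boundary above, checkpoint.py writes zipformer/exp_L_small_causal/epoch-3.pt before Epoch 4, batch 0 begins with its own validation pass. A minimal sketch of per-epoch checkpointing of model, optimizer, scheduler and scaler state (illustrative; the real checkpoint.py saves more, e.g. sampler state and parameter averages):

    from pathlib import Path
    import torch

    def save_epoch_checkpoint(exp_dir, epoch, model, optimizer, scheduler, scaler):
        """Write exp_dir/epoch-{epoch}.pt with the state needed to resume training."""
        exp_dir = Path(exp_dir)
        exp_dir.mkdir(parents=True, exist_ok=True)
        checkpoint = {
            "epoch": epoch,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "grad_scaler": scaler.state_dict(),
        }
        torch.save(checkpoint, exp_dir / f"epoch-{epoch}.pt")

    # e.g. save_epoch_checkpoint("zipformer/exp_L_small_causal", 3,
    #                            model, optimizer, scheduler, scaler)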
], batch size: 230, lr: 8.60e-03, grad_scale: 32.0 2023-06-19 20:44:09,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=549204.0, ans=0.125 2023-06-19 20:45:04,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=549384.0, ans=0.04949747468305833 2023-06-19 20:45:22,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=549444.0, ans=0.1 2023-06-19 20:45:30,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=549444.0, ans=0.125 2023-06-19 20:45:33,228 INFO [train.py:996] (0/4) Epoch 4, batch 100, loss[loss=0.3192, simple_loss=0.4087, pruned_loss=0.1149, over 21455.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3691, pruned_loss=0.11, over 1692405.05 frames. ], batch size: 211, lr: 8.60e-03, grad_scale: 32.0 2023-06-19 20:45:54,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=549564.0, ans=0.1 2023-06-19 20:46:08,878 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.893e+02 3.441e+02 3.943e+02 7.428e+02, threshold=6.883e+02, percent-clipped=0.0 2023-06-19 20:46:11,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=549564.0, ans=0.0 2023-06-19 20:46:12,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=549564.0, ans=0.125 2023-06-19 20:46:24,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=549624.0, ans=0.2 2023-06-19 20:46:28,105 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.88 vs. limit=6.0 2023-06-19 20:46:56,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=549744.0, ans=0.04949747468305833 2023-06-19 20:46:59,937 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:47:09,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=549744.0, ans=0.1 2023-06-19 20:47:13,446 INFO [train.py:996] (0/4) Epoch 4, batch 150, loss[loss=0.3147, simple_loss=0.3971, pruned_loss=0.1162, over 21820.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3659, pruned_loss=0.1073, over 2266771.51 frames. ], batch size: 316, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:47:49,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=549864.0, ans=0.05 2023-06-19 20:47:54,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=549924.0, ans=10.0 2023-06-19 20:48:35,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=549984.0, ans=0.0 2023-06-19 20:48:38,339 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.21 vs. 
limit=22.5 2023-06-19 20:48:41,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=550044.0, ans=0.125 2023-06-19 20:48:42,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=550044.0, ans=0.125 2023-06-19 20:48:53,410 INFO [train.py:996] (0/4) Epoch 4, batch 200, loss[loss=0.3164, simple_loss=0.3737, pruned_loss=0.1296, over 21482.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3605, pruned_loss=0.1059, over 2716223.45 frames. ], batch size: 131, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:48:54,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=22.5 2023-06-19 20:49:13,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=550164.0, ans=0.0 2023-06-19 20:49:29,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.787e+02 3.303e+02 4.395e+02 6.398e+02, threshold=6.606e+02, percent-clipped=0.0 2023-06-19 20:49:33,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=550164.0, ans=0.0 2023-06-19 20:49:34,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=550224.0, ans=0.2 2023-06-19 20:49:36,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=550224.0, ans=0.0 2023-06-19 20:49:42,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=550224.0, ans=0.125 2023-06-19 20:49:56,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=550284.0, ans=0.0 2023-06-19 20:50:20,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-06-19 20:50:35,698 INFO [train.py:996] (0/4) Epoch 4, batch 250, loss[loss=0.2451, simple_loss=0.3045, pruned_loss=0.09288, over 21736.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.356, pruned_loss=0.1058, over 3059332.35 frames. ], batch size: 124, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:51:00,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=550464.0, ans=0.0 2023-06-19 20:51:24,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=550524.0, ans=0.125 2023-06-19 20:52:19,241 INFO [train.py:996] (0/4) Epoch 4, batch 300, loss[loss=0.2488, simple_loss=0.3355, pruned_loss=0.08112, over 21696.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.3519, pruned_loss=0.1058, over 3323191.11 frames. 
], batch size: 298, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:52:30,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=550704.0, ans=0.1 2023-06-19 20:52:57,506 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 3.088e+02 3.665e+02 5.063e+02 1.079e+03, threshold=7.330e+02, percent-clipped=8.0 2023-06-19 20:52:58,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=550764.0, ans=0.125 2023-06-19 20:54:05,669 INFO [train.py:996] (0/4) Epoch 4, batch 350, loss[loss=0.2767, simple_loss=0.37, pruned_loss=0.09175, over 21649.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.347, pruned_loss=0.106, over 3539563.65 frames. ], batch size: 389, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:54:56,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=551124.0, ans=0.0 2023-06-19 20:55:00,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=551124.0, ans=0.0 2023-06-19 20:55:52,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.02 vs. limit=15.0 2023-06-19 20:55:54,948 INFO [train.py:996] (0/4) Epoch 4, batch 400, loss[loss=0.2475, simple_loss=0.2957, pruned_loss=0.09971, over 21245.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3383, pruned_loss=0.1021, over 3699748.84 frames. ], batch size: 160, lr: 8.58e-03, grad_scale: 32.0 2023-06-19 20:56:25,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=22.5 2023-06-19 20:56:26,011 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.883e+02 3.575e+02 4.503e+02 7.615e+02, threshold=7.149e+02, percent-clipped=2.0 2023-06-19 20:56:35,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=551364.0, ans=0.125 2023-06-19 20:56:44,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=551424.0, ans=0.1 2023-06-19 20:57:12,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=551484.0, ans=0.0 2023-06-19 20:57:16,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=551544.0, ans=0.125 2023-06-19 20:57:36,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=551604.0, ans=0.125 2023-06-19 20:57:37,299 INFO [train.py:996] (0/4) Epoch 4, batch 450, loss[loss=0.2201, simple_loss=0.267, pruned_loss=0.08664, over 20261.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3342, pruned_loss=0.09999, over 3829307.73 frames. 
], batch size: 702, lr: 8.58e-03, grad_scale: 32.0 2023-06-19 20:58:00,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=551664.0, ans=0.125 2023-06-19 20:58:03,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=551664.0, ans=0.2 2023-06-19 20:58:22,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=551724.0, ans=0.125 2023-06-19 20:58:22,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=551724.0, ans=0.0 2023-06-19 20:58:34,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=551724.0, ans=0.1 2023-06-19 20:58:55,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=551784.0, ans=0.125 2023-06-19 20:59:19,208 INFO [train.py:996] (0/4) Epoch 4, batch 500, loss[loss=0.2699, simple_loss=0.3695, pruned_loss=0.08512, over 21634.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3352, pruned_loss=0.09853, over 3933852.07 frames. ], batch size: 441, lr: 8.58e-03, grad_scale: 32.0 2023-06-19 20:59:49,019 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-92000.pt 2023-06-19 20:59:53,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.333e+02 2.948e+02 3.424e+02 4.506e+02 6.960e+02, threshold=6.848e+02, percent-clipped=0.0 2023-06-19 21:00:50,393 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=15.0 2023-06-19 21:01:01,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=552204.0, ans=0.125 2023-06-19 21:01:02,302 INFO [train.py:996] (0/4) Epoch 4, batch 550, loss[loss=0.293, simple_loss=0.3944, pruned_loss=0.09576, over 21758.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3402, pruned_loss=0.09951, over 4020258.53 frames. ], batch size: 351, lr: 8.58e-03, grad_scale: 16.0 2023-06-19 21:01:11,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=552204.0, ans=0.2 2023-06-19 21:01:41,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=552264.0, ans=0.125 2023-06-19 21:01:49,855 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:02:04,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=552324.0, ans=0.125 2023-06-19 21:02:12,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=552384.0, ans=0.0 2023-06-19 21:02:45,461 INFO [train.py:996] (0/4) Epoch 4, batch 600, loss[loss=0.2835, simple_loss=0.3568, pruned_loss=0.1051, over 21683.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3445, pruned_loss=0.1, over 4076951.68 frames. 
], batch size: 247, lr: 8.57e-03, grad_scale: 16.0 2023-06-19 21:03:08,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=552564.0, ans=0.1 2023-06-19 21:03:17,437 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 3.276e+02 3.981e+02 4.951e+02 8.718e+02, threshold=7.962e+02, percent-clipped=3.0 2023-06-19 21:03:36,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=552624.0, ans=0.125 2023-06-19 21:03:38,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552624.0, ans=0.1 2023-06-19 21:03:44,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=552624.0, ans=0.125 2023-06-19 21:04:28,061 INFO [train.py:996] (0/4) Epoch 4, batch 650, loss[loss=0.2843, simple_loss=0.3421, pruned_loss=0.1132, over 21814.00 frames. ], tot_loss[loss=0.274, simple_loss=0.3451, pruned_loss=0.1015, over 4124587.45 frames. ], batch size: 102, lr: 8.57e-03, grad_scale: 16.0 2023-06-19 21:04:53,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=552864.0, ans=0.0 2023-06-19 21:04:54,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=552864.0, ans=0.0 2023-06-19 21:06:10,738 INFO [train.py:996] (0/4) Epoch 4, batch 700, loss[loss=0.2179, simple_loss=0.2759, pruned_loss=0.08, over 21634.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.344, pruned_loss=0.1021, over 4150839.63 frames. ], batch size: 247, lr: 8.57e-03, grad_scale: 16.0 2023-06-19 21:06:25,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=553164.0, ans=0.125 2023-06-19 21:06:40,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=553164.0, ans=0.0 2023-06-19 21:06:43,343 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.524e+02 3.407e+02 4.015e+02 5.310e+02 1.031e+03, threshold=8.030e+02, percent-clipped=3.0 2023-06-19 21:06:52,517 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=15.0 2023-06-19 21:07:19,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=553284.0, ans=0.2 2023-06-19 21:07:52,973 INFO [train.py:996] (0/4) Epoch 4, batch 750, loss[loss=0.2379, simple_loss=0.301, pruned_loss=0.08741, over 21752.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3428, pruned_loss=0.102, over 4180324.99 frames. 
], batch size: 298, lr: 8.57e-03, grad_scale: 16.0 2023-06-19 21:08:00,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=553404.0, ans=0.07 2023-06-19 21:08:03,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=553404.0, ans=0.125 2023-06-19 21:08:22,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=553464.0, ans=0.0 2023-06-19 21:08:49,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=553524.0, ans=0.0 2023-06-19 21:09:23,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=553644.0, ans=0.1 2023-06-19 21:09:34,473 INFO [train.py:996] (0/4) Epoch 4, batch 800, loss[loss=0.2757, simple_loss=0.3791, pruned_loss=0.08615, over 21341.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3404, pruned_loss=0.1026, over 4211980.49 frames. ], batch size: 548, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:09:48,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=553704.0, ans=0.2 2023-06-19 21:09:52,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=553764.0, ans=0.0 2023-06-19 21:10:07,073 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 3.089e+02 3.541e+02 4.418e+02 8.046e+02, threshold=7.083e+02, percent-clipped=1.0 2023-06-19 21:10:16,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-19 21:10:59,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-19 21:11:18,281 INFO [train.py:996] (0/4) Epoch 4, batch 850, loss[loss=0.2843, simple_loss=0.3449, pruned_loss=0.1119, over 21881.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3374, pruned_loss=0.1018, over 4229639.30 frames. ], batch size: 351, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:11:31,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=22.5 2023-06-19 21:11:35,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.50 vs. limit=10.0 2023-06-19 21:11:44,134 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-19 21:11:48,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=554064.0, ans=0.0 2023-06-19 21:13:02,687 INFO [train.py:996] (0/4) Epoch 4, batch 900, loss[loss=0.2477, simple_loss=0.3353, pruned_loss=0.08002, over 21790.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3368, pruned_loss=0.1019, over 4244932.64 frames. ], batch size: 332, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:13:39,044 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.50 vs. 
limit=15.0 2023-06-19 21:13:40,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 3.017e+02 3.559e+02 4.118e+02 8.031e+02, threshold=7.118e+02, percent-clipped=1.0 2023-06-19 21:13:46,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-19 21:13:55,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2023-06-19 21:14:32,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=554544.0, ans=0.0 2023-06-19 21:14:45,094 INFO [train.py:996] (0/4) Epoch 4, batch 950, loss[loss=0.2611, simple_loss=0.3327, pruned_loss=0.09472, over 21796.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3332, pruned_loss=0.0995, over 4260490.82 frames. ], batch size: 414, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:15:44,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554724.0, ans=0.1 2023-06-19 21:16:27,604 INFO [train.py:996] (0/4) Epoch 4, batch 1000, loss[loss=0.3078, simple_loss=0.3738, pruned_loss=0.1209, over 21851.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.333, pruned_loss=0.0999, over 4272922.20 frames. ], batch size: 316, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:16:37,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=554904.0, ans=0.125 2023-06-19 21:17:12,744 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.951e+02 3.502e+02 4.133e+02 7.133e+02, threshold=7.004e+02, percent-clipped=1.0 2023-06-19 21:17:59,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=555144.0, ans=0.0 2023-06-19 21:18:10,393 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=15.0 2023-06-19 21:18:15,456 INFO [train.py:996] (0/4) Epoch 4, batch 1050, loss[loss=0.2824, simple_loss=0.358, pruned_loss=0.1034, over 21835.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3329, pruned_loss=0.1, over 4282997.61 frames. ], batch size: 316, lr: 8.55e-03, grad_scale: 16.0 2023-06-19 21:18:24,702 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-19 21:18:34,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=555204.0, ans=0.5 2023-06-19 21:18:37,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=555264.0, ans=0.125 2023-06-19 21:19:58,904 INFO [train.py:996] (0/4) Epoch 4, batch 1100, loss[loss=0.2496, simple_loss=0.3355, pruned_loss=0.08184, over 21786.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3328, pruned_loss=0.09881, over 4279758.40 frames. 
], batch size: 371, lr: 8.55e-03, grad_scale: 16.0 2023-06-19 21:20:23,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=555564.0, ans=0.125 2023-06-19 21:20:39,663 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.35 vs. limit=12.0 2023-06-19 21:20:39,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 3.086e+02 3.737e+02 4.742e+02 7.537e+02, threshold=7.473e+02, percent-clipped=2.0 2023-06-19 21:21:19,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=555684.0, ans=0.125 2023-06-19 21:21:38,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=555744.0, ans=0.5 2023-06-19 21:21:43,859 INFO [train.py:996] (0/4) Epoch 4, batch 1150, loss[loss=0.2141, simple_loss=0.2565, pruned_loss=0.08589, over 16693.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3349, pruned_loss=0.09885, over 4284341.55 frames. ], batch size: 60, lr: 8.55e-03, grad_scale: 16.0 2023-06-19 21:21:59,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=555804.0, ans=0.0 2023-06-19 21:22:39,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=555924.0, ans=0.015 2023-06-19 21:23:02,806 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.67 vs. limit=12.0 2023-06-19 21:23:33,582 INFO [train.py:996] (0/4) Epoch 4, batch 1200, loss[loss=0.2896, simple_loss=0.3657, pruned_loss=0.1067, over 21501.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3356, pruned_loss=0.1002, over 4284163.15 frames. ], batch size: 131, lr: 8.55e-03, grad_scale: 32.0 2023-06-19 21:23:41,577 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-19 21:23:58,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-19 21:24:08,664 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 2.755e+02 3.087e+02 3.854e+02 6.716e+02, threshold=6.173e+02, percent-clipped=0.0 2023-06-19 21:24:19,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=556224.0, ans=0.125 2023-06-19 21:25:17,530 INFO [train.py:996] (0/4) Epoch 4, batch 1250, loss[loss=0.3025, simple_loss=0.3639, pruned_loss=0.1205, over 21693.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3369, pruned_loss=0.1011, over 4288803.68 frames. ], batch size: 351, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:25:20,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=556404.0, ans=0.125 2023-06-19 21:26:38,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=556644.0, ans=0.125 2023-06-19 21:27:02,104 INFO [train.py:996] (0/4) Epoch 4, batch 1300, loss[loss=0.2358, simple_loss=0.3186, pruned_loss=0.07656, over 21293.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3377, pruned_loss=0.1017, over 4289735.90 frames. 
], batch size: 176, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:27:36,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.941e+02 3.345e+02 4.151e+02 1.109e+03, threshold=6.689e+02, percent-clipped=6.0 2023-06-19 21:27:53,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=556824.0, ans=0.125 2023-06-19 21:28:17,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=556884.0, ans=0.125 2023-06-19 21:28:44,733 INFO [train.py:996] (0/4) Epoch 4, batch 1350, loss[loss=0.3061, simple_loss=0.3533, pruned_loss=0.1295, over 21792.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3378, pruned_loss=0.1023, over 4288220.27 frames. ], batch size: 507, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:28:50,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=557004.0, ans=0.0 2023-06-19 21:29:01,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=22.5 2023-06-19 21:29:36,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=557124.0, ans=0.0 2023-06-19 21:30:18,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=557244.0, ans=0.1 2023-06-19 21:30:27,840 INFO [train.py:996] (0/4) Epoch 4, batch 1400, loss[loss=0.2526, simple_loss=0.3002, pruned_loss=0.1025, over 15056.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3345, pruned_loss=0.1019, over 4288288.88 frames. ], batch size: 60, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:30:45,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=557304.0, ans=0.2 2023-06-19 21:31:03,563 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.007e+02 3.409e+02 4.154e+02 6.851e+02, threshold=6.817e+02, percent-clipped=4.0 2023-06-19 21:31:26,662 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:31:57,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=557544.0, ans=0.0 2023-06-19 21:32:18,705 INFO [train.py:996] (0/4) Epoch 4, batch 1450, loss[loss=0.3026, simple_loss=0.3559, pruned_loss=0.1246, over 21554.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3329, pruned_loss=0.1019, over 4291245.61 frames. ], batch size: 230, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:32:24,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=557604.0, ans=0.125 2023-06-19 21:32:25,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=557604.0, ans=0.2 2023-06-19 21:33:18,079 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=22.5 2023-06-19 21:34:02,935 INFO [train.py:996] (0/4) Epoch 4, batch 1500, loss[loss=0.2583, simple_loss=0.317, pruned_loss=0.09981, over 21203.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3363, pruned_loss=0.1036, over 4291254.36 frames. 
], batch size: 608, lr: 8.53e-03, grad_scale: 32.0 2023-06-19 21:34:15,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=557904.0, ans=0.0 2023-06-19 21:34:33,019 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.980e+02 3.543e+02 4.143e+02 6.339e+02, threshold=7.086e+02, percent-clipped=0.0 2023-06-19 21:34:41,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=15.0 2023-06-19 21:34:56,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=558024.0, ans=0.0 2023-06-19 21:35:03,903 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=22.5 2023-06-19 21:35:24,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=558084.0, ans=0.125 2023-06-19 21:35:29,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=558144.0, ans=0.0 2023-06-19 21:35:31,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=558144.0, ans=0.125 2023-06-19 21:35:49,416 INFO [train.py:996] (0/4) Epoch 4, batch 1550, loss[loss=0.1955, simple_loss=0.2893, pruned_loss=0.05083, over 21692.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3348, pruned_loss=0.1028, over 4285126.71 frames. ], batch size: 247, lr: 8.53e-03, grad_scale: 32.0 2023-06-19 21:36:16,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=558264.0, ans=0.0 2023-06-19 21:37:07,450 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-19 21:37:34,575 INFO [train.py:996] (0/4) Epoch 4, batch 1600, loss[loss=0.2494, simple_loss=0.3009, pruned_loss=0.09896, over 21078.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.334, pruned_loss=0.1009, over 4280463.12 frames. ], batch size: 143, lr: 8.53e-03, grad_scale: 32.0 2023-06-19 21:38:15,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 2.993e+02 3.386e+02 4.443e+02 8.016e+02, threshold=6.773e+02, percent-clipped=2.0 2023-06-19 21:38:25,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=558624.0, ans=0.0 2023-06-19 21:39:12,365 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-19 21:39:19,178 INFO [train.py:996] (0/4) Epoch 4, batch 1650, loss[loss=0.3046, simple_loss=0.3618, pruned_loss=0.1237, over 20775.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3335, pruned_loss=0.1008, over 4278296.21 frames. ], batch size: 607, lr: 8.53e-03, grad_scale: 32.0 2023-06-19 21:39:22,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-19 21:39:28,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. 
limit=10.0 2023-06-19 21:40:11,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=558924.0, ans=0.1 2023-06-19 21:40:18,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=558924.0, ans=0.0 2023-06-19 21:41:00,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=559044.0, ans=0.125 2023-06-19 21:41:05,521 INFO [train.py:996] (0/4) Epoch 4, batch 1700, loss[loss=0.2524, simple_loss=0.3107, pruned_loss=0.09702, over 21596.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3376, pruned_loss=0.1024, over 4282435.59 frames. ], batch size: 548, lr: 8.52e-03, grad_scale: 16.0 2023-06-19 21:41:12,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=559104.0, ans=0.125 2023-06-19 21:41:53,223 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.180e+02 2.877e+02 3.357e+02 4.119e+02 6.244e+02, threshold=6.713e+02, percent-clipped=0.0 2023-06-19 21:41:57,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=559224.0, ans=0.1 2023-06-19 21:42:01,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=559224.0, ans=0.0 2023-06-19 21:42:23,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=559284.0, ans=0.125 2023-06-19 21:42:34,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-19 21:42:41,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=559344.0, ans=0.125 2023-06-19 21:42:47,727 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0 2023-06-19 21:42:56,583 INFO [train.py:996] (0/4) Epoch 4, batch 1750, loss[loss=0.2097, simple_loss=0.2932, pruned_loss=0.06309, over 21801.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3386, pruned_loss=0.1012, over 4279301.03 frames. ], batch size: 282, lr: 8.52e-03, grad_scale: 16.0 2023-06-19 21:44:44,130 INFO [train.py:996] (0/4) Epoch 4, batch 1800, loss[loss=0.1569, simple_loss=0.204, pruned_loss=0.05486, over 21815.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3324, pruned_loss=0.09665, over 4283973.43 frames. 
], batch size: 102, lr: 8.52e-03, grad_scale: 16.0 2023-06-19 21:45:13,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=559764.0, ans=0.1 2023-06-19 21:45:27,959 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 3.069e+02 3.500e+02 4.481e+02 7.550e+02, threshold=6.999e+02, percent-clipped=2.0 2023-06-19 21:45:32,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=559824.0, ans=10.0 2023-06-19 21:45:36,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=559824.0, ans=0.125 2023-06-19 21:46:01,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=559884.0, ans=0.125 2023-06-19 21:46:34,183 INFO [train.py:996] (0/4) Epoch 4, batch 1850, loss[loss=0.2063, simple_loss=0.2937, pruned_loss=0.05948, over 21744.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3344, pruned_loss=0.09506, over 4285948.86 frames. ], batch size: 124, lr: 8.52e-03, grad_scale: 16.0 2023-06-19 21:46:37,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=560004.0, ans=0.02 2023-06-19 21:46:51,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=560064.0, ans=0.125 2023-06-19 21:48:15,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=560244.0, ans=0.125 2023-06-19 21:48:17,794 INFO [train.py:996] (0/4) Epoch 4, batch 1900, loss[loss=0.2769, simple_loss=0.346, pruned_loss=0.1039, over 21821.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3345, pruned_loss=0.09617, over 4289845.20 frames. ], batch size: 351, lr: 8.51e-03, grad_scale: 16.0 2023-06-19 21:48:24,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=560304.0, ans=0.125 2023-06-19 21:48:53,840 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.971e+02 3.385e+02 4.219e+02 8.098e+02, threshold=6.770e+02, percent-clipped=2.0 2023-06-19 21:49:51,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=560544.0, ans=0.09899494936611666 2023-06-19 21:50:02,271 INFO [train.py:996] (0/4) Epoch 4, batch 1950, loss[loss=0.255, simple_loss=0.3012, pruned_loss=0.1044, over 21821.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3326, pruned_loss=0.09643, over 4284731.71 frames. 
], batch size: 125, lr: 8.51e-03, grad_scale: 16.0 2023-06-19 21:50:21,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=560604.0, ans=0.1 2023-06-19 21:51:18,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=560784.0, ans=0.125 2023-06-19 21:51:23,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=560844.0, ans=0.0 2023-06-19 21:51:26,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=560844.0, ans=0.025 2023-06-19 21:51:28,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=560844.0, ans=0.05 2023-06-19 21:51:46,748 INFO [train.py:996] (0/4) Epoch 4, batch 2000, loss[loss=0.1789, simple_loss=0.2426, pruned_loss=0.05765, over 21341.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3256, pruned_loss=0.0938, over 4281404.77 frames. ], batch size: 159, lr: 8.51e-03, grad_scale: 32.0 2023-06-19 21:52:07,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=560964.0, ans=0.0 2023-06-19 21:52:24,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.002e+02 3.642e+02 4.364e+02 7.369e+02, threshold=7.284e+02, percent-clipped=1.0 2023-06-19 21:52:30,985 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:53:14,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.97 vs. limit=22.5 2023-06-19 21:53:30,409 INFO [train.py:996] (0/4) Epoch 4, batch 2050, loss[loss=0.2822, simple_loss=0.3453, pruned_loss=0.1095, over 21870.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3291, pruned_loss=0.09526, over 4281106.45 frames. ], batch size: 351, lr: 8.51e-03, grad_scale: 16.0 2023-06-19 21:53:49,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=561204.0, ans=0.125 2023-06-19 21:54:36,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=561384.0, ans=0.0 2023-06-19 21:55:20,888 INFO [train.py:996] (0/4) Epoch 4, batch 2100, loss[loss=0.2934, simple_loss=0.3657, pruned_loss=0.1106, over 21289.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3342, pruned_loss=0.09873, over 4288524.38 frames. ], batch size: 159, lr: 8.51e-03, grad_scale: 16.0 2023-06-19 21:55:33,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=561504.0, ans=0.2 2023-06-19 21:55:59,298 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 3.198e+02 3.847e+02 4.816e+02 7.420e+02, threshold=7.693e+02, percent-clipped=1.0 2023-06-19 21:56:16,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.59 vs. limit=22.5 2023-06-19 21:57:06,053 INFO [train.py:996] (0/4) Epoch 4, batch 2150, loss[loss=0.2724, simple_loss=0.3205, pruned_loss=0.1121, over 21863.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.334, pruned_loss=0.1001, over 4277900.27 frames. 
], batch size: 107, lr: 8.50e-03, grad_scale: 16.0 2023-06-19 21:57:20,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=561804.0, ans=0.125 2023-06-19 21:57:41,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=561864.0, ans=0.125 2023-06-19 21:57:41,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=561864.0, ans=0.125 2023-06-19 21:58:11,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=561984.0, ans=0.1 2023-06-19 21:58:18,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=561984.0, ans=0.125 2023-06-19 21:58:25,093 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=12.0 2023-06-19 21:58:50,862 INFO [train.py:996] (0/4) Epoch 4, batch 2200, loss[loss=0.2507, simple_loss=0.3371, pruned_loss=0.0821, over 21743.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3362, pruned_loss=0.09903, over 4276815.30 frames. ], batch size: 298, lr: 8.50e-03, grad_scale: 16.0 2023-06-19 21:58:53,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=562104.0, ans=0.0 2023-06-19 21:59:02,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=562104.0, ans=0.0 2023-06-19 21:59:03,341 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-19 21:59:14,798 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-06-19 21:59:28,566 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 3.061e+02 3.534e+02 4.711e+02 8.653e+02, threshold=7.068e+02, percent-clipped=2.0 2023-06-19 21:59:39,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=562224.0, ans=0.125 2023-06-19 21:59:44,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=562224.0, ans=0.125 2023-06-19 22:00:18,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=562344.0, ans=0.0 2023-06-19 22:00:29,279 INFO [train.py:996] (0/4) Epoch 4, batch 2250, loss[loss=0.2028, simple_loss=0.2634, pruned_loss=0.07116, over 21353.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3348, pruned_loss=0.09775, over 4278671.44 frames. ], batch size: 144, lr: 8.50e-03, grad_scale: 16.0 2023-06-19 22:02:01,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=562644.0, ans=0.125 2023-06-19 22:02:08,511 INFO [train.py:996] (0/4) Epoch 4, batch 2300, loss[loss=0.3662, simple_loss=0.4458, pruned_loss=0.1434, over 19749.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3316, pruned_loss=0.09829, over 4276890.31 frames. 
], batch size: 702, lr: 8.50e-03, grad_scale: 16.0 2023-06-19 22:02:29,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.03 vs. limit=6.0 2023-06-19 22:02:29,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.67 vs. limit=8.0 2023-06-19 22:02:51,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.061e+02 3.548e+02 4.710e+02 1.046e+03, threshold=7.097e+02, percent-clipped=5.0 2023-06-19 22:03:39,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=562944.0, ans=0.2 2023-06-19 22:03:55,578 INFO [train.py:996] (0/4) Epoch 4, batch 2350, loss[loss=0.2803, simple_loss=0.3425, pruned_loss=0.109, over 21676.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3296, pruned_loss=0.09988, over 4278131.25 frames. ], batch size: 332, lr: 8.49e-03, grad_scale: 16.0 2023-06-19 22:03:59,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=563004.0, ans=0.125 2023-06-19 22:04:09,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=563004.0, ans=0.0 2023-06-19 22:04:11,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=563064.0, ans=0.1 2023-06-19 22:04:19,240 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-19 22:04:58,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=563184.0, ans=0.0 2023-06-19 22:04:58,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=563184.0, ans=10.0 2023-06-19 22:05:02,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=563184.0, ans=0.125 2023-06-19 22:05:20,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=563184.0, ans=0.125 2023-06-19 22:05:39,379 INFO [train.py:996] (0/4) Epoch 4, batch 2400, loss[loss=0.3043, simple_loss=0.3604, pruned_loss=0.1241, over 21343.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3354, pruned_loss=0.1028, over 4264437.37 frames. ], batch size: 159, lr: 8.49e-03, grad_scale: 32.0 2023-06-19 22:06:23,726 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 3.093e+02 3.486e+02 4.537e+02 7.539e+02, threshold=6.972e+02, percent-clipped=1.0 2023-06-19 22:07:23,985 INFO [train.py:996] (0/4) Epoch 4, batch 2450, loss[loss=0.2731, simple_loss=0.3305, pruned_loss=0.1079, over 21778.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3406, pruned_loss=0.1054, over 4264371.20 frames. ], batch size: 124, lr: 8.49e-03, grad_scale: 32.0 2023-06-19 22:07:35,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.19 vs. 
limit=22.5 2023-06-19 22:08:16,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=563724.0, ans=0.2 2023-06-19 22:08:16,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=563724.0, ans=0.025 2023-06-19 22:08:22,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=563784.0, ans=0.0 2023-06-19 22:08:33,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=563784.0, ans=0.125 2023-06-19 22:09:01,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=563904.0, ans=0.125 2023-06-19 22:09:02,303 INFO [train.py:996] (0/4) Epoch 4, batch 2500, loss[loss=0.2579, simple_loss=0.3396, pruned_loss=0.08809, over 21640.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3362, pruned_loss=0.1032, over 4262751.19 frames. ], batch size: 247, lr: 8.49e-03, grad_scale: 32.0 2023-06-19 22:09:11,429 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-19 22:09:24,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=563964.0, ans=0.0 2023-06-19 22:09:45,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.863e+02 3.660e+02 4.293e+02 8.660e+02, threshold=7.321e+02, percent-clipped=2.0 2023-06-19 22:10:00,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=564084.0, ans=0.2 2023-06-19 22:10:30,380 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=22.5 2023-06-19 22:10:44,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-19 22:10:45,493 INFO [train.py:996] (0/4) Epoch 4, batch 2550, loss[loss=0.2872, simple_loss=0.3794, pruned_loss=0.0975, over 21294.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3336, pruned_loss=0.1008, over 4256319.86 frames. ], batch size: 548, lr: 8.49e-03, grad_scale: 32.0 2023-06-19 22:10:46,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=564204.0, ans=0.125 2023-06-19 22:12:02,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=564384.0, ans=0.1 2023-06-19 22:12:12,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=564444.0, ans=0.0 2023-06-19 22:12:29,237 INFO [train.py:996] (0/4) Epoch 4, batch 2600, loss[loss=0.3181, simple_loss=0.3687, pruned_loss=0.1338, over 21805.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3354, pruned_loss=0.1027, over 4264599.83 frames. 
], batch size: 118, lr: 8.48e-03, grad_scale: 32.0 2023-06-19 22:13:12,181 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.048e+02 3.693e+02 4.515e+02 8.330e+02, threshold=7.386e+02, percent-clipped=1.0 2023-06-19 22:13:21,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=564624.0, ans=0.0 2023-06-19 22:13:29,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=564684.0, ans=0.0 2023-06-19 22:13:31,185 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.06 vs. limit=10.0 2023-06-19 22:14:11,619 INFO [train.py:996] (0/4) Epoch 4, batch 2650, loss[loss=0.2377, simple_loss=0.3043, pruned_loss=0.08551, over 21842.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3344, pruned_loss=0.1038, over 4275197.86 frames. ], batch size: 332, lr: 8.48e-03, grad_scale: 32.0 2023-06-19 22:14:34,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=564864.0, ans=0.1 2023-06-19 22:15:28,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=564984.0, ans=15.0 2023-06-19 22:15:37,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=564984.0, ans=0.0 2023-06-19 22:15:57,009 INFO [train.py:996] (0/4) Epoch 4, batch 2700, loss[loss=0.2162, simple_loss=0.2733, pruned_loss=0.07952, over 21226.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.332, pruned_loss=0.1025, over 4256674.04 frames. ], batch size: 176, lr: 8.48e-03, grad_scale: 32.0 2023-06-19 22:15:59,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=565104.0, ans=0.0 2023-06-19 22:16:10,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-19 22:16:39,949 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 3.006e+02 3.494e+02 4.497e+02 9.129e+02, threshold=6.988e+02, percent-clipped=4.0 2023-06-19 22:17:21,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=565284.0, ans=0.125 2023-06-19 22:17:36,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=565344.0, ans=0.0 2023-06-19 22:17:40,807 INFO [train.py:996] (0/4) Epoch 4, batch 2750, loss[loss=0.3055, simple_loss=0.4278, pruned_loss=0.09159, over 20828.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3319, pruned_loss=0.1021, over 4251334.19 frames. ], batch size: 607, lr: 8.48e-03, grad_scale: 32.0 2023-06-19 22:18:14,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. 
limit=15.0 2023-06-19 22:18:18,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=565464.0, ans=0.0 2023-06-19 22:18:36,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=565524.0, ans=10.0 2023-06-19 22:19:07,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=15.0 2023-06-19 22:19:19,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=565644.0, ans=0.125 2023-06-19 22:19:32,222 INFO [train.py:996] (0/4) Epoch 4, batch 2800, loss[loss=0.2902, simple_loss=0.3706, pruned_loss=0.1049, over 21785.00 frames. ], tot_loss[loss=0.272, simple_loss=0.337, pruned_loss=0.1035, over 4262382.79 frames. ], batch size: 332, lr: 8.47e-03, grad_scale: 32.0 2023-06-19 22:20:00,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=565764.0, ans=0.05 2023-06-19 22:20:17,429 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.329e+02 3.042e+02 3.463e+02 4.341e+02 7.810e+02, threshold=6.926e+02, percent-clipped=4.0 2023-06-19 22:20:27,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=565824.0, ans=0.1 2023-06-19 22:20:34,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-19 22:20:51,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=565944.0, ans=0.0 2023-06-19 22:21:01,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=565944.0, ans=0.1 2023-06-19 22:21:13,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=565944.0, ans=0.0 2023-06-19 22:21:16,502 INFO [train.py:996] (0/4) Epoch 4, batch 2850, loss[loss=0.291, simple_loss=0.3473, pruned_loss=0.1173, over 19971.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.336, pruned_loss=0.1039, over 4260378.32 frames. 
], batch size: 704, lr: 8.47e-03, grad_scale: 32.0 2023-06-19 22:21:33,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=566064.0, ans=0.125 2023-06-19 22:21:59,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=566124.0, ans=0.2 2023-06-19 22:22:01,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=566124.0, ans=0.125 2023-06-19 22:22:04,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=566124.0, ans=0.07 2023-06-19 22:22:32,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=566244.0, ans=0.0 2023-06-19 22:22:47,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=566244.0, ans=10.0 2023-06-19 22:22:59,616 INFO [train.py:996] (0/4) Epoch 4, batch 2900, loss[loss=0.361, simple_loss=0.4362, pruned_loss=0.1428, over 21624.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.332, pruned_loss=0.1024, over 4260586.15 frames. ], batch size: 441, lr: 8.47e-03, grad_scale: 16.0 2023-06-19 22:23:37,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=566424.0, ans=0.0 2023-06-19 22:23:45,068 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.998e+02 3.695e+02 4.530e+02 8.664e+02, threshold=7.390e+02, percent-clipped=3.0 2023-06-19 22:23:57,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=566424.0, ans=0.95 2023-06-19 22:24:38,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=566544.0, ans=0.0 2023-06-19 22:24:42,838 INFO [train.py:996] (0/4) Epoch 4, batch 2950, loss[loss=0.2696, simple_loss=0.3162, pruned_loss=0.1115, over 20200.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3352, pruned_loss=0.1032, over 4267383.34 frames. ], batch size: 703, lr: 8.47e-03, grad_scale: 16.0 2023-06-19 22:24:54,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=566604.0, ans=0.125 2023-06-19 22:25:43,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.59 vs. limit=15.0 2023-06-19 22:26:15,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=566844.0, ans=0.125 2023-06-19 22:26:16,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=566844.0, ans=0.05 2023-06-19 22:26:25,953 INFO [train.py:996] (0/4) Epoch 4, batch 3000, loss[loss=0.2917, simple_loss=0.3586, pruned_loss=0.1124, over 21796.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3415, pruned_loss=0.1054, over 4270962.42 frames. ], batch size: 332, lr: 8.47e-03, grad_scale: 16.0 2023-06-19 22:26:25,954 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-19 22:26:43,404 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2637, simple_loss=0.3577, pruned_loss=0.08486, over 1796401.00 frames. 
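Note on reading the loss fields in these records: the logged numbers are consistent with the reported "loss" being a weighted sum of the simple (trivial-joiner) transducer loss and the pruned RNN-T loss, with a weight of 0.5 on the simple term and full weight on the pruned term once warm-up is over; for the Epoch 4 validation record just above, 0.5 * 0.3577 + 0.08486 ≈ 0.2637. The snippet below is an illustrative sketch inferred from those numbers, not the actual training code; the helper name combine_losses and the assumption that the pruned-loss weight has ramped to 1.0 by this point (and that 0.5 corresponds to this run's simple_loss_scale) are mine.

# Illustrative sketch (not the exact icefall code): how the logged "loss"
# appears to relate to "simple_loss" and "pruned_loss" after warm-up.
SIMPLE_LOSS_SCALE = 0.5   # assumed weight on the simple loss in this run
PRUNED_LOSS_SCALE = 1.0   # assumed to have ramped up to 1.0 after warm-up

def combine_losses(simple_loss: float, pruned_loss: float) -> float:
    """Weighted total loss as it seems to be reported in the log records."""
    return SIMPLE_LOSS_SCALE * simple_loss + PRUNED_LOSS_SCALE * pruned_loss

# Check against the Epoch 4 validation record above:
#   loss=0.2637, simple_loss=0.3577, pruned_loss=0.08486
print(round(combine_losses(0.3577, 0.08486), 4))  # -> 0.2637

The same relation reproduces the per-batch records in this section (e.g. 0.5 * 0.3629 + 0.1377 ≈ 0.3192 for Epoch 4, batch 0). Similarly, in the optim.py "Clipping_scale=2.0" records the reported threshold appears to be the clipping scale times the median of the grad-norm quartiles, e.g. 2.0 * 3.441e+02 ≈ 6.883e+02, with percent-clipped giving the fraction of recent batches whose gradient norm exceeded that threshold.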
2023-06-19 22:26:43,404 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-19 22:27:29,289 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.065e+02 3.685e+02 4.308e+02 7.209e+02, threshold=7.369e+02, percent-clipped=0.0 2023-06-19 22:28:15,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=567144.0, ans=0.2 2023-06-19 22:28:27,677 INFO [train.py:996] (0/4) Epoch 4, batch 3050, loss[loss=0.1971, simple_loss=0.2871, pruned_loss=0.05353, over 21489.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3422, pruned_loss=0.1041, over 4278147.13 frames. ], batch size: 211, lr: 8.46e-03, grad_scale: 16.0 2023-06-19 22:29:10,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-06-19 22:29:59,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=15.0 2023-06-19 22:30:12,641 INFO [train.py:996] (0/4) Epoch 4, batch 3100, loss[loss=0.2577, simple_loss=0.3471, pruned_loss=0.08416, over 21681.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3419, pruned_loss=0.1032, over 4278298.85 frames. ], batch size: 298, lr: 8.46e-03, grad_scale: 16.0 2023-06-19 22:30:28,779 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.33 vs. limit=6.0 2023-06-19 22:30:52,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 3.250e+02 3.985e+02 4.690e+02 7.522e+02, threshold=7.970e+02, percent-clipped=1.0 2023-06-19 22:30:53,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=567624.0, ans=0.125 2023-06-19 22:32:03,261 INFO [train.py:996] (0/4) Epoch 4, batch 3150, loss[loss=0.2885, simple_loss=0.3486, pruned_loss=0.1142, over 21405.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3439, pruned_loss=0.1037, over 4276100.83 frames. ], batch size: 211, lr: 8.46e-03, grad_scale: 16.0 2023-06-19 22:32:10,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=567804.0, ans=0.0 2023-06-19 22:32:17,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=567804.0, ans=0.0 2023-06-19 22:32:27,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=567864.0, ans=0.125 2023-06-19 22:32:30,825 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:32:40,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=567924.0, ans=0.125 2023-06-19 22:33:21,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=567984.0, ans=0.2 2023-06-19 22:33:48,471 INFO [train.py:996] (0/4) Epoch 4, batch 3200, loss[loss=0.2822, simple_loss=0.3581, pruned_loss=0.1031, over 21800.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3447, pruned_loss=0.1032, over 4279923.90 frames. 
], batch size: 332, lr: 8.46e-03, grad_scale: 32.0 2023-06-19 22:34:34,212 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.110e+02 3.486e+02 4.566e+02 1.016e+03, threshold=6.972e+02, percent-clipped=1.0 2023-06-19 22:34:34,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=568224.0, ans=0.125 2023-06-19 22:35:11,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=568344.0, ans=0.125 2023-06-19 22:35:12,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=568344.0, ans=0.0 2023-06-19 22:35:18,880 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:35:23,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=568344.0, ans=0.1 2023-06-19 22:35:27,842 INFO [train.py:996] (0/4) Epoch 4, batch 3250, loss[loss=0.2637, simple_loss=0.3147, pruned_loss=0.1063, over 21539.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3424, pruned_loss=0.1042, over 4279803.20 frames. ], batch size: 441, lr: 8.45e-03, grad_scale: 32.0 2023-06-19 22:35:28,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=568404.0, ans=0.0 2023-06-19 22:35:34,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=568404.0, ans=0.125 2023-06-19 22:35:36,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=568404.0, ans=0.0 2023-06-19 22:35:58,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=568464.0, ans=0.1 2023-06-19 22:36:12,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=568524.0, ans=0.0 2023-06-19 22:36:55,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=568644.0, ans=0.0 2023-06-19 22:37:12,314 INFO [train.py:996] (0/4) Epoch 4, batch 3300, loss[loss=0.2402, simple_loss=0.3014, pruned_loss=0.08951, over 21896.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3366, pruned_loss=0.1043, over 4276920.68 frames. ], batch size: 125, lr: 8.45e-03, grad_scale: 32.0 2023-06-19 22:37:57,092 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.879e+02 3.455e+02 4.524e+02 7.307e+02, threshold=6.909e+02, percent-clipped=1.0 2023-06-19 22:38:55,636 INFO [train.py:996] (0/4) Epoch 4, batch 3350, loss[loss=0.3158, simple_loss=0.3688, pruned_loss=0.1315, over 21779.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3406, pruned_loss=0.1051, over 4282350.38 frames. 
], batch size: 124, lr: 8.45e-03, grad_scale: 32.0 2023-06-19 22:39:18,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=569004.0, ans=0.2 2023-06-19 22:39:29,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=569064.0, ans=0.5 2023-06-19 22:40:19,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=569184.0, ans=0.1 2023-06-19 22:40:21,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=569184.0, ans=0.125 2023-06-19 22:40:26,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=569244.0, ans=0.125 2023-06-19 22:40:50,348 INFO [train.py:996] (0/4) Epoch 4, batch 3400, loss[loss=0.2652, simple_loss=0.3206, pruned_loss=0.1049, over 21828.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3403, pruned_loss=0.1053, over 4284454.15 frames. ], batch size: 107, lr: 8.45e-03, grad_scale: 16.0 2023-06-19 22:40:51,705 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-19 22:40:54,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=569304.0, ans=0.1 2023-06-19 22:41:09,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=569364.0, ans=0.125 2023-06-19 22:41:37,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 3.071e+02 3.735e+02 4.641e+02 6.693e+02, threshold=7.470e+02, percent-clipped=0.0 2023-06-19 22:41:51,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=569484.0, ans=0.125 2023-06-19 22:42:01,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-19 22:42:08,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-19 22:42:15,418 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-19 22:42:29,564 INFO [train.py:996] (0/4) Epoch 4, batch 3450, loss[loss=0.2295, simple_loss=0.2956, pruned_loss=0.08172, over 21558.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3358, pruned_loss=0.1042, over 4280517.29 frames. ], batch size: 263, lr: 8.45e-03, grad_scale: 16.0 2023-06-19 22:42:35,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=569604.0, ans=0.125 2023-06-19 22:42:38,964 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.04 vs. 
limit=22.5 2023-06-19 22:43:21,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=569724.0, ans=0.125 2023-06-19 22:43:29,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=569784.0, ans=0.0 2023-06-19 22:43:42,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=569784.0, ans=0.0 2023-06-19 22:43:49,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2023-06-19 22:44:15,160 INFO [train.py:996] (0/4) Epoch 4, batch 3500, loss[loss=0.2385, simple_loss=0.3005, pruned_loss=0.08826, over 21220.00 frames. ], tot_loss[loss=0.2799, simple_loss=0.3448, pruned_loss=0.1075, over 4284354.89 frames. ], batch size: 608, lr: 8.44e-03, grad_scale: 16.0 2023-06-19 22:44:26,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=569904.0, ans=0.125 2023-06-19 22:44:44,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=569964.0, ans=0.125 2023-06-19 22:45:03,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 3.084e+02 3.677e+02 4.361e+02 8.360e+02, threshold=7.354e+02, percent-clipped=5.0 2023-06-19 22:45:20,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=570084.0, ans=0.125 2023-06-19 22:45:45,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=570144.0, ans=0.125 2023-06-19 22:45:57,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=570144.0, ans=0.1 2023-06-19 22:46:00,052 INFO [train.py:996] (0/4) Epoch 4, batch 3550, loss[loss=0.2536, simple_loss=0.3247, pruned_loss=0.09121, over 21635.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3488, pruned_loss=0.109, over 4278564.56 frames. ], batch size: 332, lr: 8.44e-03, grad_scale: 16.0 2023-06-19 22:46:29,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=570264.0, ans=0.2 2023-06-19 22:47:11,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=570384.0, ans=0.0 2023-06-19 22:47:33,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=570444.0, ans=0.1 2023-06-19 22:47:51,387 INFO [train.py:996] (0/4) Epoch 4, batch 3600, loss[loss=0.2616, simple_loss=0.3339, pruned_loss=0.09465, over 20050.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.342, pruned_loss=0.1077, over 4279077.89 frames. ], batch size: 702, lr: 8.44e-03, grad_scale: 32.0 2023-06-19 22:48:29,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.447e+02 3.242e+02 3.839e+02 4.789e+02 9.292e+02, threshold=7.677e+02, percent-clipped=2.0 2023-06-19 22:48:52,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. 
limit=6.0 2023-06-19 22:49:19,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=570744.0, ans=0.125 2023-06-19 22:49:33,692 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:49:34,918 INFO [train.py:996] (0/4) Epoch 4, batch 3650, loss[loss=0.2555, simple_loss=0.3222, pruned_loss=0.09444, over 21633.00 frames. ], tot_loss[loss=0.2801, simple_loss=0.3441, pruned_loss=0.1081, over 4280836.08 frames. ], batch size: 230, lr: 8.44e-03, grad_scale: 16.0 2023-06-19 22:50:08,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=570924.0, ans=0.125 2023-06-19 22:51:14,673 INFO [train.py:996] (0/4) Epoch 4, batch 3700, loss[loss=0.2757, simple_loss=0.3404, pruned_loss=0.1055, over 21884.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3411, pruned_loss=0.1063, over 4282463.17 frames. ], batch size: 371, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:51:28,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=571104.0, ans=0.0 2023-06-19 22:51:52,890 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.752e+02 3.200e+02 3.601e+02 6.077e+02, threshold=6.399e+02, percent-clipped=0.0 2023-06-19 22:52:03,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=22.5 2023-06-19 22:52:16,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=571284.0, ans=0.125 2023-06-19 22:52:18,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-19 22:52:32,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=571344.0, ans=0.2 2023-06-19 22:52:45,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-06-19 22:52:57,331 INFO [train.py:996] (0/4) Epoch 4, batch 3750, loss[loss=0.2045, simple_loss=0.2775, pruned_loss=0.06572, over 21403.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3393, pruned_loss=0.1057, over 4280296.00 frames. ], batch size: 194, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:53:02,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-19 22:53:15,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=571464.0, ans=0.125 2023-06-19 22:53:47,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=571524.0, ans=0.125 2023-06-19 22:54:05,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.82 vs. 
limit=6.0 2023-06-19 22:54:33,102 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:54:40,578 INFO [train.py:996] (0/4) Epoch 4, batch 3800, loss[loss=0.2216, simple_loss=0.3332, pruned_loss=0.05494, over 20786.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3377, pruned_loss=0.1034, over 4276620.79 frames. ], batch size: 608, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:55:27,671 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.805e+02 3.314e+02 3.828e+02 7.886e+02, threshold=6.628e+02, percent-clipped=5.0 2023-06-19 22:56:23,682 INFO [train.py:996] (0/4) Epoch 4, batch 3850, loss[loss=0.212, simple_loss=0.2772, pruned_loss=0.07336, over 21719.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3345, pruned_loss=0.1031, over 4275813.82 frames. ], batch size: 112, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:57:14,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=572124.0, ans=0.125 2023-06-19 22:58:06,834 INFO [train.py:996] (0/4) Epoch 4, batch 3900, loss[loss=0.239, simple_loss=0.3035, pruned_loss=0.08724, over 21853.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3305, pruned_loss=0.1031, over 4266568.63 frames. ], batch size: 107, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:58:21,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.57 vs. limit=10.0 2023-06-19 22:58:55,581 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.958e+02 3.677e+02 4.804e+02 9.279e+02, threshold=7.354e+02, percent-clipped=7.0 2023-06-19 22:59:18,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-19 22:59:51,720 INFO [train.py:996] (0/4) Epoch 4, batch 3950, loss[loss=0.2411, simple_loss=0.3529, pruned_loss=0.06472, over 19796.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3317, pruned_loss=0.1019, over 4268760.00 frames. ], batch size: 702, lr: 8.42e-03, grad_scale: 16.0 2023-06-19 23:00:20,598 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-19 23:00:32,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=572724.0, ans=0.125 2023-06-19 23:01:34,293 INFO [train.py:996] (0/4) Epoch 4, batch 4000, loss[loss=0.223, simple_loss=0.2729, pruned_loss=0.08656, over 21298.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3251, pruned_loss=0.09829, over 4274465.79 frames. ], batch size: 177, lr: 8.42e-03, grad_scale: 32.0 2023-06-19 23:01:41,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=572904.0, ans=0.125 2023-06-19 23:02:22,417 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 2.603e+02 3.194e+02 3.964e+02 9.151e+02, threshold=6.387e+02, percent-clipped=1.0 2023-06-19 23:02:30,923 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.35 vs. 
limit=22.5 2023-06-19 23:02:35,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=573084.0, ans=0.07 2023-06-19 23:03:07,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=573144.0, ans=0.125 2023-06-19 23:03:09,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-19 23:03:18,138 INFO [train.py:996] (0/4) Epoch 4, batch 4050, loss[loss=0.2463, simple_loss=0.3086, pruned_loss=0.092, over 21147.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3238, pruned_loss=0.09582, over 4270541.92 frames. ], batch size: 143, lr: 8.42e-03, grad_scale: 16.0 2023-06-19 23:03:20,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=573204.0, ans=0.125 2023-06-19 23:04:08,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=573324.0, ans=0.125 2023-06-19 23:04:26,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-19 23:04:57,151 INFO [train.py:996] (0/4) Epoch 4, batch 4100, loss[loss=0.3069, simple_loss=0.3696, pruned_loss=0.1221, over 21338.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3272, pruned_loss=0.09748, over 4279099.29 frames. ], batch size: 548, lr: 8.42e-03, grad_scale: 16.0 2023-06-19 23:05:01,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=573504.0, ans=0.0 2023-06-19 23:05:46,193 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.845e+02 3.334e+02 4.002e+02 7.963e+02, threshold=6.669e+02, percent-clipped=0.0 2023-06-19 23:06:35,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=573744.0, ans=0.1 2023-06-19 23:06:40,773 INFO [train.py:996] (0/4) Epoch 4, batch 4150, loss[loss=0.2634, simple_loss=0.3118, pruned_loss=0.1075, over 21263.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3294, pruned_loss=0.09628, over 4273203.51 frames. ], batch size: 548, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:07:27,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.83 vs. limit=15.0 2023-06-19 23:08:14,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=574044.0, ans=0.125 2023-06-19 23:08:25,551 INFO [train.py:996] (0/4) Epoch 4, batch 4200, loss[loss=0.2656, simple_loss=0.3304, pruned_loss=0.1004, over 21629.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3282, pruned_loss=0.09549, over 4262938.93 frames. ], batch size: 247, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:08:46,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=574104.0, ans=0.5 2023-06-19 23:09:14,143 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.32 vs. 
limit=15.0 2023-06-19 23:09:22,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=574224.0, ans=0.0 2023-06-19 23:09:26,363 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.689e+02 3.288e+02 4.795e+02 7.055e+02, threshold=6.577e+02, percent-clipped=3.0 2023-06-19 23:09:48,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=574284.0, ans=0.0 2023-06-19 23:09:55,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.87 vs. limit=10.0 2023-06-19 23:10:19,738 INFO [train.py:996] (0/4) Epoch 4, batch 4250, loss[loss=0.2776, simple_loss=0.3475, pruned_loss=0.1039, over 19983.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3368, pruned_loss=0.09772, over 4264940.48 frames. ], batch size: 702, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:11:05,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=574524.0, ans=0.0 2023-06-19 23:11:05,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-19 23:11:13,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=574524.0, ans=0.0 2023-06-19 23:12:06,310 INFO [train.py:996] (0/4) Epoch 4, batch 4300, loss[loss=0.3751, simple_loss=0.4466, pruned_loss=0.1518, over 21414.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3433, pruned_loss=0.1006, over 4266136.33 frames. ], batch size: 507, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:12:06,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=574704.0, ans=0.125 2023-06-19 23:12:08,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=574704.0, ans=0.0 2023-06-19 23:12:15,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=574704.0, ans=0.125 2023-06-19 23:12:40,025 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. 
limit=15.0 2023-06-19 23:12:41,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=574764.0, ans=0.125 2023-06-19 23:12:51,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=574824.0, ans=0.125 2023-06-19 23:12:57,769 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 2.886e+02 3.415e+02 4.755e+02 8.316e+02, threshold=6.829e+02, percent-clipped=3.0 2023-06-19 23:13:06,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=574884.0, ans=0.1 2023-06-19 23:13:31,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=574884.0, ans=0.025 2023-06-19 23:13:39,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-19 23:13:42,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=574944.0, ans=0.0 2023-06-19 23:14:00,222 INFO [train.py:996] (0/4) Epoch 4, batch 4350, loss[loss=0.2781, simple_loss=0.3313, pruned_loss=0.1124, over 21449.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3425, pruned_loss=0.09983, over 4267809.36 frames. ], batch size: 389, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:14:05,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=575004.0, ans=0.1 2023-06-19 23:15:11,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=575184.0, ans=0.0 2023-06-19 23:15:40,587 INFO [train.py:996] (0/4) Epoch 4, batch 4400, loss[loss=0.3124, simple_loss=0.3818, pruned_loss=0.1215, over 21370.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.34, pruned_loss=0.0996, over 4263560.01 frames. ], batch size: 549, lr: 8.40e-03, grad_scale: 32.0 2023-06-19 23:15:50,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.29 vs. 
limit=15.0 2023-06-19 23:15:54,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=575304.0, ans=0.125 2023-06-19 23:16:04,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=575364.0, ans=0.0 2023-06-19 23:16:10,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=575364.0, ans=0.1 2023-06-19 23:16:26,302 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.820e+02 3.325e+02 4.010e+02 7.079e+02, threshold=6.649e+02, percent-clipped=1.0 2023-06-19 23:16:26,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=575424.0, ans=0.5 2023-06-19 23:16:28,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=575424.0, ans=0.0 2023-06-19 23:17:12,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=575544.0, ans=0.125 2023-06-19 23:17:25,258 INFO [train.py:996] (0/4) Epoch 4, batch 4450, loss[loss=0.2671, simple_loss=0.3513, pruned_loss=0.09142, over 21581.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3468, pruned_loss=0.1009, over 4270367.96 frames. ], batch size: 230, lr: 8.40e-03, grad_scale: 32.0 2023-06-19 23:18:11,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=575724.0, ans=0.125 2023-06-19 23:18:45,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=575844.0, ans=0.125 2023-06-19 23:19:08,180 INFO [train.py:996] (0/4) Epoch 4, batch 4500, loss[loss=0.2595, simple_loss=0.3371, pruned_loss=0.09092, over 21235.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3485, pruned_loss=0.1029, over 4272406.84 frames. ], batch size: 159, lr: 8.40e-03, grad_scale: 16.0 2023-06-19 23:19:18,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=575904.0, ans=0.0 2023-06-19 23:19:38,688 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-96000.pt 2023-06-19 23:20:01,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.953e+02 3.681e+02 4.394e+02 8.500e+02, threshold=7.362e+02, percent-clipped=5.0 2023-06-19 23:20:17,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=576084.0, ans=0.0 2023-06-19 23:20:24,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=576084.0, ans=0.125 2023-06-19 23:20:53,480 INFO [train.py:996] (0/4) Epoch 4, batch 4550, loss[loss=0.3698, simple_loss=0.4171, pruned_loss=0.1613, over 21402.00 frames. ], tot_loss[loss=0.28, simple_loss=0.3523, pruned_loss=0.1039, over 4275344.90 frames. ], batch size: 471, lr: 8.40e-03, grad_scale: 16.0 2023-06-19 23:21:25,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. 
limit=22.5 2023-06-19 23:21:34,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=576264.0, ans=0.125 2023-06-19 23:21:54,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=8.0 2023-06-19 23:22:38,833 INFO [train.py:996] (0/4) Epoch 4, batch 4600, loss[loss=0.2467, simple_loss=0.3007, pruned_loss=0.09635, over 21165.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3534, pruned_loss=0.1062, over 4276434.86 frames. ], batch size: 608, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:23:34,660 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 2.974e+02 3.353e+02 4.220e+02 8.842e+02, threshold=6.706e+02, percent-clipped=3.0 2023-06-19 23:24:19,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=576744.0, ans=0.1 2023-06-19 23:24:21,970 INFO [train.py:996] (0/4) Epoch 4, batch 4650, loss[loss=0.2287, simple_loss=0.3045, pruned_loss=0.07645, over 21866.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3461, pruned_loss=0.1045, over 4282695.00 frames. ], batch size: 332, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:24:36,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=576804.0, ans=0.1 2023-06-19 23:24:41,316 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=15.0 2023-06-19 23:24:56,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=576864.0, ans=0.035 2023-06-19 23:24:56,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=576864.0, ans=0.1 2023-06-19 23:25:16,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=576924.0, ans=0.0 2023-06-19 23:25:17,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=576924.0, ans=0.1 2023-06-19 23:25:23,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.28 vs. limit=22.5 2023-06-19 23:26:00,223 INFO [train.py:996] (0/4) Epoch 4, batch 4700, loss[loss=0.2315, simple_loss=0.2965, pruned_loss=0.0832, over 21860.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3363, pruned_loss=0.1005, over 4278977.22 frames. ], batch size: 107, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:26:56,742 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.084e+02 3.825e+02 4.515e+02 8.128e+02, threshold=7.651e+02, percent-clipped=5.0 2023-06-19 23:27:42,054 INFO [train.py:996] (0/4) Epoch 4, batch 4750, loss[loss=0.2886, simple_loss=0.3383, pruned_loss=0.1194, over 21566.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3307, pruned_loss=0.1009, over 4274447.70 frames. ], batch size: 548, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:27:56,185 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.98 vs. 
limit=22.5 2023-06-19 23:28:10,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=577464.0, ans=0.1 2023-06-19 23:28:38,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=577524.0, ans=0.125 2023-06-19 23:28:46,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=577524.0, ans=0.2 2023-06-19 23:29:15,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=577644.0, ans=0.0 2023-06-19 23:29:27,880 INFO [train.py:996] (0/4) Epoch 4, batch 4800, loss[loss=0.308, simple_loss=0.401, pruned_loss=0.1075, over 21527.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3314, pruned_loss=0.1016, over 4283568.84 frames. ], batch size: 471, lr: 8.39e-03, grad_scale: 32.0 2023-06-19 23:29:56,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=577764.0, ans=0.125 2023-06-19 23:30:14,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=577764.0, ans=0.0 2023-06-19 23:30:25,436 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 3.016e+02 3.604e+02 4.520e+02 9.140e+02, threshold=7.207e+02, percent-clipped=2.0 2023-06-19 23:31:02,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=577944.0, ans=0.0 2023-06-19 23:31:11,099 INFO [train.py:996] (0/4) Epoch 4, batch 4850, loss[loss=0.2505, simple_loss=0.3135, pruned_loss=0.0938, over 21636.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3292, pruned_loss=0.1002, over 4279922.77 frames. ], batch size: 230, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:31:11,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=578004.0, ans=0.2 2023-06-19 23:31:57,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-19 23:32:53,796 INFO [train.py:996] (0/4) Epoch 4, batch 4900, loss[loss=0.2892, simple_loss=0.3655, pruned_loss=0.1064, over 21744.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3321, pruned_loss=0.1023, over 4289665.46 frames. ], batch size: 441, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:33:10,859 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-06-19 23:33:12,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.96 vs. limit=15.0 2023-06-19 23:33:22,614 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. 
limit=22.5 2023-06-19 23:33:50,456 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 3.075e+02 3.679e+02 4.552e+02 8.349e+02, threshold=7.359e+02, percent-clipped=3.0 2023-06-19 23:33:56,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=578424.0, ans=10.0 2023-06-19 23:34:06,391 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-19 23:34:37,038 INFO [train.py:996] (0/4) Epoch 4, batch 4950, loss[loss=0.2196, simple_loss=0.2914, pruned_loss=0.07388, over 21293.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3359, pruned_loss=0.1007, over 4284883.71 frames. ], batch size: 144, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:35:08,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=578664.0, ans=0.125 2023-06-19 23:35:28,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=578724.0, ans=0.0 2023-06-19 23:35:46,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=578784.0, ans=0.125 2023-06-19 23:36:07,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-19 23:36:19,082 INFO [train.py:996] (0/4) Epoch 4, batch 5000, loss[loss=0.3083, simple_loss=0.3704, pruned_loss=0.1231, over 21855.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3348, pruned_loss=0.09574, over 4286967.10 frames. ], batch size: 414, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:36:42,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-19 23:37:15,617 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.766e+02 3.352e+02 4.422e+02 7.725e+02, threshold=6.703e+02, percent-clipped=2.0 2023-06-19 23:37:34,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=579084.0, ans=0.125 2023-06-19 23:37:56,888 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-19 23:38:01,107 INFO [train.py:996] (0/4) Epoch 4, batch 5050, loss[loss=0.2305, simple_loss=0.3012, pruned_loss=0.07993, over 15254.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3364, pruned_loss=0.09779, over 4284052.68 frames. ], batch size: 61, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:38:08,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=579204.0, ans=0.04949747468305833 2023-06-19 23:38:25,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=579264.0, ans=0.0 2023-06-19 23:39:10,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.32 vs. 
limit=15.0 2023-06-19 23:39:16,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=579384.0, ans=0.125 2023-06-19 23:39:19,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=579384.0, ans=0.2 2023-06-19 23:39:43,709 INFO [train.py:996] (0/4) Epoch 4, batch 5100, loss[loss=0.25, simple_loss=0.3133, pruned_loss=0.09334, over 21778.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3351, pruned_loss=0.09813, over 4291416.45 frames. ], batch size: 247, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:39:47,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=579504.0, ans=0.2 2023-06-19 23:40:17,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=579564.0, ans=0.2 2023-06-19 23:40:22,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-19 23:40:39,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.860e+02 3.323e+02 3.950e+02 6.797e+02, threshold=6.645e+02, percent-clipped=1.0 2023-06-19 23:40:49,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=12.0 2023-06-19 23:40:53,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=579684.0, ans=0.0 2023-06-19 23:41:26,683 INFO [train.py:996] (0/4) Epoch 4, batch 5150, loss[loss=0.2665, simple_loss=0.3252, pruned_loss=0.1039, over 21837.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3327, pruned_loss=0.09828, over 4296450.84 frames. ], batch size: 298, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:41:44,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=579804.0, ans=0.125 2023-06-19 23:42:13,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=579864.0, ans=0.125 2023-06-19 23:42:32,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=579984.0, ans=0.125 2023-06-19 23:43:16,527 INFO [train.py:996] (0/4) Epoch 4, batch 5200, loss[loss=0.2771, simple_loss=0.3668, pruned_loss=0.09367, over 21776.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3327, pruned_loss=0.09888, over 4296347.89 frames. 
], batch size: 351, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:43:18,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=580104.0, ans=0.0 2023-06-19 23:43:48,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=580164.0, ans=0.1 2023-06-19 23:43:48,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=580164.0, ans=0.125 2023-06-19 23:44:10,736 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 2.847e+02 3.708e+02 4.367e+02 7.934e+02, threshold=7.417e+02, percent-clipped=2.0 2023-06-19 23:44:22,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=580284.0, ans=0.1 2023-06-19 23:44:47,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=580344.0, ans=0.125 2023-06-19 23:45:01,069 INFO [train.py:996] (0/4) Epoch 4, batch 5250, loss[loss=0.2777, simple_loss=0.3514, pruned_loss=0.102, over 21833.00 frames. ], tot_loss[loss=0.266, simple_loss=0.337, pruned_loss=0.09749, over 4297307.48 frames. ], batch size: 316, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:45:06,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=580404.0, ans=0.1 2023-06-19 23:46:17,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=580584.0, ans=0.0 2023-06-19 23:46:35,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=580644.0, ans=0.0 2023-06-19 23:46:41,709 INFO [train.py:996] (0/4) Epoch 4, batch 5300, loss[loss=0.3146, simple_loss=0.4376, pruned_loss=0.09579, over 19806.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.337, pruned_loss=0.0981, over 4299922.24 frames. ], batch size: 702, lr: 8.36e-03, grad_scale: 32.0 2023-06-19 23:46:43,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=580704.0, ans=0.2 2023-06-19 23:47:34,112 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.860e+02 3.383e+02 4.031e+02 8.552e+02, threshold=6.767e+02, percent-clipped=2.0 2023-06-19 23:47:38,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.75 vs. limit=6.0 2023-06-19 23:47:40,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-19 23:48:09,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=580944.0, ans=0.0 2023-06-19 23:48:23,143 INFO [train.py:996] (0/4) Epoch 4, batch 5350, loss[loss=0.2471, simple_loss=0.3065, pruned_loss=0.09383, over 21690.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3382, pruned_loss=0.1013, over 4304507.97 frames. 
], batch size: 230, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:48:38,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=581004.0, ans=0.035 2023-06-19 23:48:42,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-19 23:49:44,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=581244.0, ans=0.125 2023-06-19 23:50:10,525 INFO [train.py:996] (0/4) Epoch 4, batch 5400, loss[loss=0.2524, simple_loss=0.3213, pruned_loss=0.09172, over 21791.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3357, pruned_loss=0.1024, over 4312749.98 frames. ], batch size: 298, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:50:12,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=581304.0, ans=0.2 2023-06-19 23:50:19,856 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-19 23:50:52,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=581424.0, ans=0.2 2023-06-19 23:50:52,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=581424.0, ans=0.1 2023-06-19 23:50:52,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=581424.0, ans=0.5 2023-06-19 23:51:03,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=581424.0, ans=0.025 2023-06-19 23:51:04,839 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 3.139e+02 3.601e+02 4.345e+02 9.321e+02, threshold=7.202e+02, percent-clipped=3.0 2023-06-19 23:51:22,259 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:51:47,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=581544.0, ans=0.07 2023-06-19 23:51:54,863 INFO [train.py:996] (0/4) Epoch 4, batch 5450, loss[loss=0.2994, simple_loss=0.4, pruned_loss=0.09946, over 21779.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3325, pruned_loss=0.09893, over 4310275.93 frames. ], batch size: 351, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:53:05,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=581784.0, ans=0.125 2023-06-19 23:53:24,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=581844.0, ans=0.2 2023-06-19 23:53:25,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=581844.0, ans=0.1 2023-06-19 23:53:44,857 INFO [train.py:996] (0/4) Epoch 4, batch 5500, loss[loss=0.204, simple_loss=0.2784, pruned_loss=0.06482, over 21895.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3363, pruned_loss=0.09534, over 4305838.77 frames. 
], batch size: 98, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:53:49,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-19 23:53:53,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=581904.0, ans=0.125 2023-06-19 23:53:53,944 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-06-19 23:54:33,619 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.703e+02 3.148e+02 3.931e+02 6.952e+02, threshold=6.296e+02, percent-clipped=0.0 2023-06-19 23:55:30,375 INFO [train.py:996] (0/4) Epoch 4, batch 5550, loss[loss=0.2351, simple_loss=0.3237, pruned_loss=0.0732, over 21687.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3356, pruned_loss=0.09191, over 4296388.27 frames. ], batch size: 263, lr: 8.35e-03, grad_scale: 16.0 2023-06-19 23:55:53,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=582264.0, ans=0.025 2023-06-19 23:56:31,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=582324.0, ans=0.0 2023-06-19 23:56:54,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=582384.0, ans=0.125 2023-06-19 23:57:19,422 INFO [train.py:996] (0/4) Epoch 4, batch 5600, loss[loss=0.3232, simple_loss=0.439, pruned_loss=0.1037, over 19811.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3346, pruned_loss=0.08941, over 4292956.49 frames. ], batch size: 702, lr: 8.35e-03, grad_scale: 32.0 2023-06-19 23:57:28,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=582504.0, ans=0.125 2023-06-19 23:57:32,947 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:58:12,431 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.737e+02 3.310e+02 4.006e+02 7.274e+02, threshold=6.621e+02, percent-clipped=1.0 2023-06-19 23:58:14,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=582624.0, ans=10.0 2023-06-19 23:58:34,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=582684.0, ans=0.125 2023-06-19 23:58:55,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=582744.0, ans=0.0 2023-06-19 23:59:01,265 INFO [train.py:996] (0/4) Epoch 4, batch 5650, loss[loss=0.3196, simple_loss=0.3742, pruned_loss=0.1325, over 21754.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3387, pruned_loss=0.09214, over 4294971.09 frames. 
], batch size: 441, lr: 8.35e-03, grad_scale: 32.0 2023-06-19 23:59:05,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=582804.0, ans=0.125 2023-06-19 23:59:06,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=582804.0, ans=0.1 2023-06-19 23:59:16,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=582864.0, ans=0.0 2023-06-19 23:59:28,051 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:59:52,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=582924.0, ans=0.125 2023-06-20 00:00:01,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=582924.0, ans=0.125 2023-06-20 00:00:11,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=582984.0, ans=0.1 2023-06-20 00:00:23,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=582984.0, ans=0.025 2023-06-20 00:00:26,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=583044.0, ans=0.2 2023-06-20 00:00:32,599 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-06-20 00:00:42,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=583044.0, ans=0.125 2023-06-20 00:00:42,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=583044.0, ans=0.125 2023-06-20 00:00:44,950 INFO [train.py:996] (0/4) Epoch 4, batch 5700, loss[loss=0.2539, simple_loss=0.333, pruned_loss=0.08734, over 21652.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3377, pruned_loss=0.09409, over 4285802.30 frames. ], batch size: 263, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 00:00:52,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=583104.0, ans=0.0 2023-06-20 00:01:05,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=583164.0, ans=0.1 2023-06-20 00:01:14,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=583164.0, ans=0.1 2023-06-20 00:01:19,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. 
limit=12.0 2023-06-20 00:01:38,510 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.072e+02 3.794e+02 4.480e+02 7.487e+02, threshold=7.588e+02, percent-clipped=5.0 2023-06-20 00:01:54,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=583284.0, ans=0.125 2023-06-20 00:02:29,589 INFO [train.py:996] (0/4) Epoch 4, batch 5750, loss[loss=0.2581, simple_loss=0.3184, pruned_loss=0.09895, over 20040.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3314, pruned_loss=0.0899, over 4285116.74 frames. ], batch size: 702, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 00:02:33,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=583404.0, ans=0.125 2023-06-20 00:03:47,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=583644.0, ans=0.1 2023-06-20 00:04:13,596 INFO [train.py:996] (0/4) Epoch 4, batch 5800, loss[loss=0.225, simple_loss=0.304, pruned_loss=0.07299, over 21179.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3307, pruned_loss=0.08857, over 4281788.97 frames. ], batch size: 143, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:04:14,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=583704.0, ans=0.1 2023-06-20 00:04:24,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.96 vs. limit=15.0 2023-06-20 00:04:29,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=583764.0, ans=0.125 2023-06-20 00:05:02,495 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 2.603e+02 3.108e+02 3.966e+02 5.463e+02, threshold=6.216e+02, percent-clipped=0.0 2023-06-20 00:05:15,738 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-20 00:05:50,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=583944.0, ans=0.125 2023-06-20 00:05:53,803 INFO [train.py:996] (0/4) Epoch 4, batch 5850, loss[loss=0.1937, simple_loss=0.2889, pruned_loss=0.04925, over 21287.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3254, pruned_loss=0.08327, over 4284121.54 frames. 
], batch size: 176, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:06:19,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=584064.0, ans=0.0 2023-06-20 00:06:23,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584064.0, ans=0.1 2023-06-20 00:06:48,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=584124.0, ans=0.125 2023-06-20 00:07:31,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=584244.0, ans=0.05 2023-06-20 00:07:31,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=584244.0, ans=0.125 2023-06-20 00:07:36,938 INFO [train.py:996] (0/4) Epoch 4, batch 5900, loss[loss=0.2314, simple_loss=0.2966, pruned_loss=0.08309, over 21841.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3186, pruned_loss=0.07859, over 4286111.54 frames. ], batch size: 282, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:07:54,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=584304.0, ans=0.125 2023-06-20 00:07:54,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=584304.0, ans=0.0 2023-06-20 00:08:03,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=584364.0, ans=0.0 2023-06-20 00:08:08,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=584364.0, ans=0.125 2023-06-20 00:08:14,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=584364.0, ans=0.125 2023-06-20 00:08:29,711 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 2.549e+02 3.049e+02 3.679e+02 6.495e+02, threshold=6.098e+02, percent-clipped=1.0 2023-06-20 00:08:31,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584424.0, ans=0.1 2023-06-20 00:08:48,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=584484.0, ans=0.0 2023-06-20 00:09:28,649 INFO [train.py:996] (0/4) Epoch 4, batch 5950, loss[loss=0.2283, simple_loss=0.3005, pruned_loss=0.078, over 21649.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3193, pruned_loss=0.08222, over 4276928.66 frames. ], batch size: 230, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:10:54,212 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.73 vs. limit=22.5 2023-06-20 00:11:04,407 INFO [train.py:996] (0/4) Epoch 4, batch 6000, loss[loss=0.2701, simple_loss=0.3078, pruned_loss=0.1162, over 21320.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3183, pruned_loss=0.08709, over 4274620.09 frames. ], batch size: 473, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:11:04,410 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 00:11:26,261 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2686, simple_loss=0.3646, pruned_loss=0.08628, over 1796401.00 frames. 
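A "Computing validation loss" entry like the one above is a periodic pass over the dev set, after which the averaged loss and the peak GPU memory seen so far are printed. A rough sketch of that step, with illustrative function names rather than the recipe's actual API:

import torch

def compute_validation_loss(model, valid_loader, device="cuda:0"):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            loss, num_frames = model(batch)      # assumed to return (loss, frame count)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4f}")
    print(f"Maximum memory allocated so far is {peak_mb}MB")
    return tot_loss / tot_frames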
2023-06-20 00:11:26,262 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-20 00:12:01,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=585024.0, ans=0.0 2023-06-20 00:12:03,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-20 00:12:19,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 2.849e+02 3.273e+02 3.960e+02 7.085e+02, threshold=6.546e+02, percent-clipped=4.0 2023-06-20 00:13:00,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=585144.0, ans=0.125 2023-06-20 00:13:09,989 INFO [train.py:996] (0/4) Epoch 4, batch 6050, loss[loss=0.2717, simple_loss=0.3223, pruned_loss=0.1106, over 21803.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3155, pruned_loss=0.08919, over 4272867.47 frames. ], batch size: 107, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:13:53,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=22.5 2023-06-20 00:14:50,542 INFO [train.py:996] (0/4) Epoch 4, batch 6100, loss[loss=0.2284, simple_loss=0.3083, pruned_loss=0.07427, over 21656.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3132, pruned_loss=0.08674, over 4274592.32 frames. ], batch size: 263, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:15:14,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=12.0 2023-06-20 00:15:21,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=585624.0, ans=0.05 2023-06-20 00:15:43,388 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.539e+02 3.084e+02 3.751e+02 6.044e+02, threshold=6.168e+02, percent-clipped=0.0 2023-06-20 00:16:32,414 INFO [train.py:996] (0/4) Epoch 4, batch 6150, loss[loss=0.2246, simple_loss=0.2892, pruned_loss=0.08001, over 21116.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3172, pruned_loss=0.09109, over 4275974.03 frames. ], batch size: 159, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:16:41,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=585804.0, ans=0.125 2023-06-20 00:16:52,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=585864.0, ans=0.1 2023-06-20 00:17:05,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=585924.0, ans=0.07 2023-06-20 00:17:24,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=585924.0, ans=0.0 2023-06-20 00:18:10,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=586044.0, ans=0.0 2023-06-20 00:18:14,444 INFO [train.py:996] (0/4) Epoch 4, batch 6200, loss[loss=0.2461, simple_loss=0.3139, pruned_loss=0.08914, over 21290.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3189, pruned_loss=0.09128, over 4273851.81 frames. 
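The recurring "Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..." entries summarize recent gradient norms (the five numbers read like min, 25%, median, 75%, max) together with how often clipping was applied. The sketch below keeps a window of norms and clips at clipping_scale times the median; the exact threshold rule and the window size are assumptions, not the optimizer's actual implementation.

from collections import deque
import torch

class GradNormClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)        # recent total gradient norms
        self.num_clipped = 0
        self.num_seen = 0

    def clip_(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms.append(norm)
        self.num_seen += 1
        ordered = sorted(self.norms)
        quartiles = [ordered[int(q * (len(ordered) - 1))]
                     for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * quartiles[2]   # assumed: scale times median
        if norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)            # rescale grads to the threshold
        print(f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
              + " ".join(f"{q:.3e}" for q in quartiles)
              + f", threshold={threshold:.3e}, "
              + f"percent-clipped={100.0 * self.num_clipped / self.num_seen:.1f}")
        return norm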
], batch size: 143, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:18:56,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=586224.0, ans=0.125 2023-06-20 00:19:00,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=586224.0, ans=0.125 2023-06-20 00:19:04,320 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-20 00:19:08,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.711e+02 3.284e+02 3.994e+02 6.399e+02, threshold=6.568e+02, percent-clipped=2.0 2023-06-20 00:19:32,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=586284.0, ans=0.125 2023-06-20 00:19:41,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=586284.0, ans=0.125 2023-06-20 00:19:59,701 INFO [train.py:996] (0/4) Epoch 4, batch 6250, loss[loss=0.2755, simple_loss=0.3568, pruned_loss=0.09713, over 21618.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3245, pruned_loss=0.09169, over 4268401.64 frames. ], batch size: 230, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:20:00,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=586404.0, ans=0.125 2023-06-20 00:20:15,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=586464.0, ans=0.1 2023-06-20 00:20:29,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=586464.0, ans=0.1 2023-06-20 00:21:07,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-20 00:21:40,726 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:21:43,510 INFO [train.py:996] (0/4) Epoch 4, batch 6300, loss[loss=0.2403, simple_loss=0.3273, pruned_loss=0.07667, over 21659.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3276, pruned_loss=0.09001, over 4272358.01 frames. ], batch size: 230, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:21:48,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=586704.0, ans=0.2 2023-06-20 00:22:17,626 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.97 vs. limit=5.0 2023-06-20 00:22:42,478 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=22.5 2023-06-20 00:22:45,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.688e+02 3.149e+02 3.968e+02 6.842e+02, threshold=6.299e+02, percent-clipped=2.0 2023-06-20 00:22:47,088 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. 
limit=15.0 2023-06-20 00:23:08,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=586944.0, ans=0.1 2023-06-20 00:23:26,115 INFO [train.py:996] (0/4) Epoch 4, batch 6350, loss[loss=0.275, simple_loss=0.3305, pruned_loss=0.1098, over 21432.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3334, pruned_loss=0.09544, over 4280697.46 frames. ], batch size: 211, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:23:47,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=587064.0, ans=0.125 2023-06-20 00:23:50,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=587064.0, ans=0.0 2023-06-20 00:24:22,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=587124.0, ans=10.0 2023-06-20 00:24:36,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=587184.0, ans=0.125 2023-06-20 00:25:16,736 INFO [train.py:996] (0/4) Epoch 4, batch 6400, loss[loss=0.3189, simple_loss=0.3755, pruned_loss=0.1311, over 21450.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3412, pruned_loss=0.1006, over 4283638.18 frames. ], batch size: 471, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:26:11,105 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.474e+02 3.295e+02 3.771e+02 4.525e+02 8.192e+02, threshold=7.543e+02, percent-clipped=2.0 2023-06-20 00:26:21,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=587484.0, ans=0.1 2023-06-20 00:26:22,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=587484.0, ans=0.125 2023-06-20 00:26:28,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=587484.0, ans=0.1 2023-06-20 00:26:29,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=587484.0, ans=0.0 2023-06-20 00:26:35,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-20 00:26:59,891 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-06-20 00:27:05,318 INFO [train.py:996] (0/4) Epoch 4, batch 6450, loss[loss=0.2324, simple_loss=0.3214, pruned_loss=0.07172, over 21599.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3421, pruned_loss=0.0994, over 4285436.69 frames. ], batch size: 230, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:27:52,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=587724.0, ans=0.125 2023-06-20 00:28:42,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=587904.0, ans=0.035 2023-06-20 00:28:48,577 INFO [train.py:996] (0/4) Epoch 4, batch 6500, loss[loss=0.2512, simple_loss=0.3052, pruned_loss=0.09861, over 21474.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3333, pruned_loss=0.0978, over 4288542.45 frames. 
], batch size: 441, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 00:29:07,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=587904.0, ans=0.125 2023-06-20 00:29:10,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=587964.0, ans=0.0 2023-06-20 00:29:14,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=587964.0, ans=15.0 2023-06-20 00:29:35,994 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.671e+02 3.231e+02 3.777e+02 5.375e+02, threshold=6.462e+02, percent-clipped=0.0 2023-06-20 00:29:38,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=588024.0, ans=0.125 2023-06-20 00:29:39,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=588024.0, ans=0.125 2023-06-20 00:29:43,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=588084.0, ans=0.0 2023-06-20 00:29:49,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-20 00:30:27,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=588144.0, ans=0.125 2023-06-20 00:30:30,259 INFO [train.py:996] (0/4) Epoch 4, batch 6550, loss[loss=0.3669, simple_loss=0.4006, pruned_loss=0.1666, over 21636.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3318, pruned_loss=0.09709, over 4271685.41 frames. ], batch size: 507, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 00:30:50,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=588264.0, ans=0.07 2023-06-20 00:30:56,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=588264.0, ans=0.125 2023-06-20 00:31:02,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=588264.0, ans=0.125 2023-06-20 00:31:10,968 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=12.0 2023-06-20 00:31:21,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=588324.0, ans=0.2 2023-06-20 00:31:52,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=588444.0, ans=0.0 2023-06-20 00:32:13,067 INFO [train.py:996] (0/4) Epoch 4, batch 6600, loss[loss=0.278, simple_loss=0.3152, pruned_loss=0.1204, over 21357.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3262, pruned_loss=0.09668, over 4262857.05 frames. 
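The "ScheduledFloat: name=..., batch_count=..., ans=..." entries record hyper-parameters (dropout probabilities, skip rates, balancer bounds) whose values are functions of the global batch count rather than constants; "ans" is the value currently in effect. A minimal sketch of such a piecewise-linear schedule, with made-up breakpoints:

def scheduled_float(batch_count: float,
                    points=((0.0, 0.3), (20_000.0, 0.1))) -> float:
    """Linearly interpolate between (batch_count, value) breakpoints and
    hold the final value afterwards."""
    x0, y0 = points[0]
    if batch_count <= x0:
        return y0
    for x1, y1 in points[1:]:
        if batch_count <= x1:
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
        x0, y0 = x1, y1
    return y0

# e.g. a dropout that started higher and has settled at 0.1 by this point:
print(scheduled_float(0.0), scheduled_float(587_904.0))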
], batch size: 473, lr: 8.31e-03, grad_scale: 16.0 2023-06-20 00:32:39,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=588564.0, ans=0.125 2023-06-20 00:33:01,896 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.719e+02 3.222e+02 3.782e+02 6.837e+02, threshold=6.444e+02, percent-clipped=2.0 2023-06-20 00:33:03,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=588624.0, ans=0.125 2023-06-20 00:33:54,745 INFO [train.py:996] (0/4) Epoch 4, batch 6650, loss[loss=0.2524, simple_loss=0.3119, pruned_loss=0.09647, over 21603.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.32, pruned_loss=0.09249, over 4265341.06 frames. ], batch size: 415, lr: 8.31e-03, grad_scale: 16.0 2023-06-20 00:34:04,596 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=12.0 2023-06-20 00:34:25,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.05 vs. limit=6.0 2023-06-20 00:34:40,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=588924.0, ans=0.125 2023-06-20 00:35:06,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=588984.0, ans=0.125 2023-06-20 00:35:14,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=589044.0, ans=0.125 2023-06-20 00:35:28,582 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0 2023-06-20 00:35:31,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=589044.0, ans=0.0 2023-06-20 00:35:34,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=589044.0, ans=0.125 2023-06-20 00:35:36,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=589104.0, ans=0.1 2023-06-20 00:35:37,443 INFO [train.py:996] (0/4) Epoch 4, batch 6700, loss[loss=0.2585, simple_loss=0.3134, pruned_loss=0.1018, over 21806.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3156, pruned_loss=0.09223, over 4261100.19 frames. ], batch size: 352, lr: 8.31e-03, grad_scale: 16.0 2023-06-20 00:36:07,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=589164.0, ans=0.125 2023-06-20 00:36:26,059 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.795e+02 3.323e+02 4.034e+02 6.039e+02, threshold=6.647e+02, percent-clipped=0.0 2023-06-20 00:36:59,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=589344.0, ans=0.0 2023-06-20 00:37:11,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.64 vs. 
limit=15.0 2023-06-20 00:37:17,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=589404.0, ans=0.125 2023-06-20 00:37:18,660 INFO [train.py:996] (0/4) Epoch 4, batch 6750, loss[loss=0.2531, simple_loss=0.3095, pruned_loss=0.09835, over 21777.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3136, pruned_loss=0.09215, over 4256393.28 frames. ], batch size: 102, lr: 8.30e-03, grad_scale: 16.0 2023-06-20 00:37:47,623 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-20 00:38:50,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=589644.0, ans=0.1 2023-06-20 00:38:51,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=589644.0, ans=0.125 2023-06-20 00:38:54,431 INFO [train.py:996] (0/4) Epoch 4, batch 6800, loss[loss=0.22, simple_loss=0.3361, pruned_loss=0.05192, over 19753.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3157, pruned_loss=0.09501, over 4271678.93 frames. ], batch size: 702, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 00:39:04,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=589704.0, ans=0.0 2023-06-20 00:39:04,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=589704.0, ans=0.125 2023-06-20 00:39:13,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=589704.0, ans=10.0 2023-06-20 00:39:32,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=589824.0, ans=0.125 2023-06-20 00:39:38,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=589824.0, ans=0.125 2023-06-20 00:39:43,467 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.844e+02 3.168e+02 3.952e+02 7.008e+02, threshold=6.337e+02, percent-clipped=1.0 2023-06-20 00:40:35,925 INFO [train.py:996] (0/4) Epoch 4, batch 6850, loss[loss=0.2366, simple_loss=0.2921, pruned_loss=0.09053, over 21608.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3138, pruned_loss=0.09597, over 4266663.64 frames. ], batch size: 263, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 00:40:52,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=590064.0, ans=0.125 2023-06-20 00:41:02,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=590064.0, ans=0.1 2023-06-20 00:41:59,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=590244.0, ans=0.125 2023-06-20 00:42:20,513 INFO [train.py:996] (0/4) Epoch 4, batch 6900, loss[loss=0.2585, simple_loss=0.3509, pruned_loss=0.08306, over 21566.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3168, pruned_loss=0.09598, over 4272852.05 frames. 
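The "Whitening: ... metric=X vs. limit=Y" entries report how far the channel covariance of some activation has drifted from a scaled identity; values above the limit are what trigger the message. The metric below, trace(C @ C) divided by trace(C)**2 / num_channels (equal to 1.0 for perfectly white features), is one plausible choice and is an assumption, not necessarily the library's exact formula.

import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations; larger means less white."""
    x = x - x.mean(dim=0, keepdim=True)
    c = (x.t() @ x) / x.shape[0]                  # channel covariance matrix
    num_channels = c.shape[0]
    return (torch.trace(c @ c) / (torch.trace(c) ** 2 / num_channels)).item()

white = torch.randn(1000, 256)                         # roughly white features
collapsed = white[:, :1].repeat(1, 256) + 0.1 * white  # dominated by one direction
print(whitening_metric(white), "vs.", whitening_metric(collapsed))
# the first value is close to 1; the second is far larger and would be
# flagged against the module's configured limit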
], batch size: 471, lr: 8.30e-03, grad_scale: 16.0 2023-06-20 00:42:21,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=590304.0, ans=0.0 2023-06-20 00:43:22,483 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 3.145e+02 3.688e+02 5.056e+02 7.443e+02, threshold=7.376e+02, percent-clipped=5.0 2023-06-20 00:43:41,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=590544.0, ans=0.2 2023-06-20 00:44:03,280 INFO [train.py:996] (0/4) Epoch 4, batch 6950, loss[loss=0.2802, simple_loss=0.3448, pruned_loss=0.1078, over 21293.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3172, pruned_loss=0.09272, over 4277791.04 frames. ], batch size: 548, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:44:10,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=590604.0, ans=0.2 2023-06-20 00:44:16,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-06-20 00:44:54,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=590724.0, ans=0.2 2023-06-20 00:45:02,492 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:45:02,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=590724.0, ans=0.125 2023-06-20 00:45:50,577 INFO [train.py:996] (0/4) Epoch 4, batch 7000, loss[loss=0.2669, simple_loss=0.3155, pruned_loss=0.1091, over 21614.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3207, pruned_loss=0.09582, over 4276743.18 frames. ], batch size: 298, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:45:59,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-20 00:46:00,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=590904.0, ans=0.125 2023-06-20 00:46:46,593 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.182e+02 3.062e+02 3.465e+02 4.392e+02 8.171e+02, threshold=6.929e+02, percent-clipped=2.0 2023-06-20 00:47:33,153 INFO [train.py:996] (0/4) Epoch 4, batch 7050, loss[loss=0.2128, simple_loss=0.3048, pruned_loss=0.06043, over 21177.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3176, pruned_loss=0.09445, over 4258301.00 frames. ], batch size: 548, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:47:47,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=591204.0, ans=0.125 2023-06-20 00:48:29,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=591384.0, ans=0.125 2023-06-20 00:48:45,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=591384.0, ans=0.2 2023-06-20 00:49:11,736 INFO [train.py:996] (0/4) Epoch 4, batch 7100, loss[loss=0.2509, simple_loss=0.3269, pruned_loss=0.0875, over 21746.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3238, pruned_loss=0.09549, over 4261355.30 frames. 
], batch size: 332, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:49:15,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=591504.0, ans=0.125 2023-06-20 00:49:17,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=15.0 2023-06-20 00:49:53,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=591624.0, ans=0.125 2023-06-20 00:50:07,547 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.760e+02 3.326e+02 4.324e+02 6.991e+02, threshold=6.652e+02, percent-clipped=1.0 2023-06-20 00:50:24,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=591684.0, ans=0.0 2023-06-20 00:50:39,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=22.5 2023-06-20 00:50:53,174 INFO [train.py:996] (0/4) Epoch 4, batch 7150, loss[loss=0.3141, simple_loss=0.3731, pruned_loss=0.1275, over 21907.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3215, pruned_loss=0.09334, over 4268355.56 frames. ], batch size: 372, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:52:30,749 INFO [train.py:996] (0/4) Epoch 4, batch 7200, loss[loss=0.2509, simple_loss=0.3009, pruned_loss=0.1004, over 21573.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3247, pruned_loss=0.0962, over 4271024.78 frames. ], batch size: 247, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:52:56,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=592164.0, ans=0.125 2023-06-20 00:53:04,368 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:53:07,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=592164.0, ans=0.05 2023-06-20 00:53:24,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=592224.0, ans=0.125 2023-06-20 00:53:31,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.719e+02 3.108e+02 3.931e+02 6.174e+02, threshold=6.217e+02, percent-clipped=0.0 2023-06-20 00:53:49,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=592284.0, ans=0.125 2023-06-20 00:54:00,737 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:54:12,860 INFO [train.py:996] (0/4) Epoch 4, batch 7250, loss[loss=0.2713, simple_loss=0.3129, pruned_loss=0.1149, over 21368.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3205, pruned_loss=0.09538, over 4273682.30 frames. ], batch size: 177, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:54:48,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=592464.0, ans=0.0 2023-06-20 00:55:09,493 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.91 vs. 
limit=15.0 2023-06-20 00:55:11,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=592524.0, ans=0.1 2023-06-20 00:55:40,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=592644.0, ans=0.0 2023-06-20 00:55:51,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=592644.0, ans=0.125 2023-06-20 00:55:55,968 INFO [train.py:996] (0/4) Epoch 4, batch 7300, loss[loss=0.2274, simple_loss=0.2761, pruned_loss=0.08933, over 21197.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3145, pruned_loss=0.09406, over 4264641.52 frames. ], batch size: 176, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:56:52,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=592824.0, ans=0.2 2023-06-20 00:56:58,552 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.938e+02 3.597e+02 4.532e+02 8.618e+02, threshold=7.193e+02, percent-clipped=4.0 2023-06-20 00:57:26,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=592944.0, ans=0.125 2023-06-20 00:57:45,781 INFO [train.py:996] (0/4) Epoch 4, batch 7350, loss[loss=0.3427, simple_loss=0.3802, pruned_loss=0.1525, over 21444.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3132, pruned_loss=0.09516, over 4272933.04 frames. ], batch size: 510, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:58:09,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=593064.0, ans=0.125 2023-06-20 00:58:09,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=593064.0, ans=0.125 2023-06-20 00:58:16,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-20 00:59:20,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=593244.0, ans=0.1 2023-06-20 00:59:31,744 INFO [train.py:996] (0/4) Epoch 4, batch 7400, loss[loss=0.2452, simple_loss=0.3092, pruned_loss=0.0906, over 21472.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.321, pruned_loss=0.09882, over 4275764.39 frames. 
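The learning rate in the train entries drifts down slowly (8.35e-03 at batch 5700, about 8.28e-03 by batch 7300 above), which is consistent with a schedule that decays smoothly in both the global batch index and the epoch. The function below is a sketch in that spirit; the constants are placeholders and may not match the values used for this run.

def decaying_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

for batch in (95_000, 100_000, 105_000):
    print(batch, f"{decaying_lr(0.045, batch, epoch=3.5):.2e}")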
], batch size: 212, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:59:55,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=593364.0, ans=0.1 2023-06-20 01:00:12,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=593364.0, ans=0.125 2023-06-20 01:00:28,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=593424.0, ans=15.0 2023-06-20 01:00:28,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 3.006e+02 3.626e+02 4.126e+02 7.462e+02, threshold=7.252e+02, percent-clipped=1.0 2023-06-20 01:00:36,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=593484.0, ans=0.0 2023-06-20 01:01:15,465 INFO [train.py:996] (0/4) Epoch 4, batch 7450, loss[loss=0.2379, simple_loss=0.2903, pruned_loss=0.09275, over 21583.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3203, pruned_loss=0.09771, over 4280178.16 frames. ], batch size: 230, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:01:16,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=593604.0, ans=0.125 2023-06-20 01:01:23,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=593604.0, ans=0.125 2023-06-20 01:02:18,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=593784.0, ans=0.125 2023-06-20 01:02:26,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=593784.0, ans=0.125 2023-06-20 01:03:05,770 INFO [train.py:996] (0/4) Epoch 4, batch 7500, loss[loss=0.3, simple_loss=0.3949, pruned_loss=0.1026, over 21870.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3249, pruned_loss=0.09981, over 4268674.65 frames. ], batch size: 317, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:03:27,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=593964.0, ans=0.125 2023-06-20 01:04:09,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.163e+02 3.635e+02 4.766e+02 7.864e+02, threshold=7.270e+02, percent-clipped=2.0 2023-06-20 01:04:47,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.24 vs. limit=6.0 2023-06-20 01:04:51,506 INFO [train.py:996] (0/4) Epoch 4, batch 7550, loss[loss=0.2851, simple_loss=0.3718, pruned_loss=0.09921, over 21722.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3337, pruned_loss=0.09902, over 4265117.02 frames. 
], batch size: 298, lr: 8.27e-03, grad_scale: 16.0 2023-06-20 01:04:59,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=594204.0, ans=0.0 2023-06-20 01:05:09,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=594204.0, ans=0.2 2023-06-20 01:05:37,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=594324.0, ans=0.125 2023-06-20 01:06:03,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=594384.0, ans=0.1 2023-06-20 01:06:06,709 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-06-20 01:06:16,544 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-06-20 01:06:17,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=594444.0, ans=0.2 2023-06-20 01:06:24,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=594444.0, ans=0.125 2023-06-20 01:06:33,682 INFO [train.py:996] (0/4) Epoch 4, batch 7600, loss[loss=0.3287, simple_loss=0.3801, pruned_loss=0.1387, over 21861.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3315, pruned_loss=0.09742, over 4265413.32 frames. ], batch size: 107, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:07:27,136 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.700e+02 3.107e+02 3.746e+02 5.626e+02, threshold=6.215e+02, percent-clipped=0.0 2023-06-20 01:07:57,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=594744.0, ans=0.015 2023-06-20 01:08:17,612 INFO [train.py:996] (0/4) Epoch 4, batch 7650, loss[loss=0.2811, simple_loss=0.3349, pruned_loss=0.1136, over 21939.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3296, pruned_loss=0.09908, over 4274318.19 frames. ], batch size: 316, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:09:26,574 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.41 vs. limit=22.5 2023-06-20 01:09:34,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=594984.0, ans=0.0 2023-06-20 01:09:35,418 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-06-20 01:09:43,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=594984.0, ans=0.125 2023-06-20 01:10:03,433 INFO [train.py:996] (0/4) Epoch 4, batch 7700, loss[loss=0.2799, simple_loss=0.3586, pruned_loss=0.1006, over 16893.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.334, pruned_loss=0.1022, over 4279405.75 frames. 
], batch size: 60, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:11:08,880 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.825e+02 3.572e+02 4.383e+02 7.085e+02, threshold=7.144e+02, percent-clipped=3.0 2023-06-20 01:11:54,627 INFO [train.py:996] (0/4) Epoch 4, batch 7750, loss[loss=0.2422, simple_loss=0.3217, pruned_loss=0.08131, over 21433.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3388, pruned_loss=0.1026, over 4267880.72 frames. ], batch size: 131, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:12:47,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=595524.0, ans=0.05 2023-06-20 01:13:41,023 INFO [train.py:996] (0/4) Epoch 4, batch 7800, loss[loss=0.2261, simple_loss=0.2718, pruned_loss=0.09022, over 21172.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3401, pruned_loss=0.102, over 4267588.21 frames. ], batch size: 159, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:14:11,852 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-20 01:14:19,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=595764.0, ans=0.2 2023-06-20 01:14:23,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=595824.0, ans=0.0 2023-06-20 01:14:44,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 3.079e+02 3.630e+02 4.586e+02 7.709e+02, threshold=7.261e+02, percent-clipped=1.0 2023-06-20 01:14:58,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=595884.0, ans=0.125 2023-06-20 01:15:23,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=596004.0, ans=0.125 2023-06-20 01:15:24,362 INFO [train.py:996] (0/4) Epoch 4, batch 7850, loss[loss=0.2855, simple_loss=0.3288, pruned_loss=0.1211, over 21906.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3328, pruned_loss=0.1004, over 4252060.36 frames. ], batch size: 373, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:15:57,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=596064.0, ans=0.0 2023-06-20 01:16:06,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=596124.0, ans=0.125 2023-06-20 01:16:21,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=596124.0, ans=0.125 2023-06-20 01:17:10,737 INFO [train.py:996] (0/4) Epoch 4, batch 7900, loss[loss=0.2383, simple_loss=0.31, pruned_loss=0.08329, over 21403.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3282, pruned_loss=0.09907, over 4248759.30 frames. 
], batch size: 211, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:17:51,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=596364.0, ans=0.0 2023-06-20 01:18:13,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=596424.0, ans=0.125 2023-06-20 01:18:14,888 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.214e+02 3.696e+02 4.914e+02 8.338e+02, threshold=7.393e+02, percent-clipped=4.0 2023-06-20 01:18:43,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=596544.0, ans=0.1 2023-06-20 01:18:56,168 INFO [train.py:996] (0/4) Epoch 4, batch 7950, loss[loss=0.2588, simple_loss=0.3357, pruned_loss=0.09102, over 21932.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3284, pruned_loss=0.09735, over 4248940.56 frames. ], batch size: 316, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:19:54,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=596724.0, ans=0.2 2023-06-20 01:20:31,715 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.61 vs. limit=22.5 2023-06-20 01:20:53,178 INFO [train.py:996] (0/4) Epoch 4, batch 8000, loss[loss=0.3098, simple_loss=0.366, pruned_loss=0.1269, over 21326.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3346, pruned_loss=0.1001, over 4262098.33 frames. ], batch size: 548, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:20:59,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=596904.0, ans=0.125 2023-06-20 01:21:14,869 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-20 01:21:16,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=596964.0, ans=0.0 2023-06-20 01:21:21,944 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-06-20 01:21:47,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=597024.0, ans=0.125 2023-06-20 01:21:55,198 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.970e+02 3.290e+02 4.047e+02 5.946e+02, threshold=6.580e+02, percent-clipped=0.0 2023-06-20 01:22:46,661 INFO [train.py:996] (0/4) Epoch 4, batch 8050, loss[loss=0.2947, simple_loss=0.3696, pruned_loss=0.1099, over 21744.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3398, pruned_loss=0.1018, over 4269389.62 frames. ], batch size: 332, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:23:05,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=597264.0, ans=0.125 2023-06-20 01:24:02,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=597384.0, ans=0.125 2023-06-20 01:24:32,714 INFO [train.py:996] (0/4) Epoch 4, batch 8100, loss[loss=0.2618, simple_loss=0.3217, pruned_loss=0.1009, over 21681.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3385, pruned_loss=0.1025, over 4273519.30 frames. 
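The trailing "grad_scale: 32.0" or "grad_scale: 16.0" field in the train entries is the loss-scaling factor used for mixed-precision (fp16) training; it is reduced when gradient overflows are detected and can grow back later, which is why it changes between entries. A generic PyTorch AMP step showing where that number comes from (illustrative, not the recipe's exact loop):

import torch

def fp16_train_step(model, optimizer, scaler, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch)                  # forward pass in reduced precision
    scaler.scale(loss).backward()            # backward on the scaled loss
    scaler.step(optimizer)                   # unscales grads; skips the step on overflow
    scaler.update()                          # shrinks or grows the scale factor
    return loss.item(), scaler.get_scale()   # second value is the logged grad_scale

# typical setup: scaler = torch.cuda.amp.GradScaler(enabled=True)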
], batch size: 263, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:25:24,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=597624.0, ans=0.125 2023-06-20 01:25:33,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=597624.0, ans=0.2 2023-06-20 01:25:39,536 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.366e+02 3.127e+02 3.761e+02 5.016e+02 1.103e+03, threshold=7.523e+02, percent-clipped=9.0 2023-06-20 01:25:48,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=597684.0, ans=0.125 2023-06-20 01:26:05,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=597744.0, ans=0.0 2023-06-20 01:26:20,083 INFO [train.py:996] (0/4) Epoch 4, batch 8150, loss[loss=0.3062, simple_loss=0.3948, pruned_loss=0.1088, over 21694.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3462, pruned_loss=0.1031, over 4269465.64 frames. ], batch size: 414, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 01:26:20,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=597804.0, ans=0.025 2023-06-20 01:26:30,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=597804.0, ans=0.125 2023-06-20 01:26:36,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=597804.0, ans=0.125 2023-06-20 01:26:53,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=597864.0, ans=0.0 2023-06-20 01:27:08,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=597924.0, ans=0.0 2023-06-20 01:28:10,474 INFO [train.py:996] (0/4) Epoch 4, batch 8200, loss[loss=0.2931, simple_loss=0.3299, pruned_loss=0.1282, over 21351.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.339, pruned_loss=0.1004, over 4257628.52 frames. ], batch size: 473, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 01:28:38,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=598164.0, ans=0.0 2023-06-20 01:28:50,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=598164.0, ans=0.0 2023-06-20 01:29:11,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=598224.0, ans=0.015 2023-06-20 01:29:13,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.320e+02 2.961e+02 3.420e+02 4.432e+02 7.003e+02, threshold=6.840e+02, percent-clipped=0.0 2023-06-20 01:29:53,793 INFO [train.py:996] (0/4) Epoch 4, batch 8250, loss[loss=0.241, simple_loss=0.3252, pruned_loss=0.0784, over 21693.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3401, pruned_loss=0.1015, over 4263930.19 frames. 
], batch size: 247, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 01:30:01,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=598404.0, ans=0.0 2023-06-20 01:31:04,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.45 vs. limit=10.0 2023-06-20 01:31:09,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.10 vs. limit=15.0 2023-06-20 01:31:38,821 INFO [train.py:996] (0/4) Epoch 4, batch 8300, loss[loss=0.2279, simple_loss=0.2942, pruned_loss=0.08077, over 21240.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3372, pruned_loss=0.09882, over 4260536.84 frames. ], batch size: 159, lr: 8.24e-03, grad_scale: 16.0 2023-06-20 01:32:18,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=598764.0, ans=0.125 2023-06-20 01:32:25,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=598764.0, ans=0.1 2023-06-20 01:32:27,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=598824.0, ans=0.0 2023-06-20 01:32:30,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=598824.0, ans=0.0 2023-06-20 01:32:45,078 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.814e+02 3.368e+02 3.938e+02 8.477e+02, threshold=6.736e+02, percent-clipped=1.0 2023-06-20 01:32:50,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=598884.0, ans=0.125 2023-06-20 01:33:00,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=598944.0, ans=0.125 2023-06-20 01:33:23,488 INFO [train.py:996] (0/4) Epoch 4, batch 8350, loss[loss=0.2226, simple_loss=0.2969, pruned_loss=0.07418, over 21218.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3329, pruned_loss=0.0959, over 4250780.68 frames. ], batch size: 176, lr: 8.24e-03, grad_scale: 16.0 2023-06-20 01:34:14,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-06-20 01:34:43,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=599184.0, ans=0.2 2023-06-20 01:34:55,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=599244.0, ans=0.125 2023-06-20 01:35:08,080 INFO [train.py:996] (0/4) Epoch 4, batch 8400, loss[loss=0.2099, simple_loss=0.2596, pruned_loss=0.08007, over 21907.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.331, pruned_loss=0.09382, over 4259718.12 frames. ], batch size: 98, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:35:41,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=599364.0, ans=0.0 2023-06-20 01:35:42,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. 
limit=15.0 2023-06-20 01:36:14,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.656e+02 3.140e+02 3.908e+02 6.671e+02, threshold=6.281e+02, percent-clipped=0.0 2023-06-20 01:36:16,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=599484.0, ans=10.0 2023-06-20 01:36:25,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-20 01:36:29,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=599544.0, ans=0.0 2023-06-20 01:36:50,939 INFO [train.py:996] (0/4) Epoch 4, batch 8450, loss[loss=0.257, simple_loss=0.3102, pruned_loss=0.1019, over 21679.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3303, pruned_loss=0.09439, over 4268303.58 frames. ], batch size: 263, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:37:22,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=599664.0, ans=0.125 2023-06-20 01:38:25,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=599844.0, ans=0.07 2023-06-20 01:38:34,762 INFO [train.py:996] (0/4) Epoch 4, batch 8500, loss[loss=0.2547, simple_loss=0.3154, pruned_loss=0.09703, over 21756.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.326, pruned_loss=0.09555, over 4274781.15 frames. ], batch size: 316, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:39:01,496 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.76 vs. limit=15.0 2023-06-20 01:39:11,022 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-100000.pt 2023-06-20 01:39:44,113 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.032e+02 3.480e+02 4.088e+02 6.738e+02, threshold=6.960e+02, percent-clipped=1.0 2023-06-20 01:40:15,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=600144.0, ans=0.125 2023-06-20 01:40:21,766 INFO [train.py:996] (0/4) Epoch 4, batch 8550, loss[loss=0.2297, simple_loss=0.2971, pruned_loss=0.08111, over 21337.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3309, pruned_loss=0.09789, over 4271269.33 frames. ], batch size: 144, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:41:02,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=600264.0, ans=0.125 2023-06-20 01:41:06,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=600264.0, ans=22.5 2023-06-20 01:42:17,793 INFO [train.py:996] (0/4) Epoch 4, batch 8600, loss[loss=0.3436, simple_loss=0.4398, pruned_loss=0.1236, over 19826.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.339, pruned_loss=0.1008, over 4263288.47 frames. 
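The "Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-100000.pt" entry above is a periodic batch-count checkpoint written every fixed number of training batches, in addition to per-epoch ones. A minimal sketch of that behaviour; the 4000-batch interval and the exact checkpoint contents are assumptions (100000 happens to be a multiple of that interval).

from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                          exp_dir: Path, save_every_n: int = 4000) -> None:
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    checkpoint = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "batch_idx_train": batch_idx_train,
    }
    path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    torch.save(checkpoint, path)
    print(f"Saving checkpoint to {path}")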
], batch size: 702, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:42:40,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=600564.0, ans=0.125 2023-06-20 01:42:48,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=600564.0, ans=0.125 2023-06-20 01:42:51,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=600564.0, ans=0.0 2023-06-20 01:42:53,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=600564.0, ans=0.125 2023-06-20 01:43:15,165 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.082e+02 3.829e+02 4.661e+02 1.059e+03, threshold=7.657e+02, percent-clipped=7.0 2023-06-20 01:43:15,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=600684.0, ans=0.2 2023-06-20 01:43:25,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-20 01:43:27,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=600684.0, ans=0.2 2023-06-20 01:43:51,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=600744.0, ans=0.125 2023-06-20 01:43:56,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=600744.0, ans=0.0 2023-06-20 01:44:06,483 INFO [train.py:996] (0/4) Epoch 4, batch 8650, loss[loss=0.2144, simple_loss=0.3158, pruned_loss=0.05657, over 21783.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3452, pruned_loss=0.1012, over 4270843.36 frames. ], batch size: 332, lr: 8.22e-03, grad_scale: 16.0 2023-06-20 01:44:23,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=600864.0, ans=0.1 2023-06-20 01:44:47,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-20 01:45:43,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=601104.0, ans=0.125 2023-06-20 01:45:44,878 INFO [train.py:996] (0/4) Epoch 4, batch 8700, loss[loss=0.2341, simple_loss=0.2905, pruned_loss=0.08883, over 21846.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3363, pruned_loss=0.09577, over 4262975.42 frames. 
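Because the per-batch train entries follow a fixed format, the loss curve and learning-rate trajectory can be recovered directly from this log. A small parser for that record format (the regex is written against the lines shown here and may need adjusting for other recipes):

import re

RECORD = re.compile(
    r"Epoch (\d+), batch (\d+),.*?tot_loss\[loss=([\d.]+).*?"
    r"lr: ([\d.e+-]+)",
    re.DOTALL)

def parse_train_log(text: str):
    """Yield (epoch, batch, tot_loss, lr) tuples from raw train.py log text."""
    for m in RECORD.finditer(text):
        yield int(m.group(1)), int(m.group(2)), float(m.group(3)), float(m.group(4))

example = ("INFO [train.py:996] (0/4) Epoch 4, batch 8700, "
           "loss[loss=0.2341, simple_loss=0.2905, pruned_loss=0.08883, over 21846.00 frames. ], "
           "tot_loss[loss=0.2639, simple_loss=0.3363, pruned_loss=0.09577, over 4262975.42 frames. ], "
           "batch size: 107, lr: 8.22e-03, grad_scale: 16.0")
print(list(parse_train_log(example)))   # [(4, 8700, 0.2639, 0.00822)]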
], batch size: 107, lr: 8.22e-03, grad_scale: 16.0 2023-06-20 01:46:27,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=601224.0, ans=0.1 2023-06-20 01:46:42,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.831e+02 3.437e+02 4.356e+02 1.035e+03, threshold=6.874e+02, percent-clipped=3.0 2023-06-20 01:47:00,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=601284.0, ans=0.0 2023-06-20 01:47:20,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=601344.0, ans=0.0 2023-06-20 01:47:35,553 INFO [train.py:996] (0/4) Epoch 4, batch 8750, loss[loss=0.2567, simple_loss=0.3232, pruned_loss=0.09511, over 21814.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3334, pruned_loss=0.0975, over 4260460.60 frames. ], batch size: 298, lr: 8.22e-03, grad_scale: 16.0 2023-06-20 01:47:47,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=601404.0, ans=0.0 2023-06-20 01:47:49,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.01 vs. limit=15.0 2023-06-20 01:47:54,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=601464.0, ans=0.125 2023-06-20 01:48:35,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=601584.0, ans=0.2 2023-06-20 01:48:50,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=601584.0, ans=0.125 2023-06-20 01:49:00,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=22.5 2023-06-20 01:49:18,562 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:49:22,809 INFO [train.py:996] (0/4) Epoch 4, batch 8800, loss[loss=0.3297, simple_loss=0.3972, pruned_loss=0.1311, over 21590.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3415, pruned_loss=0.101, over 4266700.45 frames. ], batch size: 389, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 01:49:32,492 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. 
limit=15.0 2023-06-20 01:49:36,850 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:49:48,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=601764.0, ans=0.1 2023-06-20 01:50:10,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=601824.0, ans=0.1 2023-06-20 01:50:13,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=601824.0, ans=0.04949747468305833 2023-06-20 01:50:20,627 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 2.907e+02 3.416e+02 4.301e+02 7.142e+02, threshold=6.833e+02, percent-clipped=3.0 2023-06-20 01:50:45,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=601944.0, ans=0.125 2023-06-20 01:51:03,481 INFO [train.py:996] (0/4) Epoch 4, batch 8850, loss[loss=0.2483, simple_loss=0.3371, pruned_loss=0.07974, over 15965.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3481, pruned_loss=0.1032, over 4261443.17 frames. ], batch size: 61, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 01:51:20,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=602064.0, ans=10.0 2023-06-20 01:51:22,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-20 01:51:27,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=602064.0, ans=0.125 2023-06-20 01:51:33,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=602064.0, ans=0.0 2023-06-20 01:52:21,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=602184.0, ans=0.0 2023-06-20 01:52:24,086 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.31 vs. limit=10.0 2023-06-20 01:52:29,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=602244.0, ans=0.2 2023-06-20 01:52:30,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=602244.0, ans=0.0 2023-06-20 01:52:35,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=602244.0, ans=0.0 2023-06-20 01:52:44,283 INFO [train.py:996] (0/4) Epoch 4, batch 8900, loss[loss=0.2476, simple_loss=0.3005, pruned_loss=0.09733, over 21322.00 frames. ], tot_loss[loss=0.273, simple_loss=0.342, pruned_loss=0.102, over 4259453.99 frames. 
], batch size: 194, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:53:18,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=602364.0, ans=0.125 2023-06-20 01:53:22,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=602364.0, ans=0.2 2023-06-20 01:54:00,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.914e+02 3.498e+02 4.067e+02 9.619e+02, threshold=6.997e+02, percent-clipped=2.0 2023-06-20 01:54:10,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=602484.0, ans=0.035 2023-06-20 01:54:10,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=602484.0, ans=0.0 2023-06-20 01:54:18,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=602544.0, ans=0.1 2023-06-20 01:54:31,860 INFO [train.py:996] (0/4) Epoch 4, batch 8950, loss[loss=0.2353, simple_loss=0.2829, pruned_loss=0.09387, over 21193.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3416, pruned_loss=0.1008, over 4255754.17 frames. ], batch size: 159, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:55:36,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=602724.0, ans=10.0 2023-06-20 01:55:47,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=602784.0, ans=0.125 2023-06-20 01:55:57,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=602844.0, ans=0.125 2023-06-20 01:56:15,333 INFO [train.py:996] (0/4) Epoch 4, batch 9000, loss[loss=0.1902, simple_loss=0.2507, pruned_loss=0.06483, over 21428.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3361, pruned_loss=0.09999, over 4261016.45 frames. ], batch size: 212, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:56:15,336 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 01:56:37,869 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2701, simple_loss=0.3695, pruned_loss=0.08531, over 1796401.00 frames. 2023-06-20 01:56:37,870 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-20 01:56:43,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=602904.0, ans=0.1 2023-06-20 01:57:23,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-20 01:57:40,807 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.934e+02 3.477e+02 4.426e+02 7.521e+02, threshold=6.955e+02, percent-clipped=2.0 2023-06-20 01:58:02,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=603144.0, ans=0.0 2023-06-20 01:58:24,291 INFO [train.py:996] (0/4) Epoch 4, batch 9050, loss[loss=0.239, simple_loss=0.3219, pruned_loss=0.07808, over 21554.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3322, pruned_loss=0.09611, over 4252590.06 frames. 
], batch size: 389, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:58:46,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=603204.0, ans=0.1 2023-06-20 01:59:21,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=603324.0, ans=0.1 2023-06-20 01:59:39,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=603384.0, ans=0.5 2023-06-20 02:00:15,331 INFO [train.py:996] (0/4) Epoch 4, batch 9100, loss[loss=0.2604, simple_loss=0.3387, pruned_loss=0.09101, over 21449.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3372, pruned_loss=0.09781, over 4256278.95 frames. ], batch size: 131, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 02:00:20,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=603504.0, ans=0.0 2023-06-20 02:00:39,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=603564.0, ans=0.07 2023-06-20 02:00:51,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=603624.0, ans=0.2 2023-06-20 02:01:02,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-20 02:01:08,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.683e+02 3.374e+02 4.242e+02 6.313e+02, threshold=6.748e+02, percent-clipped=0.0 2023-06-20 02:01:34,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=603744.0, ans=0.015 2023-06-20 02:01:56,037 INFO [train.py:996] (0/4) Epoch 4, batch 9150, loss[loss=0.2741, simple_loss=0.3601, pruned_loss=0.09406, over 21783.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3362, pruned_loss=0.09574, over 4256000.68 frames. ], batch size: 332, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 02:01:56,964 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-20 02:03:37,731 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-20 02:03:41,438 INFO [train.py:996] (0/4) Epoch 4, batch 9200, loss[loss=0.2886, simple_loss=0.3498, pruned_loss=0.1137, over 21397.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3387, pruned_loss=0.09462, over 4263466.60 frames. 
], batch size: 131, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 02:04:05,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=604164.0, ans=0.125 2023-06-20 02:04:05,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=604164.0, ans=0.125 2023-06-20 02:04:24,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=604224.0, ans=0.125 2023-06-20 02:04:45,439 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.901e+02 3.630e+02 4.447e+02 7.984e+02, threshold=7.260e+02, percent-clipped=1.0 2023-06-20 02:05:04,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.33 vs. limit=12.0 2023-06-20 02:05:24,933 INFO [train.py:996] (0/4) Epoch 4, batch 9250, loss[loss=0.3263, simple_loss=0.4458, pruned_loss=0.1034, over 19761.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3427, pruned_loss=0.09813, over 4267602.06 frames. ], batch size: 702, lr: 8.20e-03, grad_scale: 16.0 2023-06-20 02:05:30,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=604404.0, ans=0.0 2023-06-20 02:05:50,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=604464.0, ans=0.125 2023-06-20 02:07:05,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=604704.0, ans=0.125 2023-06-20 02:07:06,134 INFO [train.py:996] (0/4) Epoch 4, batch 9300, loss[loss=0.2498, simple_loss=0.3344, pruned_loss=0.08266, over 21531.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3377, pruned_loss=0.09824, over 4261161.13 frames. ], batch size: 230, lr: 8.20e-03, grad_scale: 16.0 2023-06-20 02:07:08,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=604704.0, ans=0.1 2023-06-20 02:07:34,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=604764.0, ans=0.125 2023-06-20 02:08:18,258 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 3.021e+02 3.641e+02 4.393e+02 8.139e+02, threshold=7.281e+02, percent-clipped=1.0 2023-06-20 02:08:30,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=604944.0, ans=0.125 2023-06-20 02:08:30,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=604944.0, ans=0.1 2023-06-20 02:08:46,480 INFO [train.py:996] (0/4) Epoch 4, batch 9350, loss[loss=0.3278, simple_loss=0.3878, pruned_loss=0.1339, over 21416.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3434, pruned_loss=0.1, over 4267414.23 frames. 
], batch size: 471, lr: 8.20e-03, grad_scale: 16.0 2023-06-20 02:08:50,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=605004.0, ans=0.2 2023-06-20 02:09:53,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=605124.0, ans=0.0 2023-06-20 02:10:15,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=15.0 2023-06-20 02:10:28,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=605244.0, ans=0.125 2023-06-20 02:10:30,887 INFO [train.py:996] (0/4) Epoch 4, batch 9400, loss[loss=0.2509, simple_loss=0.3054, pruned_loss=0.09816, over 21348.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3461, pruned_loss=0.1009, over 4268346.89 frames. ], batch size: 160, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 02:11:06,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=605364.0, ans=0.2 2023-06-20 02:11:46,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 3.026e+02 3.567e+02 4.359e+02 8.563e+02, threshold=7.134e+02, percent-clipped=2.0 2023-06-20 02:12:13,937 INFO [train.py:996] (0/4) Epoch 4, batch 9450, loss[loss=0.2328, simple_loss=0.2935, pruned_loss=0.086, over 21814.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3367, pruned_loss=0.09929, over 4261663.84 frames. ], batch size: 118, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 02:12:16,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=605604.0, ans=0.025 2023-06-20 02:13:31,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=605784.0, ans=0.04949747468305833 2023-06-20 02:13:52,791 INFO [train.py:996] (0/4) Epoch 4, batch 9500, loss[loss=0.2428, simple_loss=0.301, pruned_loss=0.09225, over 21659.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3285, pruned_loss=0.09702, over 4260239.79 frames. ], batch size: 333, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 02:13:59,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=605904.0, ans=0.125 2023-06-20 02:14:43,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=605964.0, ans=0.125 2023-06-20 02:14:52,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=606024.0, ans=0.2 2023-06-20 02:14:57,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0 2023-06-20 02:15:09,566 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.876e+02 3.483e+02 4.277e+02 8.627e+02, threshold=6.965e+02, percent-clipped=2.0 2023-06-20 02:15:37,598 INFO [train.py:996] (0/4) Epoch 4, batch 9550, loss[loss=0.3141, simple_loss=0.3844, pruned_loss=0.1219, over 21730.00 frames. ], tot_loss[loss=0.266, simple_loss=0.333, pruned_loss=0.0995, over 4264405.65 frames. 
], batch size: 441, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 02:16:35,449 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.47 vs. limit=15.0 2023-06-20 02:16:36,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=606324.0, ans=0.125 2023-06-20 02:17:18,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=606444.0, ans=10.0 2023-06-20 02:17:21,037 INFO [train.py:996] (0/4) Epoch 4, batch 9600, loss[loss=0.221, simple_loss=0.2912, pruned_loss=0.07537, over 21819.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3352, pruned_loss=0.1006, over 4272787.67 frames. ], batch size: 247, lr: 8.19e-03, grad_scale: 32.0 2023-06-20 02:17:38,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=606504.0, ans=0.0 2023-06-20 02:18:16,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=606624.0, ans=0.0 2023-06-20 02:18:28,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=606624.0, ans=0.125 2023-06-20 02:18:32,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=606684.0, ans=0.0 2023-06-20 02:18:36,533 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.961e+02 3.442e+02 3.920e+02 7.478e+02, threshold=6.885e+02, percent-clipped=1.0 2023-06-20 02:18:47,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=606744.0, ans=0.2 2023-06-20 02:18:50,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=606744.0, ans=0.0 2023-06-20 02:19:09,541 INFO [train.py:996] (0/4) Epoch 4, batch 9650, loss[loss=0.2867, simple_loss=0.3448, pruned_loss=0.1143, over 21748.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3338, pruned_loss=0.1003, over 4280876.36 frames. ], batch size: 298, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:19:35,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=606864.0, ans=0.125 2023-06-20 02:19:57,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=606924.0, ans=0.1 2023-06-20 02:20:04,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=606924.0, ans=0.2 2023-06-20 02:20:06,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=606924.0, ans=0.125 2023-06-20 02:20:13,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=606984.0, ans=0.125 2023-06-20 02:20:25,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=606984.0, ans=0.125 2023-06-20 02:20:53,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=607104.0, ans=0.1 2023-06-20 02:20:54,132 INFO [train.py:996] (0/4) Epoch 4, batch 9700, loss[loss=0.301, simple_loss=0.3657, pruned_loss=0.1182, over 21535.00 frames. 
], tot_loss[loss=0.269, simple_loss=0.3374, pruned_loss=0.1003, over 4280098.70 frames. ], batch size: 471, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:21:44,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=607224.0, ans=0.0 2023-06-20 02:21:53,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=607224.0, ans=0.125 2023-06-20 02:21:56,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=607224.0, ans=0.0 2023-06-20 02:22:00,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=607284.0, ans=0.125 2023-06-20 02:22:07,444 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.382e+02 2.962e+02 3.413e+02 3.970e+02 9.096e+02, threshold=6.826e+02, percent-clipped=3.0 2023-06-20 02:22:38,388 INFO [train.py:996] (0/4) Epoch 4, batch 9750, loss[loss=0.2533, simple_loss=0.2883, pruned_loss=0.1091, over 20030.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3313, pruned_loss=0.09942, over 4278099.11 frames. ], batch size: 703, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:22:42,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-20 02:23:18,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-20 02:23:29,134 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=22.5 2023-06-20 02:23:34,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=607524.0, ans=0.1 2023-06-20 02:23:46,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=607584.0, ans=0.125 2023-06-20 02:24:03,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=607644.0, ans=0.2 2023-06-20 02:24:09,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=607644.0, ans=0.125 2023-06-20 02:24:13,833 INFO [train.py:996] (0/4) Epoch 4, batch 9800, loss[loss=0.2293, simple_loss=0.2982, pruned_loss=0.0802, over 21610.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3308, pruned_loss=0.09935, over 4274591.23 frames. 
], batch size: 263, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:24:49,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=607764.0, ans=0.1 2023-06-20 02:25:12,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=607824.0, ans=0.5 2023-06-20 02:25:29,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.802e+02 3.178e+02 3.680e+02 5.783e+02, threshold=6.355e+02, percent-clipped=0.0 2023-06-20 02:25:50,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=607944.0, ans=0.125 2023-06-20 02:25:51,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=607944.0, ans=0.125 2023-06-20 02:25:56,205 INFO [train.py:996] (0/4) Epoch 4, batch 9850, loss[loss=0.2637, simple_loss=0.3136, pruned_loss=0.1069, over 21567.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3291, pruned_loss=0.09985, over 4259632.97 frames. ], batch size: 391, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:26:27,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=608064.0, ans=0.1 2023-06-20 02:26:49,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=608124.0, ans=0.07 2023-06-20 02:27:00,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=608124.0, ans=0.125 2023-06-20 02:27:12,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=15.0 2023-06-20 02:27:32,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=608244.0, ans=0.125 2023-06-20 02:27:41,096 INFO [train.py:996] (0/4) Epoch 4, batch 9900, loss[loss=0.247, simple_loss=0.2984, pruned_loss=0.0978, over 15440.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3256, pruned_loss=0.1001, over 4239340.09 frames. ], batch size: 61, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 02:29:00,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 2.957e+02 3.478e+02 4.825e+02 8.249e+02, threshold=6.956e+02, percent-clipped=2.0 2023-06-20 02:29:17,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=608544.0, ans=0.0 2023-06-20 02:29:24,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=608544.0, ans=0.125 2023-06-20 02:29:31,839 INFO [train.py:996] (0/4) Epoch 4, batch 9950, loss[loss=0.2369, simple_loss=0.2887, pruned_loss=0.09256, over 21390.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.329, pruned_loss=0.1024, over 4249784.95 frames. ], batch size: 211, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 02:30:45,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.89 vs. 
limit=15.0 2023-06-20 02:30:58,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=608844.0, ans=0.125 2023-06-20 02:31:23,739 INFO [train.py:996] (0/4) Epoch 4, batch 10000, loss[loss=0.2365, simple_loss=0.3023, pruned_loss=0.08533, over 21644.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3233, pruned_loss=0.09966, over 4244441.23 frames. ], batch size: 391, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 02:31:58,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=608964.0, ans=0.125 2023-06-20 02:32:12,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=609024.0, ans=0.125 2023-06-20 02:32:32,032 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.674e+02 3.222e+02 3.735e+02 9.123e+02, threshold=6.443e+02, percent-clipped=2.0 2023-06-20 02:32:51,817 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:33:14,580 INFO [train.py:996] (0/4) Epoch 4, batch 10050, loss[loss=0.1949, simple_loss=0.2753, pruned_loss=0.05724, over 20748.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3249, pruned_loss=0.1002, over 4253912.21 frames. ], batch size: 608, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 02:33:24,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=609204.0, ans=0.0 2023-06-20 02:33:47,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=609264.0, ans=0.1 2023-06-20 02:35:07,533 INFO [train.py:996] (0/4) Epoch 4, batch 10100, loss[loss=0.2748, simple_loss=0.3453, pruned_loss=0.1021, over 21915.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.321, pruned_loss=0.09681, over 4257565.94 frames. ], batch size: 316, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 02:35:12,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=609504.0, ans=0.125 2023-06-20 02:35:16,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=609504.0, ans=0.09899494936611666 2023-06-20 02:35:36,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=609564.0, ans=0.0 2023-06-20 02:35:45,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=609624.0, ans=0.025 2023-06-20 02:36:11,679 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 3.027e+02 3.628e+02 4.360e+02 7.943e+02, threshold=7.256e+02, percent-clipped=2.0 2023-06-20 02:36:19,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=609684.0, ans=0.1 2023-06-20 02:36:24,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=609744.0, ans=0.125 2023-06-20 02:36:51,220 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.60 vs. 
limit=15.0 2023-06-20 02:36:52,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=609804.0, ans=0.2 2023-06-20 02:36:53,348 INFO [train.py:996] (0/4) Epoch 4, batch 10150, loss[loss=0.2525, simple_loss=0.327, pruned_loss=0.08899, over 21668.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.328, pruned_loss=0.1001, over 4260536.09 frames. ], batch size: 332, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:37:25,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=609924.0, ans=0.1 2023-06-20 02:37:25,871 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=22.5 2023-06-20 02:38:38,452 INFO [train.py:996] (0/4) Epoch 4, batch 10200, loss[loss=0.2326, simple_loss=0.3114, pruned_loss=0.07693, over 21747.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3273, pruned_loss=0.09756, over 4258303.84 frames. ], batch size: 282, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:39:02,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=610164.0, ans=0.125 2023-06-20 02:39:42,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=610284.0, ans=0.125 2023-06-20 02:39:42,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=610284.0, ans=0.125 2023-06-20 02:39:44,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=610284.0, ans=0.125 2023-06-20 02:39:52,551 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 2.503e+02 3.181e+02 4.095e+02 8.895e+02, threshold=6.363e+02, percent-clipped=3.0 2023-06-20 02:40:06,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=610344.0, ans=0.125 2023-06-20 02:40:23,073 INFO [train.py:996] (0/4) Epoch 4, batch 10250, loss[loss=0.2555, simple_loss=0.3364, pruned_loss=0.08734, over 21620.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3236, pruned_loss=0.09287, over 4246085.60 frames. ], batch size: 389, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:40:46,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=610464.0, ans=0.125 2023-06-20 02:41:18,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=610524.0, ans=0.125 2023-06-20 02:42:09,920 INFO [train.py:996] (0/4) Epoch 4, batch 10300, loss[loss=0.2952, simple_loss=0.3654, pruned_loss=0.1125, over 21427.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3249, pruned_loss=0.09241, over 4255737.99 frames. 
], batch size: 131, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:42:50,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=610824.0, ans=0.125 2023-06-20 02:43:25,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 3.082e+02 3.756e+02 4.509e+02 8.129e+02, threshold=7.512e+02, percent-clipped=5.0 2023-06-20 02:43:42,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-20 02:43:51,340 INFO [train.py:996] (0/4) Epoch 4, batch 10350, loss[loss=0.2327, simple_loss=0.311, pruned_loss=0.0772, over 21890.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3276, pruned_loss=0.09299, over 4255437.58 frames. ], batch size: 373, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:45:03,508 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=12.0 2023-06-20 02:45:12,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=611184.0, ans=0.125 2023-06-20 02:45:26,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=611244.0, ans=0.04949747468305833 2023-06-20 02:45:35,495 INFO [train.py:996] (0/4) Epoch 4, batch 10400, loss[loss=0.2462, simple_loss=0.3162, pruned_loss=0.08813, over 21892.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3202, pruned_loss=0.09143, over 4255962.23 frames. ], batch size: 373, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:45:36,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-20 02:45:43,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=611304.0, ans=0.95 2023-06-20 02:45:45,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.00 vs. limit=22.5 2023-06-20 02:46:30,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=611424.0, ans=0.0 2023-06-20 02:46:39,233 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-20 02:46:40,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=611424.0, ans=0.1 2023-06-20 02:46:55,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=611484.0, ans=0.125 2023-06-20 02:46:56,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.060e+02 3.521e+02 4.304e+02 7.584e+02, threshold=7.042e+02, percent-clipped=1.0 2023-06-20 02:47:21,576 INFO [train.py:996] (0/4) Epoch 4, batch 10450, loss[loss=0.2378, simple_loss=0.3149, pruned_loss=0.0803, over 21401.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3253, pruned_loss=0.09534, over 4256558.72 frames. 
], batch size: 211, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:47:30,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=611604.0, ans=0.0 2023-06-20 02:47:48,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=611664.0, ans=0.125 2023-06-20 02:49:04,970 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-20 02:49:17,076 INFO [train.py:996] (0/4) Epoch 4, batch 10500, loss[loss=0.2718, simple_loss=0.3187, pruned_loss=0.1125, over 21223.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.325, pruned_loss=0.09337, over 4250097.71 frames. ], batch size: 176, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:49:30,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=611904.0, ans=0.125 2023-06-20 02:50:10,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=612024.0, ans=0.125 2023-06-20 02:50:25,510 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.864e+02 3.538e+02 4.749e+02 1.100e+03, threshold=7.075e+02, percent-clipped=4.0 2023-06-20 02:51:01,162 INFO [train.py:996] (0/4) Epoch 4, batch 10550, loss[loss=0.1858, simple_loss=0.2466, pruned_loss=0.06252, over 15213.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3206, pruned_loss=0.09359, over 4235202.44 frames. ], batch size: 60, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:51:04,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=612204.0, ans=0.0 2023-06-20 02:51:13,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=612204.0, ans=0.125 2023-06-20 02:51:18,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-20 02:51:30,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=612264.0, ans=0.0 2023-06-20 02:51:51,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=612324.0, ans=0.125 2023-06-20 02:52:03,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=612384.0, ans=0.125 2023-06-20 02:52:41,303 INFO [train.py:996] (0/4) Epoch 4, batch 10600, loss[loss=0.1906, simple_loss=0.2636, pruned_loss=0.05878, over 21299.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3171, pruned_loss=0.09239, over 4242194.11 frames. 
], batch size: 131, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:53:14,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=612564.0, ans=0.0 2023-06-20 02:53:15,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=612564.0, ans=0.0 2023-06-20 02:53:36,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=612624.0, ans=0.125 2023-06-20 02:53:46,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=612684.0, ans=0.0 2023-06-20 02:53:48,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=612684.0, ans=0.2 2023-06-20 02:53:53,403 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 2.887e+02 3.753e+02 5.124e+02 8.898e+02, threshold=7.506e+02, percent-clipped=7.0 2023-06-20 02:54:22,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=612744.0, ans=0.0 2023-06-20 02:54:22,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=22.5 2023-06-20 02:54:27,796 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-20 02:54:28,393 INFO [train.py:996] (0/4) Epoch 4, batch 10650, loss[loss=0.2024, simple_loss=0.2708, pruned_loss=0.06697, over 21394.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3164, pruned_loss=0.08977, over 4251016.80 frames. ], batch size: 211, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 02:54:39,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=612804.0, ans=0.0 2023-06-20 02:55:18,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=612924.0, ans=0.035 2023-06-20 02:55:54,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=612984.0, ans=0.1 2023-06-20 02:56:00,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=613044.0, ans=0.125 2023-06-20 02:56:03,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=613044.0, ans=0.125 2023-06-20 02:56:09,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=613044.0, ans=0.0 2023-06-20 02:56:11,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=613044.0, ans=0.0 2023-06-20 02:56:24,063 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:56:26,518 INFO [train.py:996] (0/4) Epoch 4, batch 10700, loss[loss=0.3569, simple_loss=0.406, pruned_loss=0.1539, over 21415.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3186, pruned_loss=0.09116, over 4249678.06 frames. 
], batch size: 471, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 02:56:27,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=613104.0, ans=0.125 2023-06-20 02:56:31,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=613104.0, ans=0.2 2023-06-20 02:56:40,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=613104.0, ans=0.125 2023-06-20 02:57:36,052 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 3.228e+02 3.667e+02 4.550e+02 7.977e+02, threshold=7.334e+02, percent-clipped=1.0 2023-06-20 02:58:11,839 INFO [train.py:996] (0/4) Epoch 4, batch 10750, loss[loss=0.3119, simple_loss=0.3954, pruned_loss=0.1142, over 21763.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3275, pruned_loss=0.09564, over 4257614.83 frames. ], batch size: 351, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 02:58:16,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-20 02:58:40,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=613464.0, ans=0.125 2023-06-20 02:58:40,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=613464.0, ans=0.1 2023-06-20 02:58:41,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=613464.0, ans=0.05 2023-06-20 02:59:14,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=613584.0, ans=0.125 2023-06-20 02:59:58,083 INFO [train.py:996] (0/4) Epoch 4, batch 10800, loss[loss=0.2958, simple_loss=0.3544, pruned_loss=0.1186, over 21387.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3339, pruned_loss=0.09699, over 4258136.95 frames. ], batch size: 211, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 03:00:05,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=613704.0, ans=0.2 2023-06-20 03:00:43,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=613824.0, ans=0.05 2023-06-20 03:01:16,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 2.918e+02 3.193e+02 3.822e+02 6.360e+02, threshold=6.386e+02, percent-clipped=0.0 2023-06-20 03:01:41,183 INFO [train.py:996] (0/4) Epoch 4, batch 10850, loss[loss=0.2504, simple_loss=0.3189, pruned_loss=0.09097, over 21776.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3344, pruned_loss=0.09749, over 4264641.86 frames. 
], batch size: 352, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 03:01:48,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=614004.0, ans=0.1 2023-06-20 03:01:51,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=614004.0, ans=0.125 2023-06-20 03:02:55,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=614184.0, ans=0.0 2023-06-20 03:03:01,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=614184.0, ans=0.05 2023-06-20 03:03:04,171 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.43 vs. limit=12.0 2023-06-20 03:03:12,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=614244.0, ans=0.0 2023-06-20 03:03:13,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=614244.0, ans=0.2 2023-06-20 03:03:24,903 INFO [train.py:996] (0/4) Epoch 4, batch 10900, loss[loss=0.2529, simple_loss=0.3408, pruned_loss=0.08252, over 21690.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.327, pruned_loss=0.09508, over 4268828.55 frames. ], batch size: 298, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 03:03:31,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=614304.0, ans=0.2 2023-06-20 03:03:49,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=614364.0, ans=0.1 2023-06-20 03:04:00,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=614364.0, ans=0.125 2023-06-20 03:04:05,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=614364.0, ans=0.2 2023-06-20 03:04:37,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=614484.0, ans=0.125 2023-06-20 03:04:43,220 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.644e+02 3.102e+02 4.157e+02 6.653e+02, threshold=6.203e+02, percent-clipped=4.0 2023-06-20 03:04:47,440 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=22.5 2023-06-20 03:05:04,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=614544.0, ans=0.1 2023-06-20 03:05:07,366 INFO [train.py:996] (0/4) Epoch 4, batch 10950, loss[loss=0.2257, simple_loss=0.2905, pruned_loss=0.0804, over 21885.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3227, pruned_loss=0.09322, over 4271302.74 frames. 
], batch size: 107, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 03:06:16,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=614784.0, ans=0.0 2023-06-20 03:06:19,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=614784.0, ans=0.05 2023-06-20 03:06:37,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=614844.0, ans=0.125 2023-06-20 03:06:46,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=614844.0, ans=0.2 2023-06-20 03:06:49,634 INFO [train.py:996] (0/4) Epoch 4, batch 11000, loss[loss=0.2495, simple_loss=0.3063, pruned_loss=0.0963, over 21688.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3211, pruned_loss=0.09449, over 4270093.16 frames. ], batch size: 230, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 03:08:07,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.675e+02 3.019e+02 3.528e+02 7.831e+02, threshold=6.038e+02, percent-clipped=1.0 2023-06-20 03:08:31,491 INFO [train.py:996] (0/4) Epoch 4, batch 11050, loss[loss=0.2296, simple_loss=0.2854, pruned_loss=0.08686, over 21613.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.319, pruned_loss=0.09574, over 4264083.40 frames. ], batch size: 264, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 03:08:32,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=15.0 2023-06-20 03:08:57,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=615264.0, ans=0.1 2023-06-20 03:09:04,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=615264.0, ans=0.0 2023-06-20 03:09:06,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=615264.0, ans=0.125 2023-06-20 03:09:10,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=615264.0, ans=0.125 2023-06-20 03:09:16,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=615324.0, ans=0.125 2023-06-20 03:10:13,731 INFO [train.py:996] (0/4) Epoch 4, batch 11100, loss[loss=0.2387, simple_loss=0.3045, pruned_loss=0.08643, over 21414.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3201, pruned_loss=0.09592, over 4253132.37 frames. ], batch size: 194, lr: 8.13e-03, grad_scale: 16.0 2023-06-20 03:11:33,206 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 3.025e+02 3.574e+02 4.736e+02 7.982e+02, threshold=7.148e+02, percent-clipped=11.0 2023-06-20 03:11:37,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=615684.0, ans=0.125 2023-06-20 03:11:56,231 INFO [train.py:996] (0/4) Epoch 4, batch 11150, loss[loss=0.3001, simple_loss=0.3484, pruned_loss=0.1259, over 21298.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3189, pruned_loss=0.09571, over 4256562.93 frames. 
], batch size: 471, lr: 8.12e-03, grad_scale: 16.0 2023-06-20 03:12:23,695 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-20 03:13:17,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=616044.0, ans=0.125 2023-06-20 03:13:33,719 INFO [train.py:996] (0/4) Epoch 4, batch 11200, loss[loss=0.2195, simple_loss=0.2863, pruned_loss=0.07632, over 21648.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3161, pruned_loss=0.09465, over 4250343.69 frames. ], batch size: 282, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 03:13:41,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=616104.0, ans=0.1 2023-06-20 03:14:08,464 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.75 vs. limit=10.0 2023-06-20 03:14:12,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=616164.0, ans=0.1 2023-06-20 03:14:54,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.789e+02 3.395e+02 4.282e+02 6.875e+02, threshold=6.790e+02, percent-clipped=0.0 2023-06-20 03:14:56,979 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-20 03:15:09,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=616344.0, ans=0.2 2023-06-20 03:15:17,026 INFO [train.py:996] (0/4) Epoch 4, batch 11250, loss[loss=0.252, simple_loss=0.3222, pruned_loss=0.09088, over 21798.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3154, pruned_loss=0.0945, over 4256961.37 frames. ], batch size: 118, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 03:15:18,685 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=22.5 2023-06-20 03:16:00,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=616524.0, ans=0.125 2023-06-20 03:16:38,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=616584.0, ans=0.125 2023-06-20 03:16:42,806 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=22.5 2023-06-20 03:16:46,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=616644.0, ans=0.0 2023-06-20 03:16:54,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=616644.0, ans=0.1 2023-06-20 03:16:54,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=616644.0, ans=0.0 2023-06-20 03:17:00,623 INFO [train.py:996] (0/4) Epoch 4, batch 11300, loss[loss=0.2311, simple_loss=0.3087, pruned_loss=0.07677, over 21981.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3166, pruned_loss=0.09489, over 4254272.72 frames. 
], batch size: 373, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 03:17:01,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=616704.0, ans=0.0 2023-06-20 03:17:02,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=616704.0, ans=0.1 2023-06-20 03:17:07,764 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-20 03:18:21,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.889e+02 3.588e+02 4.478e+02 6.707e+02, threshold=7.176e+02, percent-clipped=0.0 2023-06-20 03:18:24,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-20 03:18:26,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=616884.0, ans=0.1 2023-06-20 03:18:45,402 INFO [train.py:996] (0/4) Epoch 4, batch 11350, loss[loss=0.3433, simple_loss=0.4, pruned_loss=0.1433, over 21714.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.319, pruned_loss=0.09496, over 4263990.86 frames. ], batch size: 441, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 03:18:45,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=617004.0, ans=0.2 2023-06-20 03:18:49,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=617004.0, ans=0.125 2023-06-20 03:20:23,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-20 03:20:26,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=617244.0, ans=0.1 2023-06-20 03:20:41,073 INFO [train.py:996] (0/4) Epoch 4, batch 11400, loss[loss=0.253, simple_loss=0.3393, pruned_loss=0.08337, over 21749.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3247, pruned_loss=0.09737, over 4268488.79 frames. ], batch size: 332, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 03:21:30,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=617424.0, ans=0.125 2023-06-20 03:21:53,021 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.300e+02 4.007e+02 5.041e+02 8.202e+02, threshold=8.013e+02, percent-clipped=6.0 2023-06-20 03:22:23,885 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:22:26,663 INFO [train.py:996] (0/4) Epoch 4, batch 11450, loss[loss=0.2479, simple_loss=0.3285, pruned_loss=0.08368, over 21464.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3245, pruned_loss=0.09564, over 4266989.49 frames. 
], batch size: 211, lr: 8.11e-03, grad_scale: 16.0 2023-06-20 03:22:27,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=617604.0, ans=0.0 2023-06-20 03:22:44,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=617664.0, ans=0.0 2023-06-20 03:23:08,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=617664.0, ans=0.1 2023-06-20 03:23:16,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=617724.0, ans=0.125 2023-06-20 03:23:56,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=617844.0, ans=0.125 2023-06-20 03:23:59,839 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-20 03:24:05,944 INFO [train.py:996] (0/4) Epoch 4, batch 11500, loss[loss=0.2446, simple_loss=0.3359, pruned_loss=0.07662, over 21804.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3268, pruned_loss=0.09596, over 4270504.95 frames. ], batch size: 282, lr: 8.11e-03, grad_scale: 16.0 2023-06-20 03:24:08,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=617904.0, ans=0.1 2023-06-20 03:24:41,025 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-06-20 03:24:49,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-20 03:25:18,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=618084.0, ans=0.125 2023-06-20 03:25:19,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 3.106e+02 3.529e+02 4.777e+02 9.700e+02, threshold=7.057e+02, percent-clipped=3.0 2023-06-20 03:25:32,851 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-20 03:25:47,103 INFO [train.py:996] (0/4) Epoch 4, batch 11550, loss[loss=0.2867, simple_loss=0.3747, pruned_loss=0.09937, over 21735.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3337, pruned_loss=0.09611, over 4275397.77 frames. ], batch size: 351, lr: 8.11e-03, grad_scale: 16.0 2023-06-20 03:25:50,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=618204.0, ans=0.0 2023-06-20 03:27:02,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=618384.0, ans=0.125 2023-06-20 03:27:37,879 INFO [train.py:996] (0/4) Epoch 4, batch 11600, loss[loss=0.3018, simple_loss=0.3914, pruned_loss=0.1061, over 21663.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3493, pruned_loss=0.0982, over 4273584.57 frames. 
], batch size: 247, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 03:27:39,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=618504.0, ans=0.125 2023-06-20 03:27:50,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=618504.0, ans=0.125 2023-06-20 03:28:10,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=618564.0, ans=0.1 2023-06-20 03:28:55,730 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 2.978e+02 3.550e+02 4.282e+02 6.763e+02, threshold=7.099e+02, percent-clipped=1.0 2023-06-20 03:29:22,215 INFO [train.py:996] (0/4) Epoch 4, batch 11650, loss[loss=0.3337, simple_loss=0.4303, pruned_loss=0.1185, over 21653.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3557, pruned_loss=0.09862, over 4278965.45 frames. ], batch size: 414, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:29:49,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=22.5 2023-06-20 03:30:26,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=618984.0, ans=0.125 2023-06-20 03:30:44,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=619044.0, ans=0.0 2023-06-20 03:30:45,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=619044.0, ans=0.125 2023-06-20 03:30:59,665 INFO [train.py:996] (0/4) Epoch 4, batch 11700, loss[loss=0.2281, simple_loss=0.2825, pruned_loss=0.08687, over 21755.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3456, pruned_loss=0.09867, over 4282889.43 frames. ], batch size: 102, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:31:22,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=619164.0, ans=0.125 2023-06-20 03:31:22,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=619164.0, ans=0.125 2023-06-20 03:32:17,020 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.962e+02 3.618e+02 4.622e+02 7.851e+02, threshold=7.236e+02, percent-clipped=2.0 2023-06-20 03:32:49,977 INFO [train.py:996] (0/4) Epoch 4, batch 11750, loss[loss=0.2577, simple_loss=0.3135, pruned_loss=0.1009, over 21276.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.337, pruned_loss=0.09907, over 4282092.87 frames. ], batch size: 176, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:33:00,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=619404.0, ans=0.0 2023-06-20 03:34:11,536 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-06-20 03:34:35,083 INFO [train.py:996] (0/4) Epoch 4, batch 11800, loss[loss=0.305, simple_loss=0.3639, pruned_loss=0.123, over 21416.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3404, pruned_loss=0.1018, over 4275439.46 frames. 
], batch size: 159, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:35:32,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=619824.0, ans=0.125 2023-06-20 03:35:38,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=619884.0, ans=0.125 2023-06-20 03:35:42,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=619884.0, ans=0.125 2023-06-20 03:35:51,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.018e+02 3.704e+02 4.604e+02 8.251e+02, threshold=7.407e+02, percent-clipped=4.0 2023-06-20 03:35:57,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=619944.0, ans=0.1 2023-06-20 03:36:10,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=619944.0, ans=0.1 2023-06-20 03:36:11,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=619944.0, ans=0.125 2023-06-20 03:36:19,794 INFO [train.py:996] (0/4) Epoch 4, batch 11850, loss[loss=0.2711, simple_loss=0.3498, pruned_loss=0.09622, over 21461.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3406, pruned_loss=0.09994, over 4277613.21 frames. ], batch size: 548, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:36:23,170 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-20 03:36:31,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=620004.0, ans=0.125 2023-06-20 03:36:45,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=620064.0, ans=0.125 2023-06-20 03:37:03,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=620124.0, ans=0.125 2023-06-20 03:37:46,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=620244.0, ans=0.125 2023-06-20 03:37:59,372 INFO [train.py:996] (0/4) Epoch 4, batch 11900, loss[loss=0.261, simple_loss=0.361, pruned_loss=0.0805, over 21647.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3407, pruned_loss=0.09749, over 4277795.77 frames. 
], batch size: 441, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:38:27,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=620364.0, ans=0.125 2023-06-20 03:38:35,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=620364.0, ans=0.0 2023-06-20 03:38:46,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=620424.0, ans=0.07 2023-06-20 03:39:22,718 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.597e+02 3.084e+02 3.622e+02 5.304e+02, threshold=6.167e+02, percent-clipped=0.0 2023-06-20 03:39:25,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=620484.0, ans=0.125 2023-06-20 03:39:26,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-20 03:39:35,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=620544.0, ans=0.02 2023-06-20 03:39:44,840 INFO [train.py:996] (0/4) Epoch 4, batch 11950, loss[loss=0.2247, simple_loss=0.3447, pruned_loss=0.05238, over 19877.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.341, pruned_loss=0.09486, over 4266231.83 frames. ], batch size: 702, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:39:52,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=620604.0, ans=0.0 2023-06-20 03:40:16,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=620664.0, ans=0.125 2023-06-20 03:40:31,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.24 vs. limit=15.0 2023-06-20 03:40:51,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=12.0 2023-06-20 03:41:04,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=620784.0, ans=0.1 2023-06-20 03:41:26,765 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-20 03:41:30,013 INFO [train.py:996] (0/4) Epoch 4, batch 12000, loss[loss=0.2303, simple_loss=0.2862, pruned_loss=0.08724, over 21745.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.337, pruned_loss=0.09306, over 4261021.60 frames. ], batch size: 124, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:41:30,014 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 03:41:51,447 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2681, simple_loss=0.3653, pruned_loss=0.08549, over 1796401.00 frames. 
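Two patterns in the records above can be checked directly from the logged numbers: the tot_loss values are consistent with pruned_loss plus half of simple_loss (suggesting a simple_loss_scale of 0.5), and each grad-norm clipping threshold equals the logged Clipping_scale of 2.0 times the middle of the five quartile values. The short sketch below only re-derives these figures from the log; the 0.5 scale and the use of the median are inferences from the numbers, not taken from the training code itself.

# Sketch: re-derive two quantities that recur in the log above.
# Assumptions (inferred from the logged values, not from train.py/optim.py):
#   * tot_loss is simple_loss_scale * simple_loss + pruned_loss, with
#     simple_loss_scale taken to be 0.5;
#   * the clipping threshold is Clipping_scale (2.0) times the median
#     (third of the five logged grad-norm quartile values).

def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    """Recombine the two logged loss components into the logged total."""
    return simple_loss_scale * simple_loss + pruned_loss

def clip_threshold(quartiles, clipping_scale: float = 2.0) -> float:
    """Threshold as clipping_scale times the median of the logged quartiles."""
    return clipping_scale * quartiles[2]

# Epoch 4, batch 12000 validation record:
#   loss=0.2681, simple_loss=0.3653, pruned_loss=0.08549
print(round(combined_loss(0.3653, 0.08549), 4))              # -> 0.2681

# Grad-norm quartiles 2.247e+02 2.789e+02 3.395e+02 4.282e+02 6.875e+02,
#   threshold=6.790e+02
print(clip_threshold([224.7, 278.9, 339.5, 428.2, 687.5]))   # -> 679.0

The same two checks reproduce the other tot_loss and threshold values logged in this section to within rounding, which is why the 0.5 scale and the median-based threshold are stated above as inferences rather than read from the code.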
2023-06-20 03:41:51,448 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-20 03:42:11,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=620964.0, ans=0.125 2023-06-20 03:42:13,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=620964.0, ans=0.125 2023-06-20 03:43:04,279 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.836e+02 3.434e+02 3.961e+02 6.580e+02, threshold=6.867e+02, percent-clipped=2.0 2023-06-20 03:43:23,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=621144.0, ans=0.0 2023-06-20 03:43:34,853 INFO [train.py:996] (0/4) Epoch 4, batch 12050, loss[loss=0.3297, simple_loss=0.3794, pruned_loss=0.1401, over 21870.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3322, pruned_loss=0.09518, over 4267508.12 frames. ], batch size: 118, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:43:41,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=621204.0, ans=0.2 2023-06-20 03:44:13,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=621264.0, ans=0.95 2023-06-20 03:45:09,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=621444.0, ans=0.125 2023-06-20 03:45:21,143 INFO [train.py:996] (0/4) Epoch 4, batch 12100, loss[loss=0.3508, simple_loss=0.4042, pruned_loss=0.1487, over 21397.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3379, pruned_loss=0.09993, over 4275347.91 frames. ], batch size: 548, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:45:24,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=621504.0, ans=0.1 2023-06-20 03:45:59,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=621564.0, ans=0.0 2023-06-20 03:46:36,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=621684.0, ans=0.125 2023-06-20 03:46:47,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.023e+02 3.734e+02 4.594e+02 9.342e+02, threshold=7.469e+02, percent-clipped=3.0 2023-06-20 03:46:56,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=621744.0, ans=0.125 2023-06-20 03:47:14,427 INFO [train.py:996] (0/4) Epoch 4, batch 12150, loss[loss=0.2537, simple_loss=0.349, pruned_loss=0.07916, over 21782.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3409, pruned_loss=0.09967, over 4271339.10 frames. 
], batch size: 332, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:48:11,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=621924.0, ans=0.5 2023-06-20 03:48:24,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=621984.0, ans=0.07 2023-06-20 03:48:54,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=622044.0, ans=0.1 2023-06-20 03:49:02,711 INFO [train.py:996] (0/4) Epoch 4, batch 12200, loss[loss=0.2299, simple_loss=0.2859, pruned_loss=0.08698, over 21492.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3401, pruned_loss=0.1001, over 4274010.09 frames. ], batch size: 212, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 03:49:07,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=622104.0, ans=0.0 2023-06-20 03:49:10,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=622104.0, ans=0.0 2023-06-20 03:49:11,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=622104.0, ans=0.125 2023-06-20 03:49:34,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=622164.0, ans=0.0 2023-06-20 03:49:40,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=622224.0, ans=0.0 2023-06-20 03:50:06,323 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:50:18,573 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.874e+02 3.548e+02 4.515e+02 8.617e+02, threshold=7.096e+02, percent-clipped=2.0 2023-06-20 03:50:21,484 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.01 vs. limit=8.0 2023-06-20 03:50:25,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=622344.0, ans=0.0 2023-06-20 03:50:43,060 INFO [train.py:996] (0/4) Epoch 4, batch 12250, loss[loss=0.194, simple_loss=0.2696, pruned_loss=0.05924, over 21525.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3322, pruned_loss=0.09663, over 4274810.53 frames. ], batch size: 230, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 03:50:44,302 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-20 03:52:02,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-20 03:52:07,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=622644.0, ans=0.2 2023-06-20 03:52:21,931 INFO [train.py:996] (0/4) Epoch 4, batch 12300, loss[loss=0.2568, simple_loss=0.3451, pruned_loss=0.08428, over 21746.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3224, pruned_loss=0.08945, over 4281770.02 frames. 
], batch size: 332, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 03:52:42,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=622704.0, ans=0.2 2023-06-20 03:52:58,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.75 vs. limit=22.5 2023-06-20 03:53:04,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=622824.0, ans=0.125 2023-06-20 03:53:08,423 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.94 vs. limit=22.5 2023-06-20 03:53:21,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=622824.0, ans=0.07 2023-06-20 03:53:30,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.32 vs. limit=15.0 2023-06-20 03:53:41,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=622884.0, ans=0.1 2023-06-20 03:53:45,504 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.451e+02 2.752e+02 3.675e+02 6.755e+02, threshold=5.504e+02, percent-clipped=0.0 2023-06-20 03:54:00,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=622944.0, ans=0.125 2023-06-20 03:54:09,747 INFO [train.py:996] (0/4) Epoch 4, batch 12350, loss[loss=0.2862, simple_loss=0.3474, pruned_loss=0.1125, over 21869.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3267, pruned_loss=0.09028, over 4282373.10 frames. ], batch size: 107, lr: 8.08e-03, grad_scale: 16.0 2023-06-20 03:54:10,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=623004.0, ans=0.1 2023-06-20 03:54:14,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-06-20 03:54:59,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=623124.0, ans=0.125 2023-06-20 03:55:17,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=623184.0, ans=0.125 2023-06-20 03:55:44,963 INFO [train.py:996] (0/4) Epoch 4, batch 12400, loss[loss=0.2721, simple_loss=0.3237, pruned_loss=0.1103, over 21312.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3289, pruned_loss=0.09399, over 4288332.87 frames. ], batch size: 159, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 03:55:56,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.51 vs. 
limit=10.0 2023-06-20 03:56:03,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=623304.0, ans=0.0 2023-06-20 03:56:17,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=623364.0, ans=0.1 2023-06-20 03:56:20,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=623364.0, ans=0.0 2023-06-20 03:56:20,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=623364.0, ans=0.125 2023-06-20 03:56:25,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=623364.0, ans=0.125 2023-06-20 03:56:42,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=623424.0, ans=0.125 2023-06-20 03:57:06,692 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.226e+02 3.090e+02 3.958e+02 4.907e+02 8.874e+02, threshold=7.916e+02, percent-clipped=17.0 2023-06-20 03:57:11,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=623544.0, ans=0.125 2023-06-20 03:57:32,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=623604.0, ans=0.1 2023-06-20 03:57:33,460 INFO [train.py:996] (0/4) Epoch 4, batch 12450, loss[loss=0.3131, simple_loss=0.377, pruned_loss=0.1246, over 21305.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3331, pruned_loss=0.09738, over 4283470.51 frames. ], batch size: 143, lr: 8.07e-03, grad_scale: 16.0 2023-06-20 03:57:54,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=623664.0, ans=0.1 2023-06-20 03:58:26,923 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.48 vs. limit=5.0 2023-06-20 03:58:48,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=623784.0, ans=0.0 2023-06-20 03:59:07,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=623844.0, ans=0.0 2023-06-20 03:59:12,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=623844.0, ans=0.1 2023-06-20 03:59:19,847 INFO [train.py:996] (0/4) Epoch 4, batch 12500, loss[loss=0.2368, simple_loss=0.2781, pruned_loss=0.09774, over 20179.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3454, pruned_loss=0.1029, over 4284904.74 frames. 
], batch size: 703, lr: 8.07e-03, grad_scale: 16.0 2023-06-20 03:59:56,215 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-104000.pt 2023-06-20 04:00:08,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=624024.0, ans=0.0 2023-06-20 04:00:36,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=624084.0, ans=0.125 2023-06-20 04:00:50,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=624084.0, ans=0.1 2023-06-20 04:00:50,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=624084.0, ans=0.0 2023-06-20 04:00:53,425 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.989e+02 3.343e+02 3.829e+02 7.985e+02, threshold=6.687e+02, percent-clipped=1.0 2023-06-20 04:00:54,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=624144.0, ans=0.0 2023-06-20 04:01:10,300 INFO [train.py:996] (0/4) Epoch 4, batch 12550, loss[loss=0.2949, simple_loss=0.374, pruned_loss=0.1079, over 21725.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3519, pruned_loss=0.1054, over 4288712.10 frames. ], batch size: 441, lr: 8.07e-03, grad_scale: 16.0 2023-06-20 04:01:34,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=624204.0, ans=0.125 2023-06-20 04:02:01,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=624324.0, ans=0.125 2023-06-20 04:02:41,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.96 vs. limit=22.5 2023-06-20 04:03:04,267 INFO [train.py:996] (0/4) Epoch 4, batch 12600, loss[loss=0.2248, simple_loss=0.2821, pruned_loss=0.08375, over 21863.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3494, pruned_loss=0.1024, over 4281011.86 frames. ], batch size: 107, lr: 8.07e-03, grad_scale: 8.0 2023-06-20 04:03:09,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=624504.0, ans=0.125 2023-06-20 04:03:21,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=624564.0, ans=0.05 2023-06-20 04:04:21,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.838e+02 3.340e+02 3.938e+02 7.249e+02, threshold=6.681e+02, percent-clipped=1.0 2023-06-20 04:04:40,934 INFO [train.py:996] (0/4) Epoch 4, batch 12650, loss[loss=0.2647, simple_loss=0.3242, pruned_loss=0.1026, over 21793.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3405, pruned_loss=0.09749, over 4282466.93 frames. ], batch size: 247, lr: 8.07e-03, grad_scale: 8.0 2023-06-20 04:04:56,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=624864.0, ans=0.125 2023-06-20 04:05:21,135 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.21 vs. 
limit=22.5 2023-06-20 04:05:28,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=624924.0, ans=0.125 2023-06-20 04:05:35,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=624924.0, ans=0.1 2023-06-20 04:06:17,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=625044.0, ans=0.0 2023-06-20 04:06:25,554 INFO [train.py:996] (0/4) Epoch 4, batch 12700, loss[loss=0.2851, simple_loss=0.3492, pruned_loss=0.1106, over 21637.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3408, pruned_loss=0.1006, over 4281494.76 frames. ], batch size: 389, lr: 8.06e-03, grad_scale: 8.0 2023-06-20 04:06:29,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=625104.0, ans=0.09899494936611666 2023-06-20 04:06:32,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=625104.0, ans=0.035 2023-06-20 04:07:03,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=625164.0, ans=0.2 2023-06-20 04:07:53,689 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.213e+02 3.839e+02 4.784e+02 8.311e+02, threshold=7.678e+02, percent-clipped=6.0 2023-06-20 04:08:07,947 INFO [train.py:996] (0/4) Epoch 4, batch 12750, loss[loss=0.3349, simple_loss=0.3848, pruned_loss=0.1426, over 21675.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3442, pruned_loss=0.1024, over 4286640.84 frames. ], batch size: 508, lr: 8.06e-03, grad_scale: 8.0 2023-06-20 04:08:37,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=625464.0, ans=0.02 2023-06-20 04:08:45,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=625464.0, ans=0.0 2023-06-20 04:09:03,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=625524.0, ans=0.125 2023-06-20 04:09:08,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-20 04:09:48,363 INFO [train.py:996] (0/4) Epoch 4, batch 12800, loss[loss=0.3017, simple_loss=0.3611, pruned_loss=0.1212, over 21766.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3424, pruned_loss=0.1026, over 4287595.61 frames. ], batch size: 441, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 04:09:48,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=625704.0, ans=0.2 2023-06-20 04:10:35,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-20 04:10:45,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=625824.0, ans=0.1 2023-06-20 04:11:16,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.809e+02 3.205e+02 4.130e+02 6.634e+02, threshold=6.411e+02, percent-clipped=0.0 2023-06-20 04:11:37,282 INFO [train.py:996] (0/4) Epoch 4, batch 12850, loss[loss=0.2855, simple_loss=0.3481, pruned_loss=0.1114, over 21477.00 frames. 
], tot_loss[loss=0.2759, simple_loss=0.3441, pruned_loss=0.1038, over 4283000.42 frames. ], batch size: 131, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 04:11:44,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-20 04:12:12,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=626064.0, ans=0.125 2023-06-20 04:13:27,082 INFO [train.py:996] (0/4) Epoch 4, batch 12900, loss[loss=0.2227, simple_loss=0.2903, pruned_loss=0.07753, over 21188.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3389, pruned_loss=0.09888, over 4276127.18 frames. ], batch size: 159, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 04:13:29,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=626304.0, ans=0.1 2023-06-20 04:14:05,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=626424.0, ans=0.125 2023-06-20 04:14:44,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=626484.0, ans=0.125 2023-06-20 04:14:46,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=626484.0, ans=0.125 2023-06-20 04:14:50,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.844e+02 2.539e+02 2.943e+02 3.440e+02 5.830e+02, threshold=5.886e+02, percent-clipped=0.0 2023-06-20 04:15:12,140 INFO [train.py:996] (0/4) Epoch 4, batch 12950, loss[loss=0.2604, simple_loss=0.3197, pruned_loss=0.1006, over 21365.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3384, pruned_loss=0.09762, over 4281414.69 frames. ], batch size: 194, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:15:12,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=626604.0, ans=0.125 2023-06-20 04:15:26,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=22.5 2023-06-20 04:15:49,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=626724.0, ans=0.1 2023-06-20 04:16:23,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=626784.0, ans=0.0 2023-06-20 04:16:50,317 INFO [train.py:996] (0/4) Epoch 4, batch 13000, loss[loss=0.2011, simple_loss=0.2798, pruned_loss=0.06123, over 21819.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3392, pruned_loss=0.09683, over 4272746.59 frames. 
], batch size: 282, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:17:02,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=626904.0, ans=0.125 2023-06-20 04:17:21,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=626964.0, ans=0.0 2023-06-20 04:18:13,294 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.732e+02 3.360e+02 4.009e+02 6.698e+02, threshold=6.719e+02, percent-clipped=5.0 2023-06-20 04:18:27,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=627144.0, ans=0.2 2023-06-20 04:18:33,519 INFO [train.py:996] (0/4) Epoch 4, batch 13050, loss[loss=0.2728, simple_loss=0.3311, pruned_loss=0.1073, over 21460.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3328, pruned_loss=0.09399, over 4270440.39 frames. ], batch size: 194, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:18:43,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=627204.0, ans=0.125 2023-06-20 04:19:26,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=627324.0, ans=0.125 2023-06-20 04:20:09,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=627444.0, ans=0.2 2023-06-20 04:20:16,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=627444.0, ans=0.1 2023-06-20 04:20:18,890 INFO [train.py:996] (0/4) Epoch 4, batch 13100, loss[loss=0.2647, simple_loss=0.3481, pruned_loss=0.09069, over 21697.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3355, pruned_loss=0.09485, over 4276652.09 frames. ], batch size: 441, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:20:22,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=627504.0, ans=0.0 2023-06-20 04:20:50,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-20 04:21:16,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=627624.0, ans=0.025 2023-06-20 04:21:48,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 3.050e+02 3.631e+02 4.148e+02 7.105e+02, threshold=7.262e+02, percent-clipped=1.0 2023-06-20 04:21:50,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=627744.0, ans=0.125 2023-06-20 04:22:03,663 INFO [train.py:996] (0/4) Epoch 4, batch 13150, loss[loss=0.2013, simple_loss=0.2788, pruned_loss=0.06192, over 21658.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3381, pruned_loss=0.09835, over 4271212.56 frames. ], batch size: 247, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:22:22,198 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.25 vs. 
limit=15.0 2023-06-20 04:22:38,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=627864.0, ans=0.125 2023-06-20 04:22:54,711 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=12.0 2023-06-20 04:22:55,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=627924.0, ans=0.09899494936611666 2023-06-20 04:23:03,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=627924.0, ans=0.1 2023-06-20 04:23:18,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=627984.0, ans=0.125 2023-06-20 04:23:23,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=627984.0, ans=0.2 2023-06-20 04:23:51,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=628044.0, ans=0.02 2023-06-20 04:24:02,025 INFO [train.py:996] (0/4) Epoch 4, batch 13200, loss[loss=0.3052, simple_loss=0.3706, pruned_loss=0.1199, over 21436.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3358, pruned_loss=0.09818, over 4278626.56 frames. ], batch size: 131, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 04:24:10,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=628104.0, ans=0.0 2023-06-20 04:24:47,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=628224.0, ans=0.125 2023-06-20 04:25:12,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=628284.0, ans=0.05 2023-06-20 04:25:27,055 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.207e+02 2.669e+02 3.031e+02 3.643e+02 6.014e+02, threshold=6.063e+02, percent-clipped=0.0 2023-06-20 04:25:38,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=628344.0, ans=0.125 2023-06-20 04:25:48,487 INFO [train.py:996] (0/4) Epoch 4, batch 13250, loss[loss=0.2794, simple_loss=0.3323, pruned_loss=0.1132, over 21439.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3367, pruned_loss=0.09929, over 4272110.36 frames. ], batch size: 548, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 04:25:54,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.81 vs. limit=22.5 2023-06-20 04:26:04,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=628404.0, ans=0.125 2023-06-20 04:26:35,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=628524.0, ans=0.2 2023-06-20 04:26:45,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=628524.0, ans=0.125 2023-06-20 04:26:51,375 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.46 vs. 
limit=10.0 2023-06-20 04:27:40,945 INFO [train.py:996] (0/4) Epoch 4, batch 13300, loss[loss=0.2957, simple_loss=0.3601, pruned_loss=0.1156, over 21777.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.339, pruned_loss=0.09978, over 4272865.13 frames. ], batch size: 118, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 04:27:54,020 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.98 vs. limit=15.0 2023-06-20 04:29:00,439 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=22.5 2023-06-20 04:29:02,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 2.810e+02 3.267e+02 3.707e+02 6.798e+02, threshold=6.534e+02, percent-clipped=1.0 2023-06-20 04:29:15,824 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=12.0 2023-06-20 04:29:19,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.66 vs. limit=15.0 2023-06-20 04:29:21,154 INFO [train.py:996] (0/4) Epoch 4, batch 13350, loss[loss=0.272, simple_loss=0.3541, pruned_loss=0.09495, over 21708.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3436, pruned_loss=0.1028, over 4275001.01 frames. ], batch size: 332, lr: 8.04e-03, grad_scale: 16.0 2023-06-20 04:29:54,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=629064.0, ans=0.125 2023-06-20 04:30:37,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=629184.0, ans=0.0 2023-06-20 04:30:54,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=629244.0, ans=0.0 2023-06-20 04:31:05,742 INFO [train.py:996] (0/4) Epoch 4, batch 13400, loss[loss=0.2612, simple_loss=0.3197, pruned_loss=0.1014, over 21573.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.344, pruned_loss=0.1036, over 4271088.73 frames. ], batch size: 131, lr: 8.04e-03, grad_scale: 16.0 2023-06-20 04:31:11,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=629304.0, ans=0.125 2023-06-20 04:31:32,197 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.81 vs. limit=15.0 2023-06-20 04:32:38,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.043e+02 3.589e+02 4.349e+02 7.690e+02, threshold=7.178e+02, percent-clipped=6.0 2023-06-20 04:32:51,556 INFO [train.py:996] (0/4) Epoch 4, batch 13450, loss[loss=0.2396, simple_loss=0.3106, pruned_loss=0.08428, over 21726.00 frames. ], tot_loss[loss=0.2799, simple_loss=0.3465, pruned_loss=0.1067, over 4270359.41 frames. 
], batch size: 351, lr: 8.04e-03, grad_scale: 16.0 2023-06-20 04:33:44,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=629724.0, ans=0.2 2023-06-20 04:34:16,714 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:34:19,814 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:34:32,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=629844.0, ans=0.1 2023-06-20 04:34:46,860 INFO [train.py:996] (0/4) Epoch 4, batch 13500, loss[loss=0.2469, simple_loss=0.3043, pruned_loss=0.09472, over 21581.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3371, pruned_loss=0.1033, over 4273066.03 frames. ], batch size: 230, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 04:35:32,338 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.54 vs. limit=6.0 2023-06-20 04:35:49,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=630024.0, ans=0.0 2023-06-20 04:36:21,742 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.218e+02 3.754e+02 4.451e+02 7.704e+02, threshold=7.508e+02, percent-clipped=1.0 2023-06-20 04:36:22,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=630144.0, ans=0.0 2023-06-20 04:36:33,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=630204.0, ans=0.125 2023-06-20 04:36:34,735 INFO [train.py:996] (0/4) Epoch 4, batch 13550, loss[loss=0.2727, simple_loss=0.3764, pruned_loss=0.0845, over 21762.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3385, pruned_loss=0.1014, over 4268960.42 frames. ], batch size: 282, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 04:36:37,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-20 04:36:53,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=630204.0, ans=0.1 2023-06-20 04:38:18,750 INFO [train.py:996] (0/4) Epoch 4, batch 13600, loss[loss=0.2864, simple_loss=0.3377, pruned_loss=0.1175, over 21676.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3406, pruned_loss=0.1021, over 4266783.66 frames. ], batch size: 230, lr: 8.03e-03, grad_scale: 32.0 2023-06-20 04:38:22,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=630504.0, ans=0.125 2023-06-20 04:39:13,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=630624.0, ans=0.1 2023-06-20 04:39:50,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.831e+02 3.255e+02 3.648e+02 6.704e+02, threshold=6.511e+02, percent-clipped=0.0 2023-06-20 04:40:01,064 INFO [train.py:996] (0/4) Epoch 4, batch 13650, loss[loss=0.2356, simple_loss=0.2864, pruned_loss=0.09234, over 21137.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3342, pruned_loss=0.09864, over 4270875.95 frames. 
], batch size: 143, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 04:40:21,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=630804.0, ans=0.2 2023-06-20 04:41:03,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=630924.0, ans=0.125 2023-06-20 04:41:04,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=630984.0, ans=0.035 2023-06-20 04:41:45,810 INFO [train.py:996] (0/4) Epoch 4, batch 13700, loss[loss=0.2398, simple_loss=0.3014, pruned_loss=0.08913, over 21623.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3305, pruned_loss=0.09817, over 4266206.89 frames. ], batch size: 247, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 04:42:33,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=15.0 2023-06-20 04:42:58,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=631284.0, ans=0.0 2023-06-20 04:42:58,308 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:43:23,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=631344.0, ans=0.5 2023-06-20 04:43:24,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.956e+02 3.502e+02 4.364e+02 8.587e+02, threshold=7.005e+02, percent-clipped=5.0 2023-06-20 04:43:42,228 INFO [train.py:996] (0/4) Epoch 4, batch 13750, loss[loss=0.2502, simple_loss=0.3161, pruned_loss=0.09211, over 21766.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3282, pruned_loss=0.09647, over 4268676.78 frames. ], batch size: 282, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:44:04,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=631464.0, ans=0.125 2023-06-20 04:44:18,423 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5 2023-06-20 04:44:36,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=631524.0, ans=0.125 2023-06-20 04:45:13,437 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.33 vs. limit=15.0 2023-06-20 04:45:21,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=631644.0, ans=0.125 2023-06-20 04:45:23,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=631644.0, ans=0.125 2023-06-20 04:45:32,626 INFO [train.py:996] (0/4) Epoch 4, batch 13800, loss[loss=0.2736, simple_loss=0.3628, pruned_loss=0.09219, over 21615.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3359, pruned_loss=0.09654, over 4266060.24 frames. 
], batch size: 263, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:45:33,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=631704.0, ans=0.125 2023-06-20 04:45:38,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=631704.0, ans=0.125 2023-06-20 04:45:51,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=631704.0, ans=6.0 2023-06-20 04:46:03,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=631764.0, ans=0.2 2023-06-20 04:47:06,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 3.073e+02 3.642e+02 4.560e+02 8.359e+02, threshold=7.284e+02, percent-clipped=3.0 2023-06-20 04:47:15,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=631944.0, ans=0.125 2023-06-20 04:47:17,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=632004.0, ans=0.2 2023-06-20 04:47:23,557 INFO [train.py:996] (0/4) Epoch 4, batch 13850, loss[loss=0.2113, simple_loss=0.2859, pruned_loss=0.06835, over 21875.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3403, pruned_loss=0.09689, over 4269293.60 frames. ], batch size: 107, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:47:36,334 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=22.5 2023-06-20 04:47:55,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=632064.0, ans=0.0 2023-06-20 04:49:09,030 INFO [train.py:996] (0/4) Epoch 4, batch 13900, loss[loss=0.248, simple_loss=0.3059, pruned_loss=0.09499, over 21673.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3441, pruned_loss=0.1007, over 4270269.90 frames. ], batch size: 263, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:49:15,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=632304.0, ans=0.125 2023-06-20 04:49:21,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=632304.0, ans=0.0 2023-06-20 04:49:21,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=632304.0, ans=0.125 2023-06-20 04:49:32,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-20 04:50:40,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.633e+02 3.556e+02 4.265e+02 5.269e+02 7.474e+02, threshold=8.529e+02, percent-clipped=1.0 2023-06-20 04:50:52,546 INFO [train.py:996] (0/4) Epoch 4, batch 13950, loss[loss=0.2497, simple_loss=0.3233, pruned_loss=0.08808, over 21487.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3449, pruned_loss=0.1031, over 4274629.87 frames. 
], batch size: 131, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:51:08,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=632604.0, ans=10.0 2023-06-20 04:51:12,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=632664.0, ans=0.125 2023-06-20 04:51:50,447 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.88 vs. limit=10.0 2023-06-20 04:52:21,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=632844.0, ans=0.125 2023-06-20 04:52:35,144 INFO [train.py:996] (0/4) Epoch 4, batch 14000, loss[loss=0.2437, simple_loss=0.338, pruned_loss=0.07472, over 21794.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3421, pruned_loss=0.1008, over 4275096.31 frames. ], batch size: 332, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 04:52:57,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-06-20 04:53:26,660 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.85 vs. limit=10.0 2023-06-20 04:54:05,715 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.790e+02 2.775e+02 3.339e+02 4.026e+02 8.444e+02, threshold=6.679e+02, percent-clipped=0.0 2023-06-20 04:54:09,993 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-20 04:54:17,334 INFO [train.py:996] (0/4) Epoch 4, batch 14050, loss[loss=0.249, simple_loss=0.2963, pruned_loss=0.1008, over 21418.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3356, pruned_loss=0.09662, over 4262904.88 frames. ], batch size: 211, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 04:54:40,291 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.70 vs. limit=15.0 2023-06-20 04:54:55,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=633324.0, ans=0.1 2023-06-20 04:55:23,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.39 vs. limit=10.0 2023-06-20 04:55:45,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=633444.0, ans=0.1 2023-06-20 04:55:45,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=633444.0, ans=0.0 2023-06-20 04:56:00,943 INFO [train.py:996] (0/4) Epoch 4, batch 14100, loss[loss=0.2382, simple_loss=0.2977, pruned_loss=0.08936, over 21540.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3288, pruned_loss=0.09552, over 4262968.89 frames. 
], batch size: 263, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 04:57:34,359 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.697e+02 3.241e+02 4.086e+02 6.665e+02, threshold=6.483e+02, percent-clipped=0.0 2023-06-20 04:57:38,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=633744.0, ans=0.125 2023-06-20 04:57:43,886 INFO [train.py:996] (0/4) Epoch 4, batch 14150, loss[loss=0.2331, simple_loss=0.3201, pruned_loss=0.07304, over 21722.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3317, pruned_loss=0.09629, over 4263296.67 frames. ], batch size: 112, lr: 8.01e-03, grad_scale: 16.0 2023-06-20 04:58:07,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=633864.0, ans=0.2 2023-06-20 04:58:11,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=633864.0, ans=0.0 2023-06-20 04:58:23,986 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-20 04:58:54,598 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=12.0 2023-06-20 04:58:55,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=633984.0, ans=0.0 2023-06-20 04:58:59,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=633984.0, ans=0.0 2023-06-20 04:59:24,548 INFO [train.py:996] (0/4) Epoch 4, batch 14200, loss[loss=0.2239, simple_loss=0.304, pruned_loss=0.07195, over 21341.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3289, pruned_loss=0.09381, over 4265235.34 frames. ], batch size: 176, lr: 8.01e-03, grad_scale: 16.0 2023-06-20 05:00:21,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=634224.0, ans=0.0 2023-06-20 05:00:34,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=634284.0, ans=0.125 2023-06-20 05:00:43,044 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:00:52,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.482e+02 2.802e+02 3.379e+02 6.129e+02, threshold=5.605e+02, percent-clipped=0.0 2023-06-20 05:01:07,627 INFO [train.py:996] (0/4) Epoch 4, batch 14250, loss[loss=0.2283, simple_loss=0.3122, pruned_loss=0.07217, over 21532.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3235, pruned_loss=0.09398, over 4262078.64 frames. ], batch size: 441, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 05:01:13,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=634404.0, ans=0.125 2023-06-20 05:01:26,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=634404.0, ans=0.125 2023-06-20 05:02:46,952 INFO [train.py:996] (0/4) Epoch 4, batch 14300, loss[loss=0.2395, simple_loss=0.3023, pruned_loss=0.0883, over 21399.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3242, pruned_loss=0.09289, over 4267696.85 frames. 
], batch size: 131, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 05:02:47,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=634704.0, ans=0.125 2023-06-20 05:03:14,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=634764.0, ans=0.125 2023-06-20 05:03:14,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=634764.0, ans=0.125 2023-06-20 05:04:21,889 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 2.821e+02 3.422e+02 4.230e+02 9.010e+02, threshold=6.844e+02, percent-clipped=9.0 2023-06-20 05:04:31,910 INFO [train.py:996] (0/4) Epoch 4, batch 14350, loss[loss=0.2729, simple_loss=0.3275, pruned_loss=0.1092, over 21470.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3305, pruned_loss=0.09402, over 4255579.41 frames. ], batch size: 131, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 05:04:42,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=635004.0, ans=0.025 2023-06-20 05:04:45,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=635004.0, ans=0.125 2023-06-20 05:06:19,444 INFO [train.py:996] (0/4) Epoch 4, batch 14400, loss[loss=0.2481, simple_loss=0.31, pruned_loss=0.09313, over 21692.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3271, pruned_loss=0.09461, over 4255084.36 frames. ], batch size: 414, lr: 8.00e-03, grad_scale: 32.0 2023-06-20 05:06:44,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=635364.0, ans=0.125 2023-06-20 05:07:42,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.821e+02 3.350e+02 4.136e+02 6.839e+02, threshold=6.700e+02, percent-clipped=0.0 2023-06-20 05:07:57,249 INFO [train.py:996] (0/4) Epoch 4, batch 14450, loss[loss=0.2382, simple_loss=0.3018, pruned_loss=0.0873, over 21790.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3245, pruned_loss=0.09646, over 4259185.56 frames. ], batch size: 124, lr: 8.00e-03, grad_scale: 32.0 2023-06-20 05:08:20,029 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. limit=10.0 2023-06-20 05:08:20,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=15.0 2023-06-20 05:09:38,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=635904.0, ans=0.125 2023-06-20 05:09:39,332 INFO [train.py:996] (0/4) Epoch 4, batch 14500, loss[loss=0.2446, simple_loss=0.3284, pruned_loss=0.08043, over 21744.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3212, pruned_loss=0.09585, over 4269778.20 frames. 
], batch size: 282, lr: 8.00e-03, grad_scale: 32.0 2023-06-20 05:11:13,699 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.861e+02 3.336e+02 4.612e+02 7.217e+02, threshold=6.672e+02, percent-clipped=2.0 2023-06-20 05:11:18,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=636144.0, ans=0.1 2023-06-20 05:11:19,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-20 05:11:24,714 INFO [train.py:996] (0/4) Epoch 4, batch 14550, loss[loss=0.2699, simple_loss=0.3424, pruned_loss=0.0987, over 21766.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3269, pruned_loss=0.0982, over 4276108.76 frames. ], batch size: 332, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:11:44,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=636204.0, ans=0.2 2023-06-20 05:12:30,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=636384.0, ans=0.2 2023-06-20 05:12:53,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=636444.0, ans=0.0 2023-06-20 05:12:54,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=636444.0, ans=0.2 2023-06-20 05:13:10,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=636444.0, ans=0.125 2023-06-20 05:13:16,341 INFO [train.py:996] (0/4) Epoch 4, batch 14600, loss[loss=0.2757, simple_loss=0.353, pruned_loss=0.09917, over 21779.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3339, pruned_loss=0.1019, over 4272473.37 frames. ], batch size: 247, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:13:38,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=636564.0, ans=0.125 2023-06-20 05:13:46,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=636564.0, ans=0.1 2023-06-20 05:14:26,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=636684.0, ans=22.5 2023-06-20 05:14:27,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=636684.0, ans=0.125 2023-06-20 05:14:43,554 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.324e+02 3.039e+02 3.637e+02 4.481e+02 9.662e+02, threshold=7.275e+02, percent-clipped=5.0 2023-06-20 05:14:53,038 INFO [train.py:996] (0/4) Epoch 4, batch 14650, loss[loss=0.1687, simple_loss=0.2574, pruned_loss=0.03997, over 21592.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.336, pruned_loss=0.1006, over 4272847.19 frames. 
], batch size: 230, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:15:21,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=636864.0, ans=0.125 2023-06-20 05:15:54,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=636984.0, ans=0.5 2023-06-20 05:15:54,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=636984.0, ans=0.2 2023-06-20 05:16:00,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=636984.0, ans=0.0 2023-06-20 05:16:27,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.10 vs. limit=12.0 2023-06-20 05:16:40,854 INFO [train.py:996] (0/4) Epoch 4, batch 14700, loss[loss=0.2394, simple_loss=0.3282, pruned_loss=0.07526, over 21525.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3295, pruned_loss=0.09329, over 4272976.58 frames. ], batch size: 508, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:17:48,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-20 05:17:53,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=637284.0, ans=0.0 2023-06-20 05:17:53,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=637284.0, ans=0.1 2023-06-20 05:18:10,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=637344.0, ans=0.0 2023-06-20 05:18:12,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 2.351e+02 2.788e+02 3.519e+02 6.135e+02, threshold=5.577e+02, percent-clipped=0.0 2023-06-20 05:18:22,374 INFO [train.py:996] (0/4) Epoch 4, batch 14750, loss[loss=0.3064, simple_loss=0.3639, pruned_loss=0.1244, over 21492.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3364, pruned_loss=0.09713, over 4276596.88 frames. ], batch size: 194, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:18:50,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=637464.0, ans=0.125 2023-06-20 05:18:53,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=637464.0, ans=0.0 2023-06-20 05:19:00,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=637524.0, ans=0.125 2023-06-20 05:19:27,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.95 vs. limit=22.5 2023-06-20 05:19:46,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=637644.0, ans=0.125 2023-06-20 05:20:01,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=637704.0, ans=0.1 2023-06-20 05:20:03,232 INFO [train.py:996] (0/4) Epoch 4, batch 14800, loss[loss=0.2649, simple_loss=0.3241, pruned_loss=0.1028, over 21822.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3494, pruned_loss=0.1042, over 4273968.35 frames. 
], batch size: 107, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 05:21:37,430 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-20 05:21:42,996 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.164e+02 3.889e+02 4.731e+02 8.129e+02, threshold=7.778e+02, percent-clipped=15.0 2023-06-20 05:22:00,594 INFO [train.py:996] (0/4) Epoch 4, batch 14850, loss[loss=0.2271, simple_loss=0.289, pruned_loss=0.08259, over 21859.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3426, pruned_loss=0.1035, over 4270675.90 frames. ], batch size: 107, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:22:33,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=638064.0, ans=10.0 2023-06-20 05:23:41,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=15.0 2023-06-20 05:23:46,632 INFO [train.py:996] (0/4) Epoch 4, batch 14900, loss[loss=0.291, simple_loss=0.3569, pruned_loss=0.1125, over 21386.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3473, pruned_loss=0.1067, over 4269574.09 frames. ], batch size: 549, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:23:55,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=638304.0, ans=0.05 2023-06-20 05:23:58,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=638304.0, ans=0.125 2023-06-20 05:24:03,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=638364.0, ans=0.2 2023-06-20 05:24:04,371 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. limit=10.0 2023-06-20 05:24:20,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=638364.0, ans=0.0 2023-06-20 05:24:47,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=638484.0, ans=0.035 2023-06-20 05:25:20,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=638544.0, ans=0.05 2023-06-20 05:25:23,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-20 05:25:25,417 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.971e+02 3.790e+02 5.715e+02 1.373e+03, threshold=7.580e+02, percent-clipped=7.0 2023-06-20 05:25:28,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=638544.0, ans=0.0 2023-06-20 05:25:32,238 INFO [train.py:996] (0/4) Epoch 4, batch 14950, loss[loss=0.2222, simple_loss=0.3072, pruned_loss=0.06857, over 21679.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3469, pruned_loss=0.1053, over 4273197.14 frames. ], batch size: 298, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:26:02,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.49 vs. 
limit=15.0 2023-06-20 05:26:14,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. limit=12.0 2023-06-20 05:26:33,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=638784.0, ans=0.2 2023-06-20 05:26:45,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=638784.0, ans=0.125 2023-06-20 05:27:16,868 INFO [train.py:996] (0/4) Epoch 4, batch 15000, loss[loss=0.2496, simple_loss=0.3139, pruned_loss=0.0926, over 21791.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3487, pruned_loss=0.1063, over 4274475.37 frames. ], batch size: 247, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:27:16,870 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 05:27:27,741 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.3967, 2.3946, 2.1203, 2.8840], device='cuda:0') 2023-06-20 05:27:33,903 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2743, simple_loss=0.3665, pruned_loss=0.09108, over 1796401.00 frames. 2023-06-20 05:27:33,904 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-20 05:27:34,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=638904.0, ans=0.125 2023-06-20 05:28:20,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=639024.0, ans=0.0 2023-06-20 05:28:57,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=639084.0, ans=0.125 2023-06-20 05:29:00,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=639144.0, ans=0.0 2023-06-20 05:29:08,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=639144.0, ans=6.0 2023-06-20 05:29:12,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 3.360e+02 3.927e+02 4.837e+02 8.029e+02, threshold=7.853e+02, percent-clipped=2.0 2023-06-20 05:29:24,468 INFO [train.py:996] (0/4) Epoch 4, batch 15050, loss[loss=0.2709, simple_loss=0.3541, pruned_loss=0.09383, over 21642.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.349, pruned_loss=0.1072, over 4270611.89 frames. ], batch size: 263, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 05:30:22,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=639324.0, ans=0.0 2023-06-20 05:30:39,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=639384.0, ans=0.125 2023-06-20 05:30:59,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=639444.0, ans=0.125 2023-06-20 05:31:09,509 INFO [train.py:996] (0/4) Epoch 4, batch 15100, loss[loss=0.2942, simple_loss=0.3571, pruned_loss=0.1157, over 21342.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3493, pruned_loss=0.1057, over 4277816.43 frames. 
], batch size: 159, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 05:32:03,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=639624.0, ans=0.125 2023-06-20 05:32:07,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=639624.0, ans=0.05 2023-06-20 05:32:45,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.063e+02 3.378e+02 3.992e+02 7.623e+02, threshold=6.756e+02, percent-clipped=0.0 2023-06-20 05:32:52,591 INFO [train.py:996] (0/4) Epoch 4, batch 15150, loss[loss=0.2265, simple_loss=0.2865, pruned_loss=0.08332, over 21309.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.3459, pruned_loss=0.1063, over 4276898.48 frames. ], batch size: 211, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 05:33:22,830 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.84 vs. limit=5.0 2023-06-20 05:33:25,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=639864.0, ans=0.0 2023-06-20 05:33:46,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=639924.0, ans=0.125 2023-06-20 05:33:49,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=639924.0, ans=0.05 2023-06-20 05:33:57,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=639984.0, ans=0.2 2023-06-20 05:33:57,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=639984.0, ans=0.2 2023-06-20 05:33:59,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=12.0 2023-06-20 05:34:09,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=639984.0, ans=0.1 2023-06-20 05:34:21,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=640044.0, ans=0.1 2023-06-20 05:34:41,851 INFO [train.py:996] (0/4) Epoch 4, batch 15200, loss[loss=0.2306, simple_loss=0.2799, pruned_loss=0.0907, over 21228.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.336, pruned_loss=0.1014, over 4258467.06 frames. ], batch size: 143, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 05:35:00,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=640104.0, ans=0.1 2023-06-20 05:35:45,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=640284.0, ans=0.125 2023-06-20 05:36:05,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-06-20 05:36:14,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.035e+02 3.960e+02 4.645e+02 7.650e+02, threshold=7.920e+02, percent-clipped=3.0 2023-06-20 05:36:25,897 INFO [train.py:996] (0/4) Epoch 4, batch 15250, loss[loss=0.299, simple_loss=0.3494, pruned_loss=0.1243, over 21467.00 frames. 
], tot_loss[loss=0.2652, simple_loss=0.3308, pruned_loss=0.09981, over 4269526.59 frames. ], batch size: 389, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 05:36:47,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=640464.0, ans=0.1 2023-06-20 05:37:15,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=640524.0, ans=0.125 2023-06-20 05:38:17,099 INFO [train.py:996] (0/4) Epoch 4, batch 15300, loss[loss=0.2248, simple_loss=0.2951, pruned_loss=0.07724, over 20160.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.333, pruned_loss=0.1026, over 4259294.75 frames. ], batch size: 702, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 05:39:03,629 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:39:06,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=640824.0, ans=0.2 2023-06-20 05:39:54,778 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.484e+02 2.881e+02 3.296e+02 3.984e+02 9.139e+02, threshold=6.591e+02, percent-clipped=2.0 2023-06-20 05:40:01,934 INFO [train.py:996] (0/4) Epoch 4, batch 15350, loss[loss=0.3085, simple_loss=0.3652, pruned_loss=0.1259, over 21797.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3388, pruned_loss=0.106, over 4265797.47 frames. ], batch size: 441, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:40:16,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=641004.0, ans=0.95 2023-06-20 05:40:47,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=641124.0, ans=0.0 2023-06-20 05:41:36,551 INFO [train.py:996] (0/4) Epoch 4, batch 15400, loss[loss=0.2663, simple_loss=0.3394, pruned_loss=0.09658, over 21170.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3383, pruned_loss=0.1033, over 4271626.54 frames. ], batch size: 143, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:41:54,931 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:42:36,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=641484.0, ans=0.2 2023-06-20 05:42:36,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=641484.0, ans=0.125 2023-06-20 05:43:07,587 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.755e+02 3.304e+02 3.947e+02 7.271e+02, threshold=6.607e+02, percent-clipped=2.0 2023-06-20 05:43:20,001 INFO [train.py:996] (0/4) Epoch 4, batch 15450, loss[loss=0.2596, simple_loss=0.3259, pruned_loss=0.09668, over 21899.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3362, pruned_loss=0.1018, over 4267393.00 frames. 
], batch size: 107, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:43:20,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=641604.0, ans=0.125 2023-06-20 05:43:33,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=641604.0, ans=0.125 2023-06-20 05:44:26,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=641784.0, ans=0.125 2023-06-20 05:45:10,536 INFO [train.py:996] (0/4) Epoch 4, batch 15500, loss[loss=0.2273, simple_loss=0.2894, pruned_loss=0.08256, over 21284.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3395, pruned_loss=0.1022, over 4257122.16 frames. ], batch size: 608, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:45:22,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=641904.0, ans=0.125 2023-06-20 05:46:04,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=15.0 2023-06-20 05:46:56,144 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.827e+02 3.458e+02 4.345e+02 6.798e+02, threshold=6.916e+02, percent-clipped=2.0 2023-06-20 05:46:57,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.60 vs. limit=15.0 2023-06-20 05:47:00,884 INFO [train.py:996] (0/4) Epoch 4, batch 15550, loss[loss=0.2061, simple_loss=0.2904, pruned_loss=0.06096, over 21719.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3388, pruned_loss=0.09956, over 4257253.77 frames. ], batch size: 298, lr: 7.96e-03, grad_scale: 16.0 2023-06-20 05:48:39,100 INFO [train.py:996] (0/4) Epoch 4, batch 15600, loss[loss=0.242, simple_loss=0.3062, pruned_loss=0.08887, over 21168.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3343, pruned_loss=0.09775, over 4258715.41 frames. ], batch size: 548, lr: 7.95e-03, grad_scale: 32.0 2023-06-20 05:48:39,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=642504.0, ans=0.0 2023-06-20 05:48:49,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=642504.0, ans=0.125 2023-06-20 05:48:52,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.78 vs. 
limit=22.5 2023-06-20 05:48:56,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=642564.0, ans=0.0 2023-06-20 05:48:57,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=642564.0, ans=0.125 2023-06-20 05:49:15,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=642624.0, ans=0.1 2023-06-20 05:49:23,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=642624.0, ans=0.0 2023-06-20 05:49:32,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=642684.0, ans=0.125 2023-06-20 05:49:54,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=642744.0, ans=0.1 2023-06-20 05:50:00,602 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:50:16,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=642744.0, ans=0.125 2023-06-20 05:50:17,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.694e+02 3.221e+02 3.969e+02 6.566e+02, threshold=6.442e+02, percent-clipped=0.0 2023-06-20 05:50:21,218 INFO [train.py:996] (0/4) Epoch 4, batch 15650, loss[loss=0.2932, simple_loss=0.3373, pruned_loss=0.1246, over 21527.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3327, pruned_loss=0.09703, over 4259707.71 frames. ], batch size: 414, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:50:50,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=642864.0, ans=0.125 2023-06-20 05:50:50,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=642864.0, ans=0.125 2023-06-20 05:51:50,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=643044.0, ans=0.125 2023-06-20 05:51:55,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=643044.0, ans=0.0 2023-06-20 05:52:01,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=643044.0, ans=0.125 2023-06-20 05:52:06,105 INFO [train.py:996] (0/4) Epoch 4, batch 15700, loss[loss=0.3143, simple_loss=0.3563, pruned_loss=0.1361, over 21271.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3286, pruned_loss=0.09665, over 4263817.17 frames. ], batch size: 471, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:53:46,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 2.685e+02 3.204e+02 3.710e+02 6.179e+02, threshold=6.407e+02, percent-clipped=0.0 2023-06-20 05:53:49,716 INFO [train.py:996] (0/4) Epoch 4, batch 15750, loss[loss=0.2394, simple_loss=0.2948, pruned_loss=0.092, over 21789.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3231, pruned_loss=0.09626, over 4259222.23 frames. ], batch size: 118, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:54:23,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.85 vs. 
limit=10.0 2023-06-20 05:54:25,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=643524.0, ans=0.125 2023-06-20 05:54:48,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=643584.0, ans=0.015 2023-06-20 05:55:24,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=643644.0, ans=0.0 2023-06-20 05:55:31,434 INFO [train.py:996] (0/4) Epoch 4, batch 15800, loss[loss=0.2302, simple_loss=0.2893, pruned_loss=0.08555, over 21684.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3175, pruned_loss=0.09563, over 4265742.45 frames. ], batch size: 333, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:55:32,516 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-20 05:57:11,281 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.033e+02 3.511e+02 4.104e+02 6.063e+02, threshold=7.023e+02, percent-clipped=0.0 2023-06-20 05:57:14,461 INFO [train.py:996] (0/4) Epoch 4, batch 15850, loss[loss=0.2286, simple_loss=0.3064, pruned_loss=0.07537, over 20058.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3206, pruned_loss=0.09845, over 4270668.81 frames. ], batch size: 703, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:57:16,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=644004.0, ans=0.0 2023-06-20 05:57:33,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-20 05:57:44,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=644064.0, ans=0.1 2023-06-20 05:58:19,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=644184.0, ans=0.5 2023-06-20 05:58:38,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-20 05:58:46,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=644244.0, ans=0.05 2023-06-20 05:58:46,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=644244.0, ans=0.0 2023-06-20 05:58:48,845 INFO [train.py:996] (0/4) Epoch 4, batch 15900, loss[loss=0.249, simple_loss=0.3129, pruned_loss=0.09258, over 21778.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3175, pruned_loss=0.09772, over 4275768.72 frames. 
], batch size: 124, lr: 7.94e-03, grad_scale: 16.0 2023-06-20 05:59:22,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=644424.0, ans=0.0 2023-06-20 06:00:00,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=644484.0, ans=0.0 2023-06-20 06:00:28,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 2.691e+02 3.018e+02 3.866e+02 6.282e+02, threshold=6.037e+02, percent-clipped=0.0 2023-06-20 06:00:32,143 INFO [train.py:996] (0/4) Epoch 4, batch 15950, loss[loss=0.2197, simple_loss=0.2978, pruned_loss=0.07079, over 20765.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3189, pruned_loss=0.09549, over 4271878.19 frames. ], batch size: 607, lr: 7.94e-03, grad_scale: 16.0 2023-06-20 06:00:39,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=644604.0, ans=0.2 2023-06-20 06:01:10,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=644724.0, ans=0.125 2023-06-20 06:02:14,384 INFO [train.py:996] (0/4) Epoch 4, batch 16000, loss[loss=0.2098, simple_loss=0.2982, pruned_loss=0.0607, over 21693.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3206, pruned_loss=0.09301, over 4272109.14 frames. ], batch size: 247, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 06:02:21,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=644904.0, ans=0.0 2023-06-20 06:03:11,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=645084.0, ans=0.2 2023-06-20 06:03:51,836 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.46 vs. limit=12.0 2023-06-20 06:03:53,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 2.704e+02 3.132e+02 3.952e+02 7.192e+02, threshold=6.264e+02, percent-clipped=3.0 2023-06-20 06:03:57,368 INFO [train.py:996] (0/4) Epoch 4, batch 16050, loss[loss=0.349, simple_loss=0.4317, pruned_loss=0.1331, over 21540.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3232, pruned_loss=0.09057, over 4275311.61 frames. ], batch size: 471, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 06:04:01,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=645204.0, ans=0.0 2023-06-20 06:04:58,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=645384.0, ans=0.0 2023-06-20 06:05:36,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=645444.0, ans=0.125 2023-06-20 06:05:40,821 INFO [train.py:996] (0/4) Epoch 4, batch 16100, loss[loss=0.255, simple_loss=0.3382, pruned_loss=0.08593, over 21400.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3266, pruned_loss=0.09183, over 4271836.04 frames. 
], batch size: 211, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 06:05:49,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=645504.0, ans=0.1 2023-06-20 06:05:59,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=645564.0, ans=0.125 2023-06-20 06:07:14,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=645744.0, ans=0.0 2023-06-20 06:07:20,429 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 3.023e+02 3.501e+02 4.351e+02 8.172e+02, threshold=7.003e+02, percent-clipped=2.0 2023-06-20 06:07:23,554 INFO [train.py:996] (0/4) Epoch 4, batch 16150, loss[loss=0.2582, simple_loss=0.315, pruned_loss=0.1007, over 21538.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.328, pruned_loss=0.09515, over 4281731.24 frames. ], batch size: 548, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:07:58,505 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:08:37,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=645984.0, ans=0.2 2023-06-20 06:08:47,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=646044.0, ans=0.0 2023-06-20 06:08:49,681 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-20 06:08:58,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=646044.0, ans=0.0 2023-06-20 06:09:06,401 INFO [train.py:996] (0/4) Epoch 4, batch 16200, loss[loss=0.3633, simple_loss=0.4141, pruned_loss=0.1562, over 21743.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3334, pruned_loss=0.09683, over 4282128.32 frames. ], batch size: 441, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:09:25,986 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-20 06:09:38,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=646224.0, ans=0.125 2023-06-20 06:10:40,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.687e+02 3.085e+02 4.076e+02 6.886e+02, threshold=6.170e+02, percent-clipped=1.0 2023-06-20 06:10:44,035 INFO [train.py:996] (0/4) Epoch 4, batch 16250, loss[loss=0.2363, simple_loss=0.3021, pruned_loss=0.0853, over 21640.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3347, pruned_loss=0.0968, over 4279917.94 frames. ], batch size: 247, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:10:46,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=646404.0, ans=0.0 2023-06-20 06:11:33,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=646524.0, ans=0.2 2023-06-20 06:12:25,371 INFO [train.py:996] (0/4) Epoch 4, batch 16300, loss[loss=0.2768, simple_loss=0.3568, pruned_loss=0.09845, over 21553.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3277, pruned_loss=0.09286, over 4276741.78 frames. 
], batch size: 441, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:13:26,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=646884.0, ans=0.1 2023-06-20 06:13:38,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=646884.0, ans=0.1 2023-06-20 06:13:50,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=646944.0, ans=0.125 2023-06-20 06:14:05,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.535e+02 3.018e+02 3.417e+02 5.954e+02, threshold=6.036e+02, percent-clipped=0.0 2023-06-20 06:14:08,704 INFO [train.py:996] (0/4) Epoch 4, batch 16350, loss[loss=0.3132, simple_loss=0.3712, pruned_loss=0.1276, over 21538.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3256, pruned_loss=0.0931, over 4270857.01 frames. ], batch size: 414, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:14:57,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=647124.0, ans=0.0 2023-06-20 06:15:25,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=647184.0, ans=0.125 2023-06-20 06:15:35,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=647244.0, ans=0.5 2023-06-20 06:15:42,383 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=22.5 2023-06-20 06:15:46,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=647244.0, ans=0.125 2023-06-20 06:15:52,749 INFO [train.py:996] (0/4) Epoch 4, batch 16400, loss[loss=0.2664, simple_loss=0.3261, pruned_loss=0.1034, over 21917.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3305, pruned_loss=0.09606, over 4271820.61 frames. ], batch size: 118, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 06:16:08,692 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-20 06:16:26,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=647364.0, ans=0.125 2023-06-20 06:17:05,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=647484.0, ans=0.1 2023-06-20 06:17:31,944 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 2.888e+02 3.304e+02 3.936e+02 7.106e+02, threshold=6.607e+02, percent-clipped=3.0 2023-06-20 06:17:32,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=647544.0, ans=0.125 2023-06-20 06:17:35,191 INFO [train.py:996] (0/4) Epoch 4, batch 16450, loss[loss=0.2323, simple_loss=0.2965, pruned_loss=0.08405, over 21716.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3299, pruned_loss=0.09689, over 4281285.53 frames. 
], batch size: 230, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 06:17:47,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=647604.0, ans=0.125 2023-06-20 06:19:03,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=647844.0, ans=0.125 2023-06-20 06:19:19,942 INFO [train.py:996] (0/4) Epoch 4, batch 16500, loss[loss=0.3147, simple_loss=0.3843, pruned_loss=0.1226, over 21514.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3304, pruned_loss=0.09755, over 4281693.82 frames. ], batch size: 508, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:19:40,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=647964.0, ans=0.1 2023-06-20 06:19:47,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=647964.0, ans=0.125 2023-06-20 06:19:54,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=647964.0, ans=0.0 2023-06-20 06:19:56,076 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-108000.pt 2023-06-20 06:20:00,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=647964.0, ans=0.125 2023-06-20 06:20:04,456 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.36 vs. limit=10.0 2023-06-20 06:20:51,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=648144.0, ans=0.1 2023-06-20 06:21:03,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.948e+02 3.472e+02 4.236e+02 9.691e+02, threshold=6.943e+02, percent-clipped=9.0 2023-06-20 06:21:05,588 INFO [train.py:996] (0/4) Epoch 4, batch 16550, loss[loss=0.1587, simple_loss=0.1869, pruned_loss=0.06528, over 16768.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3259, pruned_loss=0.09294, over 4273494.61 frames. ], batch size: 60, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:21:06,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=648204.0, ans=0.0 2023-06-20 06:21:06,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=648204.0, ans=0.125 2023-06-20 06:21:57,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=648324.0, ans=0.0 2023-06-20 06:22:06,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=648324.0, ans=0.125 2023-06-20 06:22:07,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=648324.0, ans=0.04949747468305833 2023-06-20 06:22:10,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-20 06:22:59,782 INFO [train.py:996] (0/4) Epoch 4, batch 16600, loss[loss=0.313, simple_loss=0.4062, pruned_loss=0.1099, over 21798.00 frames. 
], tot_loss[loss=0.2653, simple_loss=0.336, pruned_loss=0.09731, over 4274864.14 frames. ], batch size: 282, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:23:17,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=648504.0, ans=0.125 2023-06-20 06:23:25,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=648564.0, ans=0.125 2023-06-20 06:23:41,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=648564.0, ans=0.125 2023-06-20 06:24:48,217 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.275e+02 4.120e+02 5.174e+02 8.172e+02, threshold=8.240e+02, percent-clipped=2.0 2023-06-20 06:24:49,949 INFO [train.py:996] (0/4) Epoch 4, batch 16650, loss[loss=0.2924, simple_loss=0.3621, pruned_loss=0.1113, over 21953.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3447, pruned_loss=0.09886, over 4276485.84 frames. ], batch size: 372, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:25:11,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=648864.0, ans=0.125 2023-06-20 06:25:23,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=648864.0, ans=0.025 2023-06-20 06:25:29,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=648924.0, ans=0.125 2023-06-20 06:26:18,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=649044.0, ans=0.125 2023-06-20 06:26:22,346 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2023-06-20 06:26:41,429 INFO [train.py:996] (0/4) Epoch 4, batch 16700, loss[loss=0.2724, simple_loss=0.351, pruned_loss=0.0969, over 21665.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3496, pruned_loss=0.1016, over 4275387.59 frames. ], batch size: 389, lr: 7.91e-03, grad_scale: 16.0 2023-06-20 06:26:55,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=649104.0, ans=0.1 2023-06-20 06:27:24,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=649224.0, ans=0.2 2023-06-20 06:27:35,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=649224.0, ans=0.125 2023-06-20 06:28:15,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=22.5 2023-06-20 06:28:26,848 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.053e+02 3.657e+02 4.357e+02 8.504e+02, threshold=7.314e+02, percent-clipped=1.0 2023-06-20 06:28:28,511 INFO [train.py:996] (0/4) Epoch 4, batch 16750, loss[loss=0.2976, simple_loss=0.3622, pruned_loss=0.1165, over 19842.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3521, pruned_loss=0.1048, over 4274992.62 frames. 
], batch size: 702, lr: 7.91e-03, grad_scale: 16.0 2023-06-20 06:28:35,280 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-20 06:29:01,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=649464.0, ans=0.2 2023-06-20 06:29:27,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=649524.0, ans=0.0 2023-06-20 06:29:38,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=649524.0, ans=0.2 2023-06-20 06:29:39,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=649584.0, ans=0.125 2023-06-20 06:29:41,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0 2023-06-20 06:29:56,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.14 vs. limit=10.0 2023-06-20 06:30:12,944 INFO [train.py:996] (0/4) Epoch 4, batch 16800, loss[loss=0.2399, simple_loss=0.3197, pruned_loss=0.08002, over 21816.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3553, pruned_loss=0.1039, over 4276772.72 frames. ], batch size: 332, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 06:30:50,866 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2023-06-20 06:30:53,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=649764.0, ans=0.125 2023-06-20 06:31:07,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=649824.0, ans=0.1 2023-06-20 06:31:07,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=649824.0, ans=0.0 2023-06-20 06:31:19,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-20 06:31:28,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=649884.0, ans=0.125 2023-06-20 06:31:29,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=649884.0, ans=0.1 2023-06-20 06:31:37,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=649944.0, ans=0.0 2023-06-20 06:31:58,996 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.418e+02 3.215e+02 3.899e+02 5.075e+02 9.613e+02, threshold=7.798e+02, percent-clipped=9.0 2023-06-20 06:32:00,542 INFO [train.py:996] (0/4) Epoch 4, batch 16850, loss[loss=0.2906, simple_loss=0.3471, pruned_loss=0.1171, over 21785.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3528, pruned_loss=0.104, over 4273108.30 frames. 
], batch size: 441, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 06:32:03,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=650004.0, ans=0.0 2023-06-20 06:33:37,220 INFO [train.py:996] (0/4) Epoch 4, batch 16900, loss[loss=0.2267, simple_loss=0.3022, pruned_loss=0.07557, over 20877.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3451, pruned_loss=0.1021, over 4276280.46 frames. ], batch size: 608, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 06:34:15,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=650364.0, ans=0.04949747468305833 2023-06-20 06:34:16,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-20 06:35:17,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 2.495e+02 2.882e+02 3.347e+02 4.730e+02, threshold=5.764e+02, percent-clipped=0.0 2023-06-20 06:35:18,595 INFO [train.py:996] (0/4) Epoch 4, batch 16950, loss[loss=0.2996, simple_loss=0.3439, pruned_loss=0.1277, over 21764.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3372, pruned_loss=0.09964, over 4271987.77 frames. ], batch size: 508, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 06:35:45,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=650664.0, ans=10.0 2023-06-20 06:36:36,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=650784.0, ans=0.1 2023-06-20 06:37:01,066 INFO [train.py:996] (0/4) Epoch 4, batch 17000, loss[loss=0.2667, simple_loss=0.3182, pruned_loss=0.1076, over 21226.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3338, pruned_loss=0.1004, over 4273216.32 frames. ], batch size: 608, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 06:37:25,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=650904.0, ans=0.125 2023-06-20 06:37:33,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=650964.0, ans=0.1 2023-06-20 06:37:45,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=650964.0, ans=0.0 2023-06-20 06:38:05,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=651024.0, ans=0.0 2023-06-20 06:38:14,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.12 vs. 
limit=6.0 2023-06-20 06:38:21,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=651084.0, ans=0.125 2023-06-20 06:38:21,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=651084.0, ans=0.0 2023-06-20 06:38:22,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=651084.0, ans=0.125 2023-06-20 06:38:57,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 2.930e+02 3.368e+02 4.146e+02 7.848e+02, threshold=6.737e+02, percent-clipped=5.0 2023-06-20 06:38:57,206 INFO [train.py:996] (0/4) Epoch 4, batch 17050, loss[loss=0.2565, simple_loss=0.3106, pruned_loss=0.1012, over 20253.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3425, pruned_loss=0.1045, over 4280107.80 frames. ], batch size: 707, lr: 7.90e-03, grad_scale: 16.0 2023-06-20 06:39:50,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=651324.0, ans=0.1 2023-06-20 06:39:57,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=651384.0, ans=0.125 2023-06-20 06:40:03,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=651384.0, ans=0.0 2023-06-20 06:40:37,634 INFO [train.py:996] (0/4) Epoch 4, batch 17100, loss[loss=0.2773, simple_loss=0.3408, pruned_loss=0.107, over 21789.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3419, pruned_loss=0.1058, over 4279458.96 frames. ], batch size: 112, lr: 7.90e-03, grad_scale: 16.0 2023-06-20 06:40:44,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=651504.0, ans=0.125 2023-06-20 06:40:46,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=651504.0, ans=0.125 2023-06-20 06:41:39,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=651684.0, ans=0.125 2023-06-20 06:41:56,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=651744.0, ans=0.125 2023-06-20 06:42:19,611 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.907e+02 3.321e+02 3.692e+02 6.035e+02, threshold=6.643e+02, percent-clipped=0.0 2023-06-20 06:42:19,641 INFO [train.py:996] (0/4) Epoch 4, batch 17150, loss[loss=0.2369, simple_loss=0.3026, pruned_loss=0.08564, over 21494.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3374, pruned_loss=0.105, over 4278730.91 frames. ], batch size: 211, lr: 7.90e-03, grad_scale: 16.0 2023-06-20 06:42:47,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-06-20 06:42:58,666 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-20 06:44:07,986 INFO [train.py:996] (0/4) Epoch 4, batch 17200, loss[loss=0.3534, simple_loss=0.3935, pruned_loss=0.1567, over 21289.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3367, pruned_loss=0.104, over 4273569.72 frames. 
], batch size: 507, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 06:44:35,276 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:44:37,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-20 06:44:43,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=652164.0, ans=0.125 2023-06-20 06:45:23,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=652284.0, ans=0.0 2023-06-20 06:45:56,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=652404.0, ans=0.0 2023-06-20 06:45:57,643 INFO [train.py:996] (0/4) Epoch 4, batch 17250, loss[loss=0.2756, simple_loss=0.3463, pruned_loss=0.1025, over 21732.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3418, pruned_loss=0.107, over 4273086.67 frames. ], batch size: 298, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:45:59,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.975e+02 3.249e+02 4.201e+02 6.802e+02, threshold=6.498e+02, percent-clipped=2.0 2023-06-20 06:46:06,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=652404.0, ans=0.1 2023-06-20 06:46:08,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-20 06:47:12,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=652584.0, ans=0.125 2023-06-20 06:47:28,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=652644.0, ans=0.125 2023-06-20 06:47:36,656 INFO [train.py:996] (0/4) Epoch 4, batch 17300, loss[loss=0.3064, simple_loss=0.3656, pruned_loss=0.1236, over 21667.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3505, pruned_loss=0.1109, over 4274256.76 frames. ], batch size: 351, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:47:45,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=652704.0, ans=0.125 2023-06-20 06:47:51,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=652764.0, ans=0.0 2023-06-20 06:48:12,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=652824.0, ans=0.035 2023-06-20 06:48:19,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=652824.0, ans=0.125 2023-06-20 06:48:27,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=652824.0, ans=0.09899494936611666 2023-06-20 06:49:13,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-06-20 06:49:17,064 INFO [train.py:996] (0/4) Epoch 4, batch 17350, loss[loss=0.2394, simple_loss=0.3262, pruned_loss=0.07628, over 21750.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3501, pruned_loss=0.1096, over 4280026.87 frames. 
], batch size: 332, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:49:18,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.182e+02 3.780e+02 4.285e+02 5.975e+02, threshold=7.560e+02, percent-clipped=0.0 2023-06-20 06:49:22,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=653004.0, ans=0.1 2023-06-20 06:49:57,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=653064.0, ans=0.1 2023-06-20 06:50:10,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=653124.0, ans=0.0 2023-06-20 06:50:11,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-06-20 06:50:31,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=653184.0, ans=0.09899494936611666 2023-06-20 06:50:48,869 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-20 06:50:51,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=653244.0, ans=0.125 2023-06-20 06:50:51,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=653244.0, ans=0.125 2023-06-20 06:50:59,161 INFO [train.py:996] (0/4) Epoch 4, batch 17400, loss[loss=0.3349, simple_loss=0.408, pruned_loss=0.1309, over 21462.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3466, pruned_loss=0.1052, over 4275302.04 frames. ], batch size: 471, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:51:12,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-20 06:51:41,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=22.5 2023-06-20 06:52:26,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=653544.0, ans=0.07 2023-06-20 06:52:32,403 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.43 vs. limit=8.0 2023-06-20 06:52:42,169 INFO [train.py:996] (0/4) Epoch 4, batch 17450, loss[loss=0.2499, simple_loss=0.3058, pruned_loss=0.09703, over 20124.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3404, pruned_loss=0.1014, over 4266488.54 frames. 
], batch size: 707, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:52:43,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.974e+02 3.600e+02 4.231e+02 7.262e+02, threshold=7.200e+02, percent-clipped=0.0 2023-06-20 06:52:51,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=653604.0, ans=0.125 2023-06-20 06:53:20,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=653664.0, ans=0.125 2023-06-20 06:53:50,878 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=15.0 2023-06-20 06:54:28,274 INFO [train.py:996] (0/4) Epoch 4, batch 17500, loss[loss=0.2245, simple_loss=0.3241, pruned_loss=0.06244, over 19856.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3349, pruned_loss=0.0976, over 4270803.88 frames. ], batch size: 703, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:54:40,697 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:54:52,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=653904.0, ans=0.0 2023-06-20 06:54:55,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-06-20 06:55:06,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=653964.0, ans=0.0 2023-06-20 06:55:26,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.09 vs. limit=15.0 2023-06-20 06:55:26,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=654024.0, ans=0.125 2023-06-20 06:55:33,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=654084.0, ans=0.1 2023-06-20 06:55:39,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=654084.0, ans=0.1 2023-06-20 06:56:02,454 INFO [train.py:996] (0/4) Epoch 4, batch 17550, loss[loss=0.2257, simple_loss=0.3041, pruned_loss=0.07367, over 16147.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3336, pruned_loss=0.09538, over 4261625.44 frames. 
], batch size: 62, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 06:56:04,006 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 2.427e+02 2.904e+02 3.611e+02 6.733e+02, threshold=5.808e+02, percent-clipped=0.0 2023-06-20 06:56:11,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=654204.0, ans=0.2 2023-06-20 06:56:26,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=654204.0, ans=0.0 2023-06-20 06:57:28,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=654444.0, ans=0.0 2023-06-20 06:57:32,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=654444.0, ans=0.0 2023-06-20 06:57:43,306 INFO [train.py:996] (0/4) Epoch 4, batch 17600, loss[loss=0.2832, simple_loss=0.3486, pruned_loss=0.1089, over 21684.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3356, pruned_loss=0.09574, over 4271023.85 frames. ], batch size: 351, lr: 7.88e-03, grad_scale: 32.0 2023-06-20 06:58:29,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.46 vs. limit=15.0 2023-06-20 06:58:46,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-20 06:58:47,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=654624.0, ans=0.0 2023-06-20 06:58:57,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=654684.0, ans=0.125 2023-06-20 06:59:32,097 INFO [train.py:996] (0/4) Epoch 4, batch 17650, loss[loss=0.2554, simple_loss=0.3288, pruned_loss=0.09103, over 21568.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3361, pruned_loss=0.09593, over 4260839.40 frames. ], batch size: 441, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 06:59:40,727 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 3.037e+02 3.780e+02 4.490e+02 8.251e+02, threshold=7.559e+02, percent-clipped=12.0 2023-06-20 06:59:42,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=654804.0, ans=0.1 2023-06-20 07:00:17,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=654864.0, ans=0.125 2023-06-20 07:01:00,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=655044.0, ans=0.125 2023-06-20 07:01:01,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=655044.0, ans=0.0 2023-06-20 07:01:20,976 INFO [train.py:996] (0/4) Epoch 4, batch 17700, loss[loss=0.2393, simple_loss=0.3108, pruned_loss=0.08391, over 21354.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3282, pruned_loss=0.09261, over 4255470.30 frames. 
], batch size: 159, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 07:01:59,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=655224.0, ans=0.125 2023-06-20 07:01:59,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=655224.0, ans=0.125 2023-06-20 07:02:17,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=655284.0, ans=0.125 2023-06-20 07:02:45,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=655344.0, ans=0.125 2023-06-20 07:03:00,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=655344.0, ans=0.0 2023-06-20 07:03:06,212 INFO [train.py:996] (0/4) Epoch 4, batch 17750, loss[loss=0.2523, simple_loss=0.3334, pruned_loss=0.08559, over 19971.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3367, pruned_loss=0.09698, over 4256945.14 frames. ], batch size: 703, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 07:03:08,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=655404.0, ans=0.1 2023-06-20 07:03:09,447 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.874e+02 3.454e+02 4.122e+02 5.655e+02, threshold=6.909e+02, percent-clipped=0.0 2023-06-20 07:03:16,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=655404.0, ans=0.125 2023-06-20 07:03:31,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=655464.0, ans=0.125 2023-06-20 07:04:50,181 INFO [train.py:996] (0/4) Epoch 4, batch 17800, loss[loss=0.2526, simple_loss=0.3307, pruned_loss=0.08722, over 21916.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3378, pruned_loss=0.09748, over 4257887.58 frames. ], batch size: 317, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:05:12,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=655764.0, ans=0.0 2023-06-20 07:06:33,119 INFO [train.py:996] (0/4) Epoch 4, batch 17850, loss[loss=0.3088, simple_loss=0.364, pruned_loss=0.1268, over 21464.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3374, pruned_loss=0.0978, over 4261024.13 frames. ], batch size: 211, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:06:36,780 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.792e+02 3.277e+02 4.116e+02 6.981e+02, threshold=6.554e+02, percent-clipped=1.0 2023-06-20 07:06:47,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=656004.0, ans=0.1 2023-06-20 07:07:23,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=656124.0, ans=0.125 2023-06-20 07:08:11,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=656244.0, ans=0.125 2023-06-20 07:08:17,942 INFO [train.py:996] (0/4) Epoch 4, batch 17900, loss[loss=0.3166, simple_loss=0.3939, pruned_loss=0.1197, over 21598.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3424, pruned_loss=0.09985, over 4262155.47 frames. 
], batch size: 414, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:09:09,390 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:09:57,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=656544.0, ans=0.0 2023-06-20 07:10:00,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=656604.0, ans=0.0 2023-06-20 07:10:01,334 INFO [train.py:996] (0/4) Epoch 4, batch 17950, loss[loss=0.1777, simple_loss=0.2735, pruned_loss=0.041, over 21517.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3408, pruned_loss=0.09636, over 4259473.04 frames. ], batch size: 230, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:10:04,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.785e+02 3.219e+02 3.835e+02 8.514e+02, threshold=6.438e+02, percent-clipped=3.0 2023-06-20 07:10:27,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=656604.0, ans=0.125 2023-06-20 07:10:28,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=656664.0, ans=0.0 2023-06-20 07:11:00,024 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2023-06-20 07:11:01,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=656724.0, ans=0.1 2023-06-20 07:11:44,091 INFO [train.py:996] (0/4) Epoch 4, batch 18000, loss[loss=0.2534, simple_loss=0.3079, pruned_loss=0.0995, over 21553.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3336, pruned_loss=0.09458, over 4266885.93 frames. ], batch size: 414, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 07:11:44,092 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 07:12:01,877 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.6694, 2.2235, 1.2828, 1.2889], device='cuda:0') 2023-06-20 07:12:05,674 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2767, simple_loss=0.3741, pruned_loss=0.08966, over 1796401.00 frames. 2023-06-20 07:12:05,675 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-20 07:12:27,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=656904.0, ans=0.125 2023-06-20 07:12:41,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=656964.0, ans=0.1 2023-06-20 07:13:55,480 INFO [train.py:996] (0/4) Epoch 4, batch 18050, loss[loss=0.275, simple_loss=0.3348, pruned_loss=0.1075, over 21694.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.328, pruned_loss=0.09421, over 4271654.65 frames. 
], batch size: 298, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 07:14:03,912 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 3.056e+02 3.833e+02 4.791e+02 8.139e+02, threshold=7.666e+02, percent-clipped=8.0 2023-06-20 07:14:04,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=657204.0, ans=0.125 2023-06-20 07:14:26,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=657264.0, ans=0.125 2023-06-20 07:14:45,246 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.51 vs. limit=15.0 2023-06-20 07:14:55,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=657384.0, ans=0.2 2023-06-20 07:15:02,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=657384.0, ans=0.125 2023-06-20 07:15:08,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657384.0, ans=0.1 2023-06-20 07:15:08,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=657384.0, ans=0.2 2023-06-20 07:15:10,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=657444.0, ans=0.0 2023-06-20 07:15:38,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=657444.0, ans=0.125 2023-06-20 07:15:41,238 INFO [train.py:996] (0/4) Epoch 4, batch 18100, loss[loss=0.2626, simple_loss=0.3287, pruned_loss=0.09825, over 20695.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3333, pruned_loss=0.09746, over 4266835.29 frames. ], batch size: 607, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:15:53,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=657504.0, ans=0.2 2023-06-20 07:15:53,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=657504.0, ans=0.125 2023-06-20 07:16:22,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.02 vs. limit=12.0 2023-06-20 07:16:25,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=657624.0, ans=0.125 2023-06-20 07:17:18,298 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:17:26,430 INFO [train.py:996] (0/4) Epoch 4, batch 18150, loss[loss=0.2326, simple_loss=0.2936, pruned_loss=0.08584, over 21367.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3346, pruned_loss=0.09647, over 4270181.69 frames. 
], batch size: 131, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:17:28,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=657804.0, ans=0.1 2023-06-20 07:17:31,382 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.784e+02 3.397e+02 4.397e+02 8.554e+02, threshold=6.794e+02, percent-clipped=1.0 2023-06-20 07:17:56,030 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:17:59,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=22.5 2023-06-20 07:18:10,793 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0 2023-06-20 07:18:28,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=657984.0, ans=0.125 2023-06-20 07:18:53,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=658044.0, ans=0.125 2023-06-20 07:19:02,674 INFO [train.py:996] (0/4) Epoch 4, batch 18200, loss[loss=0.2323, simple_loss=0.2937, pruned_loss=0.08548, over 21769.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3293, pruned_loss=0.09688, over 4267573.49 frames. ], batch size: 118, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:19:16,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=658104.0, ans=0.125 2023-06-20 07:19:24,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-20 07:19:27,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=658164.0, ans=0.125 2023-06-20 07:19:44,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=658224.0, ans=0.02 2023-06-20 07:20:31,256 INFO [train.py:996] (0/4) Epoch 4, batch 18250, loss[loss=0.1836, simple_loss=0.2545, pruned_loss=0.05639, over 16950.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3212, pruned_loss=0.09356, over 4260543.92 frames. ], batch size: 64, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:20:41,507 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.684e+02 3.179e+02 3.917e+02 6.277e+02, threshold=6.359e+02, percent-clipped=0.0 2023-06-20 07:20:41,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=658404.0, ans=0.2 2023-06-20 07:20:55,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=658404.0, ans=0.125 2023-06-20 07:21:28,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=12.0 2023-06-20 07:21:46,352 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-20 07:22:08,079 INFO [train.py:996] (0/4) Epoch 4, batch 18300, loss[loss=0.2973, simple_loss=0.4004, pruned_loss=0.09708, over 21398.00 frames. 
], tot_loss[loss=0.2539, simple_loss=0.3205, pruned_loss=0.0936, over 4272167.21 frames. ], batch size: 548, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:22:49,350 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-20 07:22:51,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.27 vs. limit=10.0 2023-06-20 07:23:24,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=658884.0, ans=0.125 2023-06-20 07:23:36,614 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-20 07:23:37,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=658944.0, ans=0.07 2023-06-20 07:23:38,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=658944.0, ans=0.07 2023-06-20 07:23:49,846 INFO [train.py:996] (0/4) Epoch 4, batch 18350, loss[loss=0.2442, simple_loss=0.3183, pruned_loss=0.08505, over 21219.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3271, pruned_loss=0.09359, over 4268658.29 frames. ], batch size: 176, lr: 7.85e-03, grad_scale: 16.0 2023-06-20 07:23:50,233 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:23:50,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=659004.0, ans=0.1 2023-06-20 07:24:00,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.835e+02 3.516e+02 4.714e+02 7.993e+02, threshold=7.032e+02, percent-clipped=4.0 2023-06-20 07:24:45,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=659124.0, ans=0.125 2023-06-20 07:25:05,380 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-06-20 07:25:06,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=659184.0, ans=0.125 2023-06-20 07:25:38,249 INFO [train.py:996] (0/4) Epoch 4, batch 18400, loss[loss=0.2053, simple_loss=0.2803, pruned_loss=0.06519, over 21481.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3225, pruned_loss=0.09232, over 4273934.44 frames. ], batch size: 212, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:25:59,310 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.35 vs. limit=8.0 2023-06-20 07:26:19,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=659364.0, ans=0.2 2023-06-20 07:26:51,942 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.22 vs. 
limit=15.0 2023-06-20 07:26:57,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=659484.0, ans=0.0 2023-06-20 07:26:57,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=659484.0, ans=0.125 2023-06-20 07:27:27,673 INFO [train.py:996] (0/4) Epoch 4, batch 18450, loss[loss=0.2501, simple_loss=0.3174, pruned_loss=0.09133, over 21882.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.318, pruned_loss=0.08783, over 4261907.74 frames. ], batch size: 373, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:27:32,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-20 07:27:37,441 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.607e+02 3.111e+02 4.080e+02 7.142e+02, threshold=6.222e+02, percent-clipped=1.0 2023-06-20 07:27:46,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=659604.0, ans=0.0 2023-06-20 07:28:41,196 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-20 07:29:09,394 INFO [train.py:996] (0/4) Epoch 4, batch 18500, loss[loss=0.2438, simple_loss=0.2938, pruned_loss=0.09686, over 21188.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3134, pruned_loss=0.08668, over 4249563.63 frames. ], batch size: 548, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:29:36,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=659964.0, ans=0.125 2023-06-20 07:29:56,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=660024.0, ans=0.95 2023-06-20 07:30:51,290 INFO [train.py:996] (0/4) Epoch 4, batch 18550, loss[loss=0.2393, simple_loss=0.2968, pruned_loss=0.09091, over 21805.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3122, pruned_loss=0.0863, over 4238572.46 frames. ], batch size: 118, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:31:01,307 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.666e+02 3.227e+02 3.857e+02 6.093e+02, threshold=6.453e+02, percent-clipped=0.0 2023-06-20 07:31:16,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=660264.0, ans=0.025 2023-06-20 07:31:39,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. limit=10.0 2023-06-20 07:31:52,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=660384.0, ans=0.0 2023-06-20 07:32:27,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=660444.0, ans=0.125 2023-06-20 07:32:39,732 INFO [train.py:996] (0/4) Epoch 4, batch 18600, loss[loss=0.2036, simple_loss=0.27, pruned_loss=0.06856, over 21396.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3117, pruned_loss=0.08804, over 4232371.19 frames. 
], batch size: 131, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:32:45,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=660504.0, ans=0.0 2023-06-20 07:32:53,978 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-20 07:33:27,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=660624.0, ans=0.1 2023-06-20 07:33:31,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=660684.0, ans=0.0 2023-06-20 07:33:48,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=660744.0, ans=0.1 2023-06-20 07:34:16,731 INFO [train.py:996] (0/4) Epoch 4, batch 18650, loss[loss=0.2488, simple_loss=0.3428, pruned_loss=0.07736, over 20850.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3113, pruned_loss=0.08853, over 4237946.02 frames. ], batch size: 609, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:34:26,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.903e+02 3.307e+02 4.145e+02 6.218e+02, threshold=6.614e+02, percent-clipped=0.0 2023-06-20 07:34:28,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=660804.0, ans=0.1 2023-06-20 07:35:47,939 INFO [train.py:996] (0/4) Epoch 4, batch 18700, loss[loss=0.2274, simple_loss=0.2817, pruned_loss=0.08657, over 21580.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.309, pruned_loss=0.09017, over 4244811.71 frames. ], batch size: 263, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:36:10,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-20 07:36:17,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=661104.0, ans=0.125 2023-06-20 07:36:30,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-20 07:36:57,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=661284.0, ans=0.0 2023-06-20 07:37:03,207 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.76 vs. limit=15.0 2023-06-20 07:37:29,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=661404.0, ans=0.125 2023-06-20 07:37:30,174 INFO [train.py:996] (0/4) Epoch 4, batch 18750, loss[loss=0.2827, simple_loss=0.3516, pruned_loss=0.1069, over 21598.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3127, pruned_loss=0.09361, over 4245579.38 frames. 
], batch size: 230, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:37:45,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.659e+02 3.125e+02 3.916e+02 7.035e+02, threshold=6.249e+02, percent-clipped=1.0 2023-06-20 07:37:45,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=661404.0, ans=0.0 2023-06-20 07:38:01,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.81 vs. limit=6.0 2023-06-20 07:38:03,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=661464.0, ans=0.125 2023-06-20 07:38:34,327 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.98 vs. limit=6.0 2023-06-20 07:39:06,577 INFO [train.py:996] (0/4) Epoch 4, batch 18800, loss[loss=0.2643, simple_loss=0.3419, pruned_loss=0.09334, over 21685.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3186, pruned_loss=0.0944, over 4248607.75 frames. ], batch size: 441, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:40:03,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-20 07:40:10,165 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-20 07:40:20,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=661884.0, ans=0.125 2023-06-20 07:40:55,491 INFO [train.py:996] (0/4) Epoch 4, batch 18850, loss[loss=0.246, simple_loss=0.3039, pruned_loss=0.09401, over 21816.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3135, pruned_loss=0.0881, over 4240560.86 frames. ], batch size: 102, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:41:00,437 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.538e+02 3.009e+02 3.652e+02 6.341e+02, threshold=6.019e+02, percent-clipped=1.0 2023-06-20 07:41:27,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=662064.0, ans=0.125 2023-06-20 07:41:43,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=662124.0, ans=0.125 2023-06-20 07:41:56,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=662184.0, ans=0.125 2023-06-20 07:42:32,265 INFO [train.py:996] (0/4) Epoch 4, batch 18900, loss[loss=0.2443, simple_loss=0.2921, pruned_loss=0.09827, over 21455.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3099, pruned_loss=0.08828, over 4238711.36 frames. ], batch size: 194, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:43:03,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=662364.0, ans=0.125 2023-06-20 07:44:10,285 INFO [train.py:996] (0/4) Epoch 4, batch 18950, loss[loss=0.2927, simple_loss=0.3906, pruned_loss=0.09738, over 21769.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3127, pruned_loss=0.09119, over 4247949.53 frames. 
], batch size: 415, lr: 7.83e-03, grad_scale: 32.0 2023-06-20 07:44:25,560 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.815e+02 3.147e+02 3.726e+02 6.285e+02, threshold=6.294e+02, percent-clipped=0.0 2023-06-20 07:44:59,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=662724.0, ans=0.1 2023-06-20 07:45:09,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=662784.0, ans=0.125 2023-06-20 07:45:11,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=662784.0, ans=0.125 2023-06-20 07:45:46,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=662844.0, ans=0.125 2023-06-20 07:46:03,993 INFO [train.py:996] (0/4) Epoch 4, batch 19000, loss[loss=0.2928, simple_loss=0.3642, pruned_loss=0.1107, over 21604.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3225, pruned_loss=0.0934, over 4262042.22 frames. ], batch size: 389, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:46:12,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=662904.0, ans=0.0 2023-06-20 07:46:23,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=662964.0, ans=0.0 2023-06-20 07:47:44,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=663144.0, ans=0.125 2023-06-20 07:47:47,407 INFO [train.py:996] (0/4) Epoch 4, batch 19050, loss[loss=0.2634, simple_loss=0.3238, pruned_loss=0.1015, over 21881.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3286, pruned_loss=0.09826, over 4270991.57 frames. ], batch size: 298, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:47:53,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.33 vs. limit=15.0 2023-06-20 07:47:53,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.224e+02 3.773e+02 4.391e+02 1.056e+03, threshold=7.547e+02, percent-clipped=6.0 2023-06-20 07:48:02,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=663264.0, ans=0.125 2023-06-20 07:48:08,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=663264.0, ans=0.2 2023-06-20 07:48:13,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=663264.0, ans=0.1 2023-06-20 07:48:26,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=663324.0, ans=15.0 2023-06-20 07:48:26,942 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.53 vs. 
limit=22.5 2023-06-20 07:48:28,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=663324.0, ans=0.125 2023-06-20 07:48:34,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=663384.0, ans=0.125 2023-06-20 07:48:37,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=663384.0, ans=0.1 2023-06-20 07:49:28,692 INFO [train.py:996] (0/4) Epoch 4, batch 19100, loss[loss=0.2627, simple_loss=0.3215, pruned_loss=0.102, over 20008.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3264, pruned_loss=0.09876, over 4278319.81 frames. ], batch size: 702, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:49:36,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=663504.0, ans=0.1 2023-06-20 07:50:21,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=663684.0, ans=0.125 2023-06-20 07:51:14,836 INFO [train.py:996] (0/4) Epoch 4, batch 19150, loss[loss=0.2773, simple_loss=0.3385, pruned_loss=0.108, over 20898.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3289, pruned_loss=0.09916, over 4270991.37 frames. ], batch size: 608, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:51:16,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=663804.0, ans=0.1 2023-06-20 07:51:21,645 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.031e+02 3.382e+02 4.089e+02 6.377e+02, threshold=6.763e+02, percent-clipped=0.0 2023-06-20 07:51:42,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=663864.0, ans=0.125 2023-06-20 07:51:50,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=663924.0, ans=0.0 2023-06-20 07:52:57,837 INFO [train.py:996] (0/4) Epoch 4, batch 19200, loss[loss=0.2578, simple_loss=0.3526, pruned_loss=0.08151, over 21376.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.339, pruned_loss=0.1003, over 4274170.66 frames. ], batch size: 211, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 07:53:12,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=664104.0, ans=0.0 2023-06-20 07:53:17,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=664164.0, ans=0.0 2023-06-20 07:53:21,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=664164.0, ans=0.125 2023-06-20 07:54:20,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=22.5 2023-06-20 07:54:25,432 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.52 vs. 
limit=22.5 2023-06-20 07:54:26,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=664344.0, ans=0.125 2023-06-20 07:54:37,025 INFO [train.py:996] (0/4) Epoch 4, batch 19250, loss[loss=0.1923, simple_loss=0.2779, pruned_loss=0.05334, over 21483.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.339, pruned_loss=0.09492, over 4272634.03 frames. ], batch size: 211, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:54:44,734 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.666e+02 2.530e+02 3.089e+02 4.105e+02 6.871e+02, threshold=6.178e+02, percent-clipped=1.0 2023-06-20 07:55:01,338 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2023-06-20 07:55:07,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=12.0 2023-06-20 07:55:08,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=664524.0, ans=0.1 2023-06-20 07:55:11,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=664524.0, ans=0.125 2023-06-20 07:56:19,737 INFO [train.py:996] (0/4) Epoch 4, batch 19300, loss[loss=0.2689, simple_loss=0.3315, pruned_loss=0.1032, over 21449.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3356, pruned_loss=0.09366, over 4277689.10 frames. ], batch size: 131, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:56:24,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.18 vs. limit=22.5 2023-06-20 07:56:42,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=664764.0, ans=0.1 2023-06-20 07:58:04,214 INFO [train.py:996] (0/4) Epoch 4, batch 19350, loss[loss=0.2241, simple_loss=0.3009, pruned_loss=0.07367, over 21591.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3279, pruned_loss=0.08843, over 4276592.93 frames. ], batch size: 230, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:58:12,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.563e+02 3.117e+02 3.921e+02 9.439e+02, threshold=6.235e+02, percent-clipped=6.0 2023-06-20 07:59:46,558 INFO [train.py:996] (0/4) Epoch 4, batch 19400, loss[loss=0.303, simple_loss=0.3546, pruned_loss=0.1257, over 21734.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3247, pruned_loss=0.08708, over 4272274.58 frames. ], batch size: 473, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 08:00:08,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=665364.0, ans=0.0 2023-06-20 08:00:47,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=665424.0, ans=0.125 2023-06-20 08:00:48,500 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.86 vs. 
limit=12.0 2023-06-20 08:00:51,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=665484.0, ans=0.2 2023-06-20 08:01:02,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=665484.0, ans=0.125 2023-06-20 08:01:11,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=665544.0, ans=0.125 2023-06-20 08:01:17,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-20 08:01:23,627 INFO [train.py:996] (0/4) Epoch 4, batch 19450, loss[loss=0.2408, simple_loss=0.2909, pruned_loss=0.09538, over 21505.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3238, pruned_loss=0.09072, over 4280846.94 frames. ], batch size: 212, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 08:01:31,163 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.618e+02 3.250e+02 3.891e+02 5.569e+02, threshold=6.499e+02, percent-clipped=0.0 2023-06-20 08:01:48,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=665664.0, ans=0.2 2023-06-20 08:01:50,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=665664.0, ans=0.1 2023-06-20 08:02:42,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=665784.0, ans=0.125 2023-06-20 08:03:06,370 INFO [train.py:996] (0/4) Epoch 4, batch 19500, loss[loss=0.2389, simple_loss=0.3022, pruned_loss=0.08782, over 21793.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3192, pruned_loss=0.09192, over 4285791.87 frames. ], batch size: 352, lr: 7.81e-03, grad_scale: 16.0 2023-06-20 08:03:09,234 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-20 08:04:16,639 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.99 vs. limit=6.0 2023-06-20 08:04:45,115 INFO [train.py:996] (0/4) Epoch 4, batch 19550, loss[loss=0.1977, simple_loss=0.28, pruned_loss=0.05767, over 21381.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3155, pruned_loss=0.09062, over 4275065.04 frames. 
], batch size: 194, lr: 7.81e-03, grad_scale: 16.0 2023-06-20 08:04:53,570 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 2.884e+02 3.306e+02 4.120e+02 1.024e+03, threshold=6.612e+02, percent-clipped=7.0 2023-06-20 08:05:53,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=666324.0, ans=0.125 2023-06-20 08:05:58,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=666384.0, ans=0.125 2023-06-20 08:06:00,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=666384.0, ans=0.125 2023-06-20 08:06:21,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=666444.0, ans=0.2 2023-06-20 08:06:24,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=666444.0, ans=0.125 2023-06-20 08:06:29,094 INFO [train.py:996] (0/4) Epoch 4, batch 19600, loss[loss=0.2621, simple_loss=0.328, pruned_loss=0.09815, over 21870.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3191, pruned_loss=0.09146, over 4273299.13 frames. ], batch size: 371, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:06:55,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=666564.0, ans=0.0 2023-06-20 08:07:26,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=666624.0, ans=0.04949747468305833 2023-06-20 08:07:47,772 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=15.0 2023-06-20 08:08:02,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=666744.0, ans=0.07 2023-06-20 08:08:08,676 INFO [train.py:996] (0/4) Epoch 4, batch 19650, loss[loss=0.2725, simple_loss=0.3333, pruned_loss=0.1059, over 21775.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3244, pruned_loss=0.09575, over 4278466.08 frames. ], batch size: 298, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:08:14,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=666804.0, ans=0.125 2023-06-20 08:08:17,373 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.087e+02 3.531e+02 4.155e+02 7.951e+02, threshold=7.063e+02, percent-clipped=1.0 2023-06-20 08:08:29,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2023-06-20 08:09:02,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=666924.0, ans=0.0 2023-06-20 08:09:14,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=666924.0, ans=0.0 2023-06-20 08:09:33,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=666984.0, ans=0.125 2023-06-20 08:10:00,480 INFO [train.py:996] (0/4) Epoch 4, batch 19700, loss[loss=0.2338, simple_loss=0.3048, pruned_loss=0.08136, over 21494.00 frames. 
], tot_loss[loss=0.2611, simple_loss=0.328, pruned_loss=0.09707, over 4274506.76 frames. ], batch size: 195, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:10:34,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=667164.0, ans=0.125 2023-06-20 08:10:40,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=667164.0, ans=0.125 2023-06-20 08:10:47,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=667224.0, ans=0.2 2023-06-20 08:10:49,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=667224.0, ans=0.1 2023-06-20 08:11:38,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=667344.0, ans=0.1 2023-06-20 08:11:45,289 INFO [train.py:996] (0/4) Epoch 4, batch 19750, loss[loss=0.3447, simple_loss=0.4117, pruned_loss=0.1389, over 21786.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3399, pruned_loss=0.09992, over 4279293.21 frames. ], batch size: 414, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:12:04,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.193e+02 3.734e+02 5.091e+02 8.572e+02, threshold=7.467e+02, percent-clipped=4.0 2023-06-20 08:12:09,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=667404.0, ans=0.125 2023-06-20 08:12:28,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=667524.0, ans=0.125 2023-06-20 08:13:33,738 INFO [train.py:996] (0/4) Epoch 4, batch 19800, loss[loss=0.2284, simple_loss=0.3045, pruned_loss=0.07619, over 21827.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3386, pruned_loss=0.1002, over 4287011.83 frames. ], batch size: 351, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:13:59,519 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:15:08,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=667944.0, ans=0.1 2023-06-20 08:15:08,843 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-20 08:15:22,759 INFO [train.py:996] (0/4) Epoch 4, batch 19850, loss[loss=0.2177, simple_loss=0.2782, pruned_loss=0.07862, over 21785.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3273, pruned_loss=0.09267, over 4289848.99 frames. 
], batch size: 112, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:15:30,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.674e+02 3.222e+02 4.278e+02 7.795e+02, threshold=6.444e+02, percent-clipped=1.0 2023-06-20 08:15:42,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=668064.0, ans=0.1 2023-06-20 08:15:42,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=668064.0, ans=0.125 2023-06-20 08:15:51,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=668064.0, ans=0.05 2023-06-20 08:16:02,861 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:16:37,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=668244.0, ans=0.2 2023-06-20 08:16:47,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=668244.0, ans=0.125 2023-06-20 08:16:59,880 INFO [train.py:996] (0/4) Epoch 4, batch 19900, loss[loss=0.1991, simple_loss=0.2729, pruned_loss=0.06267, over 21218.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3255, pruned_loss=0.08948, over 4292637.32 frames. ], batch size: 548, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:17:40,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=668424.0, ans=0.125 2023-06-20 08:17:41,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=668424.0, ans=0.125 2023-06-20 08:17:58,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=668484.0, ans=0.2 2023-06-20 08:18:33,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=668544.0, ans=0.1 2023-06-20 08:18:47,738 INFO [train.py:996] (0/4) Epoch 4, batch 19950, loss[loss=0.2274, simple_loss=0.2915, pruned_loss=0.08164, over 21604.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3194, pruned_loss=0.08923, over 4281099.54 frames. ], batch size: 332, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:18:56,258 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.748e+02 3.104e+02 3.905e+02 6.692e+02, threshold=6.208e+02, percent-clipped=1.0 2023-06-20 08:19:15,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=668664.0, ans=0.0 2023-06-20 08:19:40,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=668724.0, ans=0.125 2023-06-20 08:20:03,896 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:20:26,109 INFO [train.py:996] (0/4) Epoch 4, batch 20000, loss[loss=0.2334, simple_loss=0.3059, pruned_loss=0.08044, over 21487.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3231, pruned_loss=0.09104, over 4278315.74 frames. 
], batch size: 194, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:21:06,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=669024.0, ans=0.125 2023-06-20 08:22:01,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.61 vs. limit=8.0 2023-06-20 08:22:13,392 INFO [train.py:996] (0/4) Epoch 4, batch 20050, loss[loss=0.2814, simple_loss=0.339, pruned_loss=0.1119, over 21843.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3252, pruned_loss=0.09398, over 4282919.66 frames. ], batch size: 124, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 08:22:21,031 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 2.833e+02 3.313e+02 4.048e+02 6.603e+02, threshold=6.626e+02, percent-clipped=1.0 2023-06-20 08:22:31,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=669264.0, ans=0.2 2023-06-20 08:23:54,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-20 08:23:57,203 INFO [train.py:996] (0/4) Epoch 4, batch 20100, loss[loss=0.2864, simple_loss=0.3399, pruned_loss=0.1165, over 21883.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3281, pruned_loss=0.0967, over 4289714.19 frames. ], batch size: 107, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 08:24:17,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=669564.0, ans=0.125 2023-06-20 08:24:58,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=669624.0, ans=0.015 2023-06-20 08:24:59,088 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.39 vs. limit=5.0 2023-06-20 08:25:02,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=669684.0, ans=0.2 2023-06-20 08:25:42,893 INFO [train.py:996] (0/4) Epoch 4, batch 20150, loss[loss=0.3254, simple_loss=0.3841, pruned_loss=0.1334, over 21738.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3398, pruned_loss=0.1012, over 4290688.45 frames. ], batch size: 332, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:25:53,570 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 3.395e+02 3.883e+02 5.021e+02 8.143e+02, threshold=7.766e+02, percent-clipped=4.0 2023-06-20 08:26:25,702 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-20 08:27:29,957 INFO [train.py:996] (0/4) Epoch 4, batch 20200, loss[loss=0.277, simple_loss=0.3705, pruned_loss=0.09175, over 21834.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3436, pruned_loss=0.1033, over 4269474.08 frames. 
], batch size: 316, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:27:30,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=670104.0, ans=0.1 2023-06-20 08:27:58,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=670164.0, ans=0.125 2023-06-20 08:28:32,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=670224.0, ans=0.125 2023-06-20 08:28:55,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=670344.0, ans=0.125 2023-06-20 08:29:18,111 INFO [train.py:996] (0/4) Epoch 4, batch 20250, loss[loss=0.2597, simple_loss=0.3172, pruned_loss=0.1011, over 21421.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3437, pruned_loss=0.1013, over 4268450.99 frames. ], batch size: 176, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:29:26,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=670404.0, ans=0.1 2023-06-20 08:29:33,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.329e+02 3.002e+02 3.510e+02 4.411e+02 6.052e+02, threshold=7.021e+02, percent-clipped=0.0 2023-06-20 08:29:40,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=670464.0, ans=0.0 2023-06-20 08:30:56,383 INFO [train.py:996] (0/4) Epoch 4, batch 20300, loss[loss=0.215, simple_loss=0.2919, pruned_loss=0.06909, over 21342.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3415, pruned_loss=0.09794, over 4269166.66 frames. ], batch size: 176, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:31:19,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=670764.0, ans=0.0 2023-06-20 08:31:31,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=670764.0, ans=0.125 2023-06-20 08:31:32,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=670764.0, ans=0.1 2023-06-20 08:31:53,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=670824.0, ans=0.125 2023-06-20 08:32:01,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=670884.0, ans=0.0 2023-06-20 08:32:09,225 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=22.5 2023-06-20 08:32:35,123 INFO [train.py:996] (0/4) Epoch 4, batch 20350, loss[loss=0.3391, simple_loss=0.3822, pruned_loss=0.148, over 21545.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3412, pruned_loss=0.09871, over 4264379.25 frames. 
], batch size: 507, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:32:35,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=671004.0, ans=0.07 2023-06-20 08:32:49,949 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.818e+02 3.132e+02 3.924e+02 7.054e+02, threshold=6.264e+02, percent-clipped=1.0 2023-06-20 08:33:06,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=671064.0, ans=0.125 2023-06-20 08:33:23,334 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=22.5 2023-06-20 08:33:33,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=671124.0, ans=0.0 2023-06-20 08:34:21,797 INFO [train.py:996] (0/4) Epoch 4, batch 20400, loss[loss=0.204, simple_loss=0.2733, pruned_loss=0.0673, over 16410.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.345, pruned_loss=0.1025, over 4259953.01 frames. ], batch size: 61, lr: 7.78e-03, grad_scale: 32.0 2023-06-20 08:35:29,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=671484.0, ans=0.2 2023-06-20 08:35:40,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=671544.0, ans=0.2 2023-06-20 08:35:45,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=671544.0, ans=0.0 2023-06-20 08:36:04,146 INFO [train.py:996] (0/4) Epoch 4, batch 20450, loss[loss=0.2944, simple_loss=0.3547, pruned_loss=0.117, over 21374.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.3473, pruned_loss=0.1056, over 4255107.43 frames. ], batch size: 548, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:36:20,593 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.947e+02 3.456e+02 4.255e+02 7.158e+02, threshold=6.912e+02, percent-clipped=2.0 2023-06-20 08:36:24,794 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=15.0 2023-06-20 08:36:34,782 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=12.0 2023-06-20 08:36:35,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=671664.0, ans=0.0 2023-06-20 08:37:05,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=671784.0, ans=0.125 2023-06-20 08:37:12,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=671784.0, ans=0.125 2023-06-20 08:37:23,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=671844.0, ans=0.125 2023-06-20 08:37:26,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=671844.0, ans=0.0 2023-06-20 08:37:44,320 INFO [train.py:996] (0/4) Epoch 4, batch 20500, loss[loss=0.2788, simple_loss=0.3327, pruned_loss=0.1124, over 21841.00 frames. ], tot_loss[loss=0.2763, simple_loss=0.3423, pruned_loss=0.1051, over 4253036.67 frames. 
], batch size: 124, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:38:13,779 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-112000.pt 2023-06-20 08:38:21,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=672024.0, ans=0.0 2023-06-20 08:38:37,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=672024.0, ans=0.0 2023-06-20 08:38:51,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=672084.0, ans=0.125 2023-06-20 08:38:54,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=672084.0, ans=0.0 2023-06-20 08:39:28,517 INFO [train.py:996] (0/4) Epoch 4, batch 20550, loss[loss=0.3442, simple_loss=0.4013, pruned_loss=0.1436, over 21473.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3334, pruned_loss=0.1023, over 4250316.80 frames. ], batch size: 509, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:39:28,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=672204.0, ans=0.1 2023-06-20 08:39:28,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=672204.0, ans=0.125 2023-06-20 08:39:40,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-06-20 08:39:45,987 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.765e+02 3.154e+02 3.666e+02 5.388e+02, threshold=6.309e+02, percent-clipped=0.0 2023-06-20 08:40:00,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=672264.0, ans=0.125 2023-06-20 08:40:18,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=672324.0, ans=0.07 2023-06-20 08:40:59,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=672444.0, ans=0.125 2023-06-20 08:41:08,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=672444.0, ans=0.125 2023-06-20 08:41:12,488 INFO [train.py:996] (0/4) Epoch 4, batch 20600, loss[loss=0.3179, simple_loss=0.3755, pruned_loss=0.1302, over 21875.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3353, pruned_loss=0.1, over 4241555.68 frames. ], batch size: 118, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:41:28,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=672504.0, ans=0.125 2023-06-20 08:42:21,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=672684.0, ans=0.1 2023-06-20 08:42:28,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.96 vs. limit=15.0 2023-06-20 08:42:55,717 INFO [train.py:996] (0/4) Epoch 4, batch 20650, loss[loss=0.2548, simple_loss=0.3176, pruned_loss=0.096, over 21612.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3317, pruned_loss=0.1003, over 4247025.58 frames. 
], batch size: 391, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:42:56,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=672804.0, ans=0.0 2023-06-20 08:43:12,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.784e+02 3.361e+02 3.771e+02 8.301e+02, threshold=6.721e+02, percent-clipped=1.0 2023-06-20 08:43:21,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=672864.0, ans=0.2 2023-06-20 08:43:45,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=672924.0, ans=0.125 2023-06-20 08:43:45,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=672924.0, ans=0.0 2023-06-20 08:43:48,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=672924.0, ans=0.025 2023-06-20 08:44:04,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=672984.0, ans=0.125 2023-06-20 08:44:34,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=673044.0, ans=0.95 2023-06-20 08:44:45,641 INFO [train.py:996] (0/4) Epoch 4, batch 20700, loss[loss=0.2682, simple_loss=0.3237, pruned_loss=0.1063, over 21421.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3239, pruned_loss=0.09618, over 4243785.91 frames. ], batch size: 473, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:44:51,070 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:45:01,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=673164.0, ans=0.5 2023-06-20 08:45:21,538 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-20 08:45:31,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=673224.0, ans=0.1 2023-06-20 08:46:03,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=673284.0, ans=0.125 2023-06-20 08:46:11,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=673344.0, ans=0.0 2023-06-20 08:46:31,835 INFO [train.py:996] (0/4) Epoch 4, batch 20750, loss[loss=0.2916, simple_loss=0.413, pruned_loss=0.08515, over 20782.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3269, pruned_loss=0.09542, over 4248563.91 frames. 
], batch size: 607, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:46:37,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=673404.0, ans=0.125 2023-06-20 08:46:48,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.639e+02 3.184e+02 4.013e+02 6.063e+02, threshold=6.368e+02, percent-clipped=0.0 2023-06-20 08:47:14,631 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:48:01,090 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:48:09,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=673644.0, ans=0.125 2023-06-20 08:48:14,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=673704.0, ans=0.125 2023-06-20 08:48:15,627 INFO [train.py:996] (0/4) Epoch 4, batch 20800, loss[loss=0.2291, simple_loss=0.2906, pruned_loss=0.08386, over 21864.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3298, pruned_loss=0.09657, over 4254938.55 frames. ], batch size: 107, lr: 7.77e-03, grad_scale: 32.0 2023-06-20 08:48:21,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=673704.0, ans=0.5 2023-06-20 08:48:36,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=673764.0, ans=0.125 2023-06-20 08:48:36,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=673764.0, ans=0.2 2023-06-20 08:48:53,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=673824.0, ans=0.1 2023-06-20 08:49:21,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=673884.0, ans=0.125 2023-06-20 08:49:46,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.88 vs. limit=22.5 2023-06-20 08:49:57,088 INFO [train.py:996] (0/4) Epoch 4, batch 20850, loss[loss=0.2753, simple_loss=0.327, pruned_loss=0.1118, over 21313.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3218, pruned_loss=0.09386, over 4260657.48 frames. ], batch size: 143, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:50:06,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-20 08:50:15,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.828e+02 3.362e+02 3.995e+02 7.673e+02, threshold=6.724e+02, percent-clipped=2.0 2023-06-20 08:50:59,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.83 vs. limit=6.0 2023-06-20 08:51:23,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=674244.0, ans=0.1 2023-06-20 08:51:27,979 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.88 vs. 
limit=15.0 2023-06-20 08:51:35,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=674244.0, ans=0.125 2023-06-20 08:51:39,458 INFO [train.py:996] (0/4) Epoch 4, batch 20900, loss[loss=0.2404, simple_loss=0.3165, pruned_loss=0.08212, over 21588.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3223, pruned_loss=0.09517, over 4271274.36 frames. ], batch size: 230, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:51:58,422 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-20 08:52:23,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=674424.0, ans=0.125 2023-06-20 08:53:06,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=674544.0, ans=0.0 2023-06-20 08:53:10,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=674544.0, ans=0.0 2023-06-20 08:53:12,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=674544.0, ans=0.2 2023-06-20 08:53:21,029 INFO [train.py:996] (0/4) Epoch 4, batch 20950, loss[loss=0.243, simple_loss=0.3062, pruned_loss=0.08992, over 21836.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3195, pruned_loss=0.09167, over 4260643.13 frames. ], batch size: 124, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:53:33,587 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.815e+02 3.248e+02 3.950e+02 6.519e+02, threshold=6.496e+02, percent-clipped=0.0 2023-06-20 08:53:34,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=674604.0, ans=0.125 2023-06-20 08:53:47,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=674664.0, ans=0.2 2023-06-20 08:54:11,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-20 08:54:13,603 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2023-06-20 08:54:14,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=674784.0, ans=0.125 2023-06-20 08:54:39,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=674844.0, ans=0.0 2023-06-20 08:54:56,711 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.36 vs. limit=15.0 2023-06-20 08:54:57,051 INFO [train.py:996] (0/4) Epoch 4, batch 21000, loss[loss=0.2472, simple_loss=0.3067, pruned_loss=0.09383, over 21685.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.319, pruned_loss=0.09227, over 4265265.56 frames. ], batch size: 263, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:54:57,066 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 08:55:14,691 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2759, simple_loss=0.3744, pruned_loss=0.08874, over 1796401.00 frames. 
2023-06-20 08:55:14,692 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-20 08:55:21,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=674904.0, ans=0.1 2023-06-20 08:55:40,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=674964.0, ans=0.5 2023-06-20 08:55:54,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=22.5 2023-06-20 08:56:03,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=675024.0, ans=0.125 2023-06-20 08:56:05,588 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-20 08:56:08,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=675084.0, ans=0.125 2023-06-20 08:56:45,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=675144.0, ans=0.125 2023-06-20 08:56:51,325 INFO [train.py:996] (0/4) Epoch 4, batch 21050, loss[loss=0.2248, simple_loss=0.2889, pruned_loss=0.08032, over 21222.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3168, pruned_loss=0.09247, over 4255704.63 frames. ], batch size: 608, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:57:04,086 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.634e+02 3.115e+02 4.220e+02 7.961e+02, threshold=6.229e+02, percent-clipped=3.0 2023-06-20 08:58:02,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=675384.0, ans=0.0 2023-06-20 08:58:11,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=675384.0, ans=0.125 2023-06-20 08:58:24,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=675444.0, ans=0.125 2023-06-20 08:58:33,751 INFO [train.py:996] (0/4) Epoch 4, batch 21100, loss[loss=0.2526, simple_loss=0.3045, pruned_loss=0.1003, over 21469.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3131, pruned_loss=0.09169, over 4265869.43 frames. ], batch size: 132, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:58:52,986 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-06-20 08:59:03,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-20 08:59:12,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=675624.0, ans=0.2 2023-06-20 08:59:55,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=675684.0, ans=0.0 2023-06-20 09:00:09,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=22.5 2023-06-20 09:00:16,563 INFO [train.py:996] (0/4) Epoch 4, batch 21150, loss[loss=0.2605, simple_loss=0.305, pruned_loss=0.108, over 21312.00 frames. 
], tot_loss[loss=0.2484, simple_loss=0.311, pruned_loss=0.09286, over 4257609.11 frames. ], batch size: 473, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 09:00:20,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=675804.0, ans=10.0 2023-06-20 09:00:27,110 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. limit=10.0 2023-06-20 09:00:29,301 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.738e+02 3.201e+02 4.018e+02 7.456e+02, threshold=6.402e+02, percent-clipped=2.0 2023-06-20 09:00:33,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=675864.0, ans=0.0 2023-06-20 09:01:59,523 INFO [train.py:996] (0/4) Epoch 4, batch 21200, loss[loss=0.2395, simple_loss=0.3016, pruned_loss=0.08868, over 21766.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3075, pruned_loss=0.09254, over 4262328.97 frames. ], batch size: 316, lr: 7.76e-03, grad_scale: 32.0 2023-06-20 09:02:07,046 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. limit=6.0 2023-06-20 09:02:22,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=676164.0, ans=0.0 2023-06-20 09:03:29,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=12.0 2023-06-20 09:03:40,023 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:03:44,353 INFO [train.py:996] (0/4) Epoch 4, batch 21250, loss[loss=0.2459, simple_loss=0.3457, pruned_loss=0.07301, over 19708.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3058, pruned_loss=0.09171, over 4263809.54 frames. ], batch size: 702, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:03:47,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=676404.0, ans=0.125 2023-06-20 09:04:02,651 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.827e+02 3.407e+02 4.227e+02 7.586e+02, threshold=6.813e+02, percent-clipped=3.0 2023-06-20 09:04:14,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=676464.0, ans=0.0 2023-06-20 09:04:32,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=676524.0, ans=0.125 2023-06-20 09:04:53,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=676584.0, ans=0.125 2023-06-20 09:05:16,271 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:05:27,312 INFO [train.py:996] (0/4) Epoch 4, batch 21300, loss[loss=0.3158, simple_loss=0.3992, pruned_loss=0.1162, over 20771.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3137, pruned_loss=0.09468, over 4264856.41 frames. 
], batch size: 608, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:06:42,294 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-20 09:07:11,223 INFO [train.py:996] (0/4) Epoch 4, batch 21350, loss[loss=0.2988, simple_loss=0.3512, pruned_loss=0.1232, over 21353.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3177, pruned_loss=0.09546, over 4272554.89 frames. ], batch size: 159, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:07:13,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=677004.0, ans=0.05 2023-06-20 09:07:23,993 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-20 09:07:29,387 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.939e+02 3.451e+02 4.084e+02 6.160e+02, threshold=6.901e+02, percent-clipped=0.0 2023-06-20 09:08:02,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=677124.0, ans=0.0 2023-06-20 09:08:22,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=677184.0, ans=0.0 2023-06-20 09:08:25,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=677184.0, ans=0.125 2023-06-20 09:08:32,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=677184.0, ans=0.125 2023-06-20 09:08:42,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=677244.0, ans=0.125 2023-06-20 09:08:42,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=677244.0, ans=0.0 2023-06-20 09:08:45,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=677244.0, ans=0.125 2023-06-20 09:08:54,435 INFO [train.py:996] (0/4) Epoch 4, batch 21400, loss[loss=0.2302, simple_loss=0.307, pruned_loss=0.07672, over 21424.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3196, pruned_loss=0.09473, over 4277516.11 frames. ], batch size: 211, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:09:22,048 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=22.5 2023-06-20 09:09:39,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=677424.0, ans=0.2 2023-06-20 09:09:55,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=677484.0, ans=0.125 2023-06-20 09:10:15,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=677544.0, ans=0.95 2023-06-20 09:10:31,827 INFO [train.py:996] (0/4) Epoch 4, batch 21450, loss[loss=0.3096, simple_loss=0.3589, pruned_loss=0.1302, over 21607.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3237, pruned_loss=0.09644, over 4286679.29 frames. 
], batch size: 471, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:10:49,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.725e+02 3.459e+02 4.369e+02 7.075e+02, threshold=6.919e+02, percent-clipped=1.0 2023-06-20 09:11:33,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=677784.0, ans=0.0 2023-06-20 09:11:58,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=677844.0, ans=0.0 2023-06-20 09:12:05,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=677844.0, ans=0.125 2023-06-20 09:12:09,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-20 09:12:13,319 INFO [train.py:996] (0/4) Epoch 4, batch 21500, loss[loss=0.2693, simple_loss=0.3352, pruned_loss=0.1017, over 20846.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3231, pruned_loss=0.09776, over 4270326.95 frames. ], batch size: 607, lr: 7.74e-03, grad_scale: 32.0 2023-06-20 09:12:14,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.92 vs. limit=22.5 2023-06-20 09:12:40,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=677964.0, ans=0.0 2023-06-20 09:12:54,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=677964.0, ans=0.125 2023-06-20 09:12:56,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=677964.0, ans=0.1 2023-06-20 09:13:05,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-20 09:13:15,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=678024.0, ans=0.125 2023-06-20 09:13:15,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=678024.0, ans=0.1 2023-06-20 09:13:29,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=678084.0, ans=0.125 2023-06-20 09:13:34,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-20 09:13:46,894 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:13:56,037 INFO [train.py:996] (0/4) Epoch 4, batch 21550, loss[loss=0.2664, simple_loss=0.3076, pruned_loss=0.1126, over 21193.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3156, pruned_loss=0.09478, over 4253249.42 frames. 
], batch size: 176, lr: 7.74e-03, grad_scale: 32.0 2023-06-20 09:14:01,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=678204.0, ans=0.0 2023-06-20 09:14:14,651 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.699e+02 3.223e+02 3.892e+02 7.035e+02, threshold=6.447e+02, percent-clipped=1.0 2023-06-20 09:14:15,852 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-20 09:14:42,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=678264.0, ans=0.125 2023-06-20 09:14:42,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.01 vs. limit=22.5 2023-06-20 09:14:48,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=678324.0, ans=0.5 2023-06-20 09:14:50,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=678324.0, ans=0.125 2023-06-20 09:15:45,561 INFO [train.py:996] (0/4) Epoch 4, batch 21600, loss[loss=0.2588, simple_loss=0.3098, pruned_loss=0.1039, over 21537.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3117, pruned_loss=0.0932, over 4256721.47 frames. ], batch size: 442, lr: 7.74e-03, grad_scale: 32.0 2023-06-20 09:17:30,275 INFO [train.py:996] (0/4) Epoch 4, batch 21650, loss[loss=0.2359, simple_loss=0.3086, pruned_loss=0.08165, over 21810.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.314, pruned_loss=0.09062, over 4250181.19 frames. ], batch size: 102, lr: 7.74e-03, grad_scale: 16.0 2023-06-20 09:17:32,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=678804.0, ans=0.04949747468305833 2023-06-20 09:17:53,981 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.809e+02 3.238e+02 3.645e+02 6.702e+02, threshold=6.475e+02, percent-clipped=1.0 2023-06-20 09:18:05,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=678864.0, ans=0.2 2023-06-20 09:19:05,637 INFO [train.py:996] (0/4) Epoch 4, batch 21700, loss[loss=0.2161, simple_loss=0.3053, pruned_loss=0.06345, over 21749.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3126, pruned_loss=0.0882, over 4262157.96 frames. ], batch size: 298, lr: 7.74e-03, grad_scale: 16.0 2023-06-20 09:19:59,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.10 vs. limit=22.5 2023-06-20 09:20:14,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=679284.0, ans=0.1 2023-06-20 09:20:44,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=679344.0, ans=0.0 2023-06-20 09:20:47,262 INFO [train.py:996] (0/4) Epoch 4, batch 21750, loss[loss=0.2811, simple_loss=0.3227, pruned_loss=0.1197, over 21569.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3094, pruned_loss=0.08825, over 4265161.09 frames. 
], batch size: 415, lr: 7.74e-03, grad_scale: 16.0 2023-06-20 09:21:12,008 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.575e+02 3.289e+02 4.229e+02 7.703e+02, threshold=6.577e+02, percent-clipped=2.0 2023-06-20 09:21:25,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=679464.0, ans=0.0 2023-06-20 09:21:34,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=679524.0, ans=0.125 2023-06-20 09:22:01,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=679584.0, ans=0.1 2023-06-20 09:22:31,499 INFO [train.py:996] (0/4) Epoch 4, batch 21800, loss[loss=0.2393, simple_loss=0.2973, pruned_loss=0.0906, over 21636.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3073, pruned_loss=0.08943, over 4261606.05 frames. ], batch size: 264, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:23:20,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=679824.0, ans=0.125 2023-06-20 09:23:45,913 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-20 09:24:08,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=679944.0, ans=0.0 2023-06-20 09:24:17,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=679944.0, ans=0.2 2023-06-20 09:24:19,766 INFO [train.py:996] (0/4) Epoch 4, batch 21850, loss[loss=0.3513, simple_loss=0.3934, pruned_loss=0.1546, over 21688.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.313, pruned_loss=0.0901, over 4264418.31 frames. ], batch size: 507, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:24:39,344 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 2.758e+02 3.560e+02 4.592e+02 6.859e+02, threshold=7.120e+02, percent-clipped=3.0 2023-06-20 09:25:14,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=680124.0, ans=0.0 2023-06-20 09:25:24,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=680184.0, ans=0.2 2023-06-20 09:26:00,402 INFO [train.py:996] (0/4) Epoch 4, batch 21900, loss[loss=0.2126, simple_loss=0.2964, pruned_loss=0.06435, over 19774.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3145, pruned_loss=0.09116, over 4268547.00 frames. 
], batch size: 702, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:26:41,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=680424.0, ans=0.125 2023-06-20 09:26:48,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=680424.0, ans=0.0 2023-06-20 09:26:59,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=680424.0, ans=0.2 2023-06-20 09:27:09,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=680484.0, ans=0.2 2023-06-20 09:27:39,954 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-20 09:27:42,113 INFO [train.py:996] (0/4) Epoch 4, batch 21950, loss[loss=0.2329, simple_loss=0.3058, pruned_loss=0.08004, over 21526.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3108, pruned_loss=0.09104, over 4268672.88 frames. ], batch size: 441, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:27:46,632 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-20 09:28:06,022 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.677e+02 3.027e+02 3.757e+02 5.142e+02, threshold=6.054e+02, percent-clipped=0.0 2023-06-20 09:28:42,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=680724.0, ans=0.1 2023-06-20 09:28:43,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=680724.0, ans=0.1 2023-06-20 09:29:24,082 INFO [train.py:996] (0/4) Epoch 4, batch 22000, loss[loss=0.2074, simple_loss=0.2722, pruned_loss=0.0713, over 21438.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3029, pruned_loss=0.08619, over 4269373.95 frames. ], batch size: 195, lr: 7.73e-03, grad_scale: 32.0 2023-06-20 09:29:25,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=680904.0, ans=0.0 2023-06-20 09:29:39,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=680904.0, ans=0.0 2023-06-20 09:30:01,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=680964.0, ans=0.125 2023-06-20 09:30:20,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=681024.0, ans=0.125 2023-06-20 09:31:01,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=681144.0, ans=0.0 2023-06-20 09:31:13,130 INFO [train.py:996] (0/4) Epoch 4, batch 22050, loss[loss=0.2864, simple_loss=0.3469, pruned_loss=0.1129, over 21243.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3061, pruned_loss=0.08746, over 4263597.52 frames. 
], batch size: 143, lr: 7.73e-03, grad_scale: 32.0 2023-06-20 09:31:33,509 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.650e+02 3.249e+02 4.022e+02 7.710e+02, threshold=6.498e+02, percent-clipped=6.0 2023-06-20 09:31:57,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=681324.0, ans=0.125 2023-06-20 09:32:18,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=681384.0, ans=0.125 2023-06-20 09:32:55,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.11 vs. limit=15.0 2023-06-20 09:32:57,315 INFO [train.py:996] (0/4) Epoch 4, batch 22100, loss[loss=0.2923, simple_loss=0.3423, pruned_loss=0.1211, over 21844.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3198, pruned_loss=0.09406, over 4267623.67 frames. ], batch size: 282, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:32:57,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=681504.0, ans=0.125 2023-06-20 09:33:22,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=681564.0, ans=0.125 2023-06-20 09:34:38,846 INFO [train.py:996] (0/4) Epoch 4, batch 22150, loss[loss=0.2824, simple_loss=0.3435, pruned_loss=0.1106, over 20882.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3231, pruned_loss=0.09669, over 4279203.60 frames. ], batch size: 608, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:34:57,912 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 3.000e+02 3.535e+02 4.245e+02 7.467e+02, threshold=7.071e+02, percent-clipped=4.0 2023-06-20 09:35:20,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=681924.0, ans=0.09899494936611666 2023-06-20 09:35:30,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=681924.0, ans=0.125 2023-06-20 09:35:37,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=681984.0, ans=0.125 2023-06-20 09:36:01,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-20 09:36:21,226 INFO [train.py:996] (0/4) Epoch 4, batch 22200, loss[loss=0.2486, simple_loss=0.3297, pruned_loss=0.08373, over 21254.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3251, pruned_loss=0.09815, over 4279028.41 frames. 
], batch size: 159, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:36:23,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=682104.0, ans=0.125 2023-06-20 09:37:16,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=682224.0, ans=0.125 2023-06-20 09:37:45,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=682284.0, ans=0.1 2023-06-20 09:37:51,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=682344.0, ans=0.125 2023-06-20 09:38:08,609 INFO [train.py:996] (0/4) Epoch 4, batch 22250, loss[loss=0.3192, simple_loss=0.3803, pruned_loss=0.129, over 21330.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3329, pruned_loss=0.09966, over 4286496.43 frames. ], batch size: 143, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:38:23,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.823e+02 3.649e+02 4.510e+02 8.047e+02, threshold=7.298e+02, percent-clipped=1.0 2023-06-20 09:38:40,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=682464.0, ans=0.125 2023-06-20 09:38:43,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=682464.0, ans=0.0 2023-06-20 09:38:50,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=682524.0, ans=0.125 2023-06-20 09:38:55,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=682524.0, ans=0.0 2023-06-20 09:39:49,762 INFO [train.py:996] (0/4) Epoch 4, batch 22300, loss[loss=0.3131, simple_loss=0.3783, pruned_loss=0.124, over 20757.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3358, pruned_loss=0.1025, over 4284973.86 frames. ], batch size: 607, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:40:04,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=682764.0, ans=0.125 2023-06-20 09:40:07,280 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0 2023-06-20 09:40:10,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=682764.0, ans=0.0 2023-06-20 09:40:44,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=682884.0, ans=0.025 2023-06-20 09:41:07,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=682884.0, ans=0.125 2023-06-20 09:41:31,572 INFO [train.py:996] (0/4) Epoch 4, batch 22350, loss[loss=0.2843, simple_loss=0.3285, pruned_loss=0.1201, over 21310.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3344, pruned_loss=0.103, over 4287196.48 frames. 
], batch size: 143, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:41:32,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=683004.0, ans=10.0 2023-06-20 09:41:46,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.465e+02 3.048e+02 3.517e+02 4.689e+02 8.292e+02, threshold=7.034e+02, percent-clipped=3.0 2023-06-20 09:41:47,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=683064.0, ans=0.0 2023-06-20 09:42:12,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=683124.0, ans=0.1 2023-06-20 09:43:15,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=15.0 2023-06-20 09:43:16,286 INFO [train.py:996] (0/4) Epoch 4, batch 22400, loss[loss=0.2641, simple_loss=0.3415, pruned_loss=0.09338, over 21192.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3304, pruned_loss=0.09933, over 4289187.28 frames. ], batch size: 548, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:43:33,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=683304.0, ans=0.0 2023-06-20 09:43:36,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=683364.0, ans=0.5 2023-06-20 09:43:41,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=683364.0, ans=0.1 2023-06-20 09:43:50,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=683364.0, ans=0.0 2023-06-20 09:43:56,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=683424.0, ans=0.125 2023-06-20 09:44:45,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=683544.0, ans=0.0 2023-06-20 09:44:58,375 INFO [train.py:996] (0/4) Epoch 4, batch 22450, loss[loss=0.2375, simple_loss=0.2899, pruned_loss=0.09258, over 21496.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3234, pruned_loss=0.09779, over 4270346.84 frames. ], batch size: 230, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:44:58,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=683604.0, ans=0.95 2023-06-20 09:45:18,330 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.635e+02 3.000e+02 3.519e+02 5.856e+02, threshold=6.001e+02, percent-clipped=0.0 2023-06-20 09:45:27,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=683664.0, ans=0.125 2023-06-20 09:45:30,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=683664.0, ans=0.0 2023-06-20 09:46:04,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=683784.0, ans=0.2 2023-06-20 09:46:23,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.75 vs. 
limit=15.0 2023-06-20 09:46:26,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=683844.0, ans=0.0 2023-06-20 09:46:48,155 INFO [train.py:996] (0/4) Epoch 4, batch 22500, loss[loss=0.2764, simple_loss=0.3453, pruned_loss=0.1037, over 21486.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3202, pruned_loss=0.09749, over 4277730.60 frames. ], batch size: 230, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:47:05,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=683964.0, ans=0.1 2023-06-20 09:47:31,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=684024.0, ans=0.0 2023-06-20 09:47:33,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=684024.0, ans=0.1 2023-06-20 09:48:12,286 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-20 09:48:30,864 INFO [train.py:996] (0/4) Epoch 4, batch 22550, loss[loss=0.2782, simple_loss=0.3426, pruned_loss=0.1069, over 21752.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3239, pruned_loss=0.09761, over 4274668.06 frames. ], batch size: 441, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:48:34,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=684204.0, ans=0.125 2023-06-20 09:48:45,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.179e+02 2.905e+02 3.340e+02 4.222e+02 9.344e+02, threshold=6.680e+02, percent-clipped=7.0 2023-06-20 09:48:51,853 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-06-20 09:49:01,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=22.5 2023-06-20 09:49:04,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=684264.0, ans=0.0 2023-06-20 09:49:18,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=684324.0, ans=0.125 2023-06-20 09:49:36,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=684384.0, ans=0.2 2023-06-20 09:49:51,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=684444.0, ans=0.0 2023-06-20 09:50:06,145 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.40 vs. limit=12.0 2023-06-20 09:50:11,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=684444.0, ans=0.2 2023-06-20 09:50:14,906 INFO [train.py:996] (0/4) Epoch 4, batch 22600, loss[loss=0.1821, simple_loss=0.2406, pruned_loss=0.06177, over 21222.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3269, pruned_loss=0.09766, over 4279142.69 frames. 
], batch size: 143, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:50:25,103 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:51:43,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=684744.0, ans=0.2 2023-06-20 09:51:57,257 INFO [train.py:996] (0/4) Epoch 4, batch 22650, loss[loss=0.2581, simple_loss=0.3104, pruned_loss=0.1029, over 21738.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3246, pruned_loss=0.09726, over 4277672.99 frames. ], batch size: 112, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:52:04,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=684804.0, ans=0.0 2023-06-20 09:52:12,047 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.311e+02 3.814e+02 4.832e+02 8.626e+02, threshold=7.628e+02, percent-clipped=4.0 2023-06-20 09:52:49,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=684924.0, ans=0.1 2023-06-20 09:52:59,266 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=22.5 2023-06-20 09:53:02,950 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-20 09:53:24,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=685044.0, ans=0.0 2023-06-20 09:53:40,945 INFO [train.py:996] (0/4) Epoch 4, batch 22700, loss[loss=0.2328, simple_loss=0.3033, pruned_loss=0.08115, over 21819.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3181, pruned_loss=0.09627, over 4277301.76 frames. ], batch size: 102, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:53:54,558 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:54:01,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=685164.0, ans=0.0 2023-06-20 09:54:11,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=685164.0, ans=0.125 2023-06-20 09:54:41,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=685224.0, ans=0.2 2023-06-20 09:55:23,797 INFO [train.py:996] (0/4) Epoch 4, batch 22750, loss[loss=0.269, simple_loss=0.3343, pruned_loss=0.1019, over 21762.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3198, pruned_loss=0.09926, over 4277733.17 frames. 
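The recurring [scaling.py:182] ScheduledFloat entries show that many regularization constants (dropout probabilities, balancer probabilities, skip rates, bypass scale bounds) are not fixed hyperparameters: the logged ans is a function of the global batch_count. A rough sketch of such a schedule follows, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the class name and the example breakpoints are illustrative and not the actual scaling.py implementation.

from bisect import bisect_right

class ScheduledFloatSketch:
    """Illustrative sketch of a float whose value ('ans' in the log) depends
    on the global batch count, interpolated between fixed breakpoints."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs, e.g. (0.0, 0.3), (20000.0, 0.1)
        self.points = sorted(points)

    def value(self, batch_count):
        xs = [x for x, _ in self.points]
        i = bisect_right(xs, batch_count)
        if i == 0:
            return self.points[0][1]
        if i == len(self.points):
            return self.points[-1][1]
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# e.g. a dropout probability that decays towards 0.1 as training progresses
dropout_p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
print(dropout_p.value(681504.0))   # -> 0.1 this late in training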
], batch size: 332, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:55:43,089 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.761e+02 3.069e+02 3.774e+02 7.547e+02, threshold=6.137e+02, percent-clipped=0.0 2023-06-20 09:55:51,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=685464.0, ans=0.125 2023-06-20 09:56:04,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=685464.0, ans=0.125 2023-06-20 09:56:05,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=685524.0, ans=0.0 2023-06-20 09:56:22,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=685524.0, ans=0.0 2023-06-20 09:56:53,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-20 09:57:05,741 INFO [train.py:996] (0/4) Epoch 4, batch 22800, loss[loss=0.238, simple_loss=0.3131, pruned_loss=0.08148, over 21458.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3248, pruned_loss=0.1014, over 4271121.16 frames. ], batch size: 131, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:58:00,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=685824.0, ans=0.0 2023-06-20 09:58:47,363 INFO [train.py:996] (0/4) Epoch 4, batch 22850, loss[loss=0.2814, simple_loss=0.3307, pruned_loss=0.1161, over 21581.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3212, pruned_loss=0.1009, over 4280919.47 frames. ], batch size: 263, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:58:50,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=686004.0, ans=0.125 2023-06-20 09:58:54,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=686004.0, ans=0.0 2023-06-20 09:59:07,744 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.174e+02 3.905e+02 4.697e+02 7.560e+02, threshold=7.810e+02, percent-clipped=8.0 2023-06-20 09:59:51,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=686184.0, ans=0.09899494936611666 2023-06-20 09:59:58,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=686184.0, ans=0.0 2023-06-20 10:00:31,666 INFO [train.py:996] (0/4) Epoch 4, batch 22900, loss[loss=0.2751, simple_loss=0.375, pruned_loss=0.08758, over 21676.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3219, pruned_loss=0.1, over 4274977.75 frames. ], batch size: 389, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 10:01:01,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=686364.0, ans=0.125 2023-06-20 10:01:03,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=686364.0, ans=0.0 2023-06-20 10:01:19,169 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=22.07 vs. 
limit=15.0 2023-06-20 10:01:45,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-06-20 10:02:22,324 INFO [train.py:996] (0/4) Epoch 4, batch 22950, loss[loss=0.2754, simple_loss=0.3917, pruned_loss=0.07952, over 21737.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3351, pruned_loss=0.09808, over 4271338.19 frames. ], batch size: 332, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 10:02:41,646 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.062e+02 3.428e+02 4.438e+02 8.217e+02, threshold=6.855e+02, percent-clipped=1.0 2023-06-20 10:04:09,124 INFO [train.py:996] (0/4) Epoch 4, batch 23000, loss[loss=0.2868, simple_loss=0.3668, pruned_loss=0.1034, over 19942.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3328, pruned_loss=0.09425, over 4270552.31 frames. ], batch size: 703, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:04:46,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=687024.0, ans=0.125 2023-06-20 10:04:46,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=687024.0, ans=0.0 2023-06-20 10:04:53,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=687024.0, ans=0.125 2023-06-20 10:04:58,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=687024.0, ans=0.09899494936611666 2023-06-20 10:05:00,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=687024.0, ans=0.1 2023-06-20 10:05:18,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=687084.0, ans=0.125 2023-06-20 10:05:27,227 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.743e-03 2023-06-20 10:05:30,526 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2023-06-20 10:05:52,794 INFO [train.py:996] (0/4) Epoch 4, batch 23050, loss[loss=0.2607, simple_loss=0.3274, pruned_loss=0.09697, over 21806.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3344, pruned_loss=0.0967, over 4273732.44 frames. ], batch size: 282, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:06:05,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. limit=10.0 2023-06-20 10:06:12,283 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.843e+02 3.308e+02 4.395e+02 9.677e+02, threshold=6.616e+02, percent-clipped=9.0 2023-06-20 10:07:18,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-20 10:07:35,956 INFO [train.py:996] (0/4) Epoch 4, batch 23100, loss[loss=0.2092, simple_loss=0.2696, pruned_loss=0.07446, over 21197.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3306, pruned_loss=0.0974, over 4271862.98 frames. 
], batch size: 549, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:07:48,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=687504.0, ans=0.1 2023-06-20 10:08:21,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=687624.0, ans=0.125 2023-06-20 10:08:55,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=687744.0, ans=0.125 2023-06-20 10:09:17,235 INFO [train.py:996] (0/4) Epoch 4, batch 23150, loss[loss=0.2798, simple_loss=0.3351, pruned_loss=0.1123, over 21284.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3248, pruned_loss=0.09665, over 4275822.00 frames. ], batch size: 143, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:09:18,152 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=22.5 2023-06-20 10:09:24,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=687804.0, ans=0.0 2023-06-20 10:09:36,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.820e+02 3.256e+02 3.906e+02 5.764e+02, threshold=6.513e+02, percent-clipped=0.0 2023-06-20 10:09:37,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=687864.0, ans=0.04949747468305833 2023-06-20 10:09:43,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=687864.0, ans=0.125 2023-06-20 10:10:53,467 INFO [train.py:996] (0/4) Epoch 4, batch 23200, loss[loss=0.247, simple_loss=0.3085, pruned_loss=0.09281, over 21869.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3229, pruned_loss=0.09709, over 4281933.45 frames. ], batch size: 247, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:10:57,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=688104.0, ans=0.02 2023-06-20 10:11:15,698 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=22.5 2023-06-20 10:11:18,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=688164.0, ans=0.07 2023-06-20 10:11:26,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=688164.0, ans=0.0 2023-06-20 10:11:40,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=688224.0, ans=0.0 2023-06-20 10:12:06,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=688284.0, ans=0.125 2023-06-20 10:12:36,046 INFO [train.py:996] (0/4) Epoch 4, batch 23250, loss[loss=0.2403, simple_loss=0.3051, pruned_loss=0.08773, over 21960.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3243, pruned_loss=0.09933, over 4291153.22 frames. 
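The [scaling.py:962] Whitening entries compare a per-module metric against a limit (for example metric=11.03 vs. limit=22.5 above); intuitively, the metric grows as a module's channel activations drift away from a white (isotropic) covariance, and the module only intervenes once the limit is exceeded. One plausible way to compute such a metric from the spread of covariance eigenvalues is sketched below; the actual definition used in scaling.py may differ.

import torch

def whitening_metric(x, num_groups=1):
    """Illustrative only: measure how far the channel covariance of x
    (shape [..., num_channels]) is from a multiple of the identity.
    Equals 1.0 for perfectly white activations and grows with the
    eigenvalue spread.  Not the actual scaling.py metric."""
    num_channels = x.shape[-1]
    x = x.reshape(-1, num_groups, num_channels // num_groups)
    metrics = []
    for g in range(num_groups):
        feats = x[:, g, :]
        feats = feats - feats.mean(dim=0, keepdim=True)
        cov = feats.t() @ feats / feats.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        # ratio of mean squared eigenvalue to squared mean eigenvalue (>= 1)
        metrics.append((eigs.pow(2).mean() / eigs.mean().pow(2)).item())
    return max(metrics)

print(whitening_metric(torch.randn(1000, 256)))   # small metric for near-white input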
], batch size: 316, lr: 7.69e-03, grad_scale: 16.0 2023-06-20 10:12:38,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=688404.0, ans=0.125 2023-06-20 10:12:57,020 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.915e+02 3.512e+02 4.464e+02 9.491e+02, threshold=7.024e+02, percent-clipped=1.0 2023-06-20 10:13:06,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=688464.0, ans=0.125 2023-06-20 10:13:38,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=688584.0, ans=0.2 2023-06-20 10:13:43,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=688584.0, ans=0.0 2023-06-20 10:14:17,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=688704.0, ans=0.0 2023-06-20 10:14:18,814 INFO [train.py:996] (0/4) Epoch 4, batch 23300, loss[loss=0.3385, simple_loss=0.4319, pruned_loss=0.1225, over 21764.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3343, pruned_loss=0.1022, over 4289273.77 frames. ], batch size: 351, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:14:47,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=688764.0, ans=0.0 2023-06-20 10:14:52,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=688764.0, ans=0.0 2023-06-20 10:14:55,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=688764.0, ans=0.125 2023-06-20 10:15:27,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=688884.0, ans=0.125 2023-06-20 10:15:58,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=688944.0, ans=0.04949747468305833 2023-06-20 10:15:59,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=688944.0, ans=0.125 2023-06-20 10:16:05,792 INFO [train.py:996] (0/4) Epoch 4, batch 23350, loss[loss=0.2241, simple_loss=0.3156, pruned_loss=0.06628, over 20760.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3387, pruned_loss=0.1014, over 4275124.24 frames. ], batch size: 607, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:16:28,783 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.739e+02 3.338e+02 4.222e+02 6.703e+02, threshold=6.676e+02, percent-clipped=0.0 2023-06-20 10:16:50,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=689124.0, ans=0.035 2023-06-20 10:17:15,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=689184.0, ans=0.0 2023-06-20 10:17:37,295 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.44 vs. limit=22.5 2023-06-20 10:17:42,582 INFO [train.py:996] (0/4) Epoch 4, batch 23400, loss[loss=0.1826, simple_loss=0.2609, pruned_loss=0.05215, over 21405.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3311, pruned_loss=0.09663, over 4277426.87 frames. 
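The grad_scale column drops from 32.0 to 16.0 and then to 8.0 around batches 23250 to 23350 before later recovering. With mixed-precision (fp16) training this is the signature of a dynamic loss scaler, which halves the scale after steps whose gradients overflow and grows it again after a run of stable steps. A minimal sketch using PyTorch's torch.cuda.amp utilities follows; compute_loss is a placeholder, and a full training loop would also handle DDP, the learning-rate schedule, and the gradient clipping shown earlier.

import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()   # its current scale plays the role of the logged grad_scale

def train_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad()
    with autocast():                      # run the forward pass in fp16
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()         # backward on the scaled loss
    scaler.step(optimizer)                # skips the step if inf/nan gradients were found
    scaler.update()                       # halves the scale on overflow, grows it when stable
    return loss.detach(), scaler.get_scale()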
], batch size: 211, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:18:15,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=689364.0, ans=0.2 2023-06-20 10:18:46,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=689484.0, ans=0.125 2023-06-20 10:19:30,580 INFO [train.py:996] (0/4) Epoch 4, batch 23450, loss[loss=0.2708, simple_loss=0.3335, pruned_loss=0.1041, over 21513.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3308, pruned_loss=0.09846, over 4271196.70 frames. ], batch size: 194, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:19:48,769 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 2.746e+02 3.178e+02 4.019e+02 6.793e+02, threshold=6.356e+02, percent-clipped=1.0 2023-06-20 10:20:14,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=689724.0, ans=0.125 2023-06-20 10:20:20,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=689724.0, ans=0.125 2023-06-20 10:21:08,142 INFO [train.py:996] (0/4) Epoch 4, batch 23500, loss[loss=0.2381, simple_loss=0.305, pruned_loss=0.08553, over 21886.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3317, pruned_loss=0.1005, over 4274258.14 frames. ], batch size: 332, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:21:15,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=689904.0, ans=0.125 2023-06-20 10:21:16,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=689904.0, ans=22.5 2023-06-20 10:21:37,039 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.64 vs. limit=6.0 2023-06-20 10:21:49,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-20 10:22:32,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=690144.0, ans=0.025 2023-06-20 10:22:51,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=690204.0, ans=0.125 2023-06-20 10:22:52,373 INFO [train.py:996] (0/4) Epoch 4, batch 23550, loss[loss=0.2342, simple_loss=0.2872, pruned_loss=0.09065, over 21273.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3258, pruned_loss=0.09968, over 4273339.15 frames. ], batch size: 159, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:23:03,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2023-06-20 10:23:10,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 2.904e+02 3.219e+02 3.862e+02 7.198e+02, threshold=6.438e+02, percent-clipped=1.0 2023-06-20 10:24:05,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=690384.0, ans=0.2 2023-06-20 10:24:30,335 INFO [train.py:996] (0/4) Epoch 4, batch 23600, loss[loss=0.2893, simple_loss=0.3503, pruned_loss=0.1142, over 21802.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3274, pruned_loss=0.1, over 4265407.90 frames. 
], batch size: 282, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:25:32,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2023-06-20 10:25:33,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=22.5 2023-06-20 10:25:47,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=690684.0, ans=0.2 2023-06-20 10:25:48,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=690684.0, ans=0.0 2023-06-20 10:26:10,349 INFO [train.py:996] (0/4) Epoch 4, batch 23650, loss[loss=0.2456, simple_loss=0.2777, pruned_loss=0.1068, over 20068.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.327, pruned_loss=0.09772, over 4262884.45 frames. ], batch size: 704, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:26:38,380 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 3.037e+02 3.480e+02 4.305e+02 8.157e+02, threshold=6.960e+02, percent-clipped=7.0 2023-06-20 10:27:20,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=690984.0, ans=0.125 2023-06-20 10:27:43,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=691044.0, ans=0.125 2023-06-20 10:27:53,258 INFO [train.py:996] (0/4) Epoch 4, batch 23700, loss[loss=0.2374, simple_loss=0.3146, pruned_loss=0.08015, over 21761.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3303, pruned_loss=0.09696, over 4266628.70 frames. ], batch size: 332, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:28:13,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=691104.0, ans=0.125 2023-06-20 10:28:46,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.61 vs. limit=10.0 2023-06-20 10:29:06,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=691284.0, ans=0.2 2023-06-20 10:29:20,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=15.0 2023-06-20 10:29:48,045 INFO [train.py:996] (0/4) Epoch 4, batch 23750, loss[loss=0.2372, simple_loss=0.3327, pruned_loss=0.07088, over 21648.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.333, pruned_loss=0.09789, over 4269402.54 frames. ], batch size: 389, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:30:05,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.03 vs. 
limit=15.0 2023-06-20 10:30:11,539 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.867e+02 3.217e+02 4.313e+02 7.122e+02, threshold=6.434e+02, percent-clipped=1.0 2023-06-20 10:30:13,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=691464.0, ans=0.0 2023-06-20 10:30:18,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=691464.0, ans=0.0 2023-06-20 10:30:35,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.05 vs. limit=15.0 2023-06-20 10:30:35,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=691524.0, ans=0.5 2023-06-20 10:30:44,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=691524.0, ans=0.1 2023-06-20 10:30:53,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=691584.0, ans=0.0 2023-06-20 10:30:54,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=691584.0, ans=0.125 2023-06-20 10:31:04,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.18 vs. limit=15.0 2023-06-20 10:31:37,966 INFO [train.py:996] (0/4) Epoch 4, batch 23800, loss[loss=0.2726, simple_loss=0.3905, pruned_loss=0.07731, over 20779.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3306, pruned_loss=0.09516, over 4268267.25 frames. ], batch size: 607, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:31:42,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=691704.0, ans=0.5 2023-06-20 10:32:03,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=691764.0, ans=0.125 2023-06-20 10:32:12,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=691764.0, ans=0.2 2023-06-20 10:32:37,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.28 vs. limit=22.5 2023-06-20 10:33:12,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=691944.0, ans=0.0 2023-06-20 10:33:23,377 INFO [train.py:996] (0/4) Epoch 4, batch 23850, loss[loss=0.2596, simple_loss=0.3303, pruned_loss=0.09443, over 21338.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3408, pruned_loss=0.09873, over 4268874.94 frames. 
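Many of the ScheduledFloat entries configure balancer constraints: a prob of applying the constraint plus bounds such as min_positive/max_positive on the fraction of positive activations and min_abs/max_abs on the mean absolute value (min_abs=0.5 in the entries above, with the min_positive/max_positive and max_abs bounds appearing elsewhere in the log). The sketch below only computes those per-channel statistics as a diagnostic; the real balancer in scaling.py presumably enforces the bounds during training rather than returning a report like this.

import torch

def balancer_stats(x, min_positive=0.05, max_positive=0.95,
                   min_abs=0.2, max_abs=10.0):
    """Diagnostic sketch for the balancer bounds named in the log: per output
    channel, compare the fraction of positive activations and the mean
    |activation| against the configured limits."""
    x = x.reshape(-1, x.shape[-1])                   # (frames, channels)
    frac_positive = (x > 0).float().mean(dim=0)
    mean_abs = x.abs().mean(dim=0)
    return {
        "too_few_positive": (frac_positive < min_positive).sum().item(),
        "too_many_positive": (frac_positive > max_positive).sum().item(),
        "too_small_abs": (mean_abs < min_abs).sum().item(),
        "too_large_abs": (mean_abs > max_abs).sum().item(),
    }

print(balancer_stats(torch.randn(200, 50, 256)))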
], batch size: 176, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:33:25,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=692004.0, ans=0.125 2023-06-20 10:33:27,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=692004.0, ans=0.125 2023-06-20 10:33:38,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=692004.0, ans=0.1 2023-06-20 10:33:47,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=692064.0, ans=0.125 2023-06-20 10:33:48,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 3.013e+02 3.868e+02 5.325e+02 1.077e+03, threshold=7.737e+02, percent-clipped=14.0 2023-06-20 10:34:07,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=692124.0, ans=0.0 2023-06-20 10:35:13,548 INFO [train.py:996] (0/4) Epoch 4, batch 23900, loss[loss=0.2492, simple_loss=0.3262, pruned_loss=0.08607, over 21818.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3491, pruned_loss=0.1015, over 4276262.08 frames. ], batch size: 124, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:36:08,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=692484.0, ans=0.0 2023-06-20 10:36:39,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=692544.0, ans=0.125 2023-06-20 10:36:52,322 INFO [train.py:996] (0/4) Epoch 4, batch 23950, loss[loss=0.2734, simple_loss=0.3328, pruned_loss=0.107, over 21298.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3415, pruned_loss=0.1009, over 4255124.20 frames. ], batch size: 176, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:37:07,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=692664.0, ans=0.1 2023-06-20 10:37:10,673 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.929e+02 3.410e+02 4.340e+02 7.845e+02, threshold=6.819e+02, percent-clipped=1.0 2023-06-20 10:37:35,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=692724.0, ans=0.125 2023-06-20 10:37:59,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=692784.0, ans=0.0 2023-06-20 10:38:10,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=692784.0, ans=0.125 2023-06-20 10:38:11,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=692784.0, ans=0.0 2023-06-20 10:38:27,338 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-20 10:38:36,300 INFO [train.py:996] (0/4) Epoch 4, batch 24000, loss[loss=0.2659, simple_loss=0.3374, pruned_loss=0.09723, over 20652.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3422, pruned_loss=0.1039, over 4255417.75 frames. 
], batch size: 607, lr: 7.66e-03, grad_scale: 32.0 2023-06-20 10:38:36,303 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 10:38:53,763 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2722, simple_loss=0.3716, pruned_loss=0.08645, over 1796401.00 frames. 2023-06-20 10:38:53,764 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-20 10:38:59,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=692904.0, ans=0.125 2023-06-20 10:39:44,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=693024.0, ans=0.2 2023-06-20 10:39:44,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=693024.0, ans=0.125 2023-06-20 10:39:56,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=693024.0, ans=0.04949747468305833 2023-06-20 10:40:30,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=693144.0, ans=0.1 2023-06-20 10:40:37,778 INFO [train.py:996] (0/4) Epoch 4, batch 24050, loss[loss=0.2216, simple_loss=0.3082, pruned_loss=0.06753, over 21647.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3454, pruned_loss=0.1054, over 4253480.31 frames. ], batch size: 263, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:41:00,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=693264.0, ans=0.0 2023-06-20 10:41:03,500 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.885e+02 3.459e+02 4.129e+02 6.625e+02, threshold=6.917e+02, percent-clipped=0.0 2023-06-20 10:41:26,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=693324.0, ans=0.125 2023-06-20 10:41:26,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=693324.0, ans=0.0 2023-06-20 10:41:31,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=693324.0, ans=0.0 2023-06-20 10:41:31,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=693324.0, ans=0.5 2023-06-20 10:41:35,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=693324.0, ans=0.125 2023-06-20 10:41:50,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-06-20 10:41:55,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=693384.0, ans=0.1 2023-06-20 10:42:21,509 INFO [train.py:996] (0/4) Epoch 4, batch 24100, loss[loss=0.3167, simple_loss=0.3828, pruned_loss=0.1253, over 21868.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3457, pruned_loss=0.1034, over 4263607.90 frames. 
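At batch 24000 the log pauses training to compute the dev-set loss (validation: loss=0.2722 over 1796401.00 frames) and reports the peak GPU memory before resuming. A minimal sketch of such a validation pass using standard PyTorch calls is given below; compute_loss, valid_dl and their signatures are placeholders rather than the actual icefall functions.

import torch

def compute_validation_loss(model, valid_dl, compute_loss, device="cuda:0"):
    """Sketch of the periodic validation pass logged every valid_interval
    batches: average the loss over the whole dev set without updating the
    model, then report peak GPU memory."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch)   # assumed signature
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    max_mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames; "
          f"Maximum memory allocated so far is {max_mem_mb}MB")
    return tot_loss / tot_frames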
], batch size: 371, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:42:23,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=693504.0, ans=0.125 2023-06-20 10:42:51,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=693564.0, ans=0.125 2023-06-20 10:43:29,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=693684.0, ans=0.125 2023-06-20 10:43:29,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=693684.0, ans=0.125 2023-06-20 10:43:50,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=693744.0, ans=0.125 2023-06-20 10:44:03,704 INFO [train.py:996] (0/4) Epoch 4, batch 24150, loss[loss=0.3118, simple_loss=0.3691, pruned_loss=0.1272, over 21854.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3452, pruned_loss=0.105, over 4269505.02 frames. ], batch size: 124, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:44:31,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=693864.0, ans=0.025 2023-06-20 10:44:38,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 3.099e+02 3.647e+02 4.955e+02 8.844e+02, threshold=7.295e+02, percent-clipped=4.0 2023-06-20 10:44:57,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=693924.0, ans=0.125 2023-06-20 10:45:13,311 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-20 10:45:34,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=694044.0, ans=0.0 2023-06-20 10:45:52,582 INFO [train.py:996] (0/4) Epoch 4, batch 24200, loss[loss=0.3161, simple_loss=0.3932, pruned_loss=0.1195, over 21665.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.3463, pruned_loss=0.106, over 4275518.03 frames. ], batch size: 414, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:46:25,856 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.69 vs. limit=22.5 2023-06-20 10:47:02,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=694284.0, ans=0.1 2023-06-20 10:47:07,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=694284.0, ans=0.0 2023-06-20 10:47:46,471 INFO [train.py:996] (0/4) Epoch 4, batch 24250, loss[loss=0.219, simple_loss=0.2995, pruned_loss=0.06921, over 21178.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3427, pruned_loss=0.09846, over 4268842.64 frames. 
], batch size: 159, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:48:11,647 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.800e+02 3.363e+02 4.220e+02 7.304e+02, threshold=6.726e+02, percent-clipped=1.0 2023-06-20 10:48:49,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=694584.0, ans=0.0 2023-06-20 10:48:49,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=694584.0, ans=0.125 2023-06-20 10:48:53,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=694584.0, ans=0.125 2023-06-20 10:49:22,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=694644.0, ans=0.0 2023-06-20 10:49:28,815 INFO [train.py:996] (0/4) Epoch 4, batch 24300, loss[loss=0.2414, simple_loss=0.3149, pruned_loss=0.08399, over 21672.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3327, pruned_loss=0.09144, over 4268420.05 frames. ], batch size: 441, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:51:12,298 INFO [train.py:996] (0/4) Epoch 4, batch 24350, loss[loss=0.2777, simple_loss=0.3349, pruned_loss=0.1102, over 21340.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3282, pruned_loss=0.09133, over 4268011.41 frames. ], batch size: 176, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:51:33,744 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-20 10:51:37,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.964e+02 3.769e+02 4.911e+02 1.046e+03, threshold=7.538e+02, percent-clipped=11.0 2023-06-20 10:52:08,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=695184.0, ans=0.125 2023-06-20 10:52:46,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=695244.0, ans=0.0 2023-06-20 10:52:56,617 INFO [train.py:996] (0/4) Epoch 4, batch 24400, loss[loss=0.3373, simple_loss=0.3968, pruned_loss=0.1389, over 21459.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3353, pruned_loss=0.09759, over 4274170.71 frames. ], batch size: 131, lr: 7.65e-03, grad_scale: 32.0 2023-06-20 10:53:07,602 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-20 10:53:25,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=695364.0, ans=0.125 2023-06-20 10:53:58,440 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. 
limit=6.0 2023-06-20 10:54:13,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=695484.0, ans=0.0 2023-06-20 10:54:37,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=695604.0, ans=0.5 2023-06-20 10:54:38,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=695604.0, ans=0.0 2023-06-20 10:54:39,032 INFO [train.py:996] (0/4) Epoch 4, batch 24450, loss[loss=0.2261, simple_loss=0.2911, pruned_loss=0.08054, over 21846.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3394, pruned_loss=0.09865, over 4270103.39 frames. ], batch size: 98, lr: 7.65e-03, grad_scale: 32.0 2023-06-20 10:54:39,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=695604.0, ans=0.125 2023-06-20 10:54:51,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=695604.0, ans=0.2 2023-06-20 10:54:59,442 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.828e+02 3.266e+02 3.769e+02 5.234e+02, threshold=6.531e+02, percent-clipped=0.0 2023-06-20 10:55:21,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=695724.0, ans=0.0 2023-06-20 10:55:55,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=695784.0, ans=0.125 2023-06-20 10:55:55,954 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.47 vs. limit=15.0 2023-06-20 10:55:58,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=695844.0, ans=0.0 2023-06-20 10:56:02,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=695844.0, ans=0.1 2023-06-20 10:56:21,865 INFO [train.py:996] (0/4) Epoch 4, batch 24500, loss[loss=0.2607, simple_loss=0.3215, pruned_loss=0.09998, over 21921.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3376, pruned_loss=0.09777, over 4274603.76 frames. ], batch size: 351, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 10:56:23,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=695904.0, ans=0.0 2023-06-20 10:56:46,691 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-116000.pt 2023-06-20 10:57:22,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=696084.0, ans=0.07 2023-06-20 10:57:27,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=696084.0, ans=0.0 2023-06-20 10:58:02,724 INFO [train.py:996] (0/4) Epoch 4, batch 24550, loss[loss=0.2888, simple_loss=0.3254, pruned_loss=0.1261, over 20245.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3404, pruned_loss=0.101, over 4268124.41 frames. 
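The [checkpoint.py:75] entry above shows a batch-count-based snapshot being written (checkpoint-116000.pt in the experiment directory) in addition to the per-epoch checkpoints. A hedged sketch of that kind of periodic saving follows; the exact fields written by the real checkpoint.py are not visible in the log, so the saved dict is illustrative.

from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx_train, exp_dir: Path,
                          save_every_n):
    """Sketch of batch-based checkpointing: every save_every_n training
    batches, write a snapshot named after the global batch count,
    e.g. checkpoint-116000.pt."""
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    filename = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        filename,
    )
    print(f"Saving checkpoint to {filename}")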
], batch size: 703, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 10:58:21,868 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.855e+02 3.314e+02 4.015e+02 6.051e+02, threshold=6.629e+02, percent-clipped=0.0 2023-06-20 10:59:14,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=696384.0, ans=0.0 2023-06-20 10:59:16,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=696384.0, ans=0.0 2023-06-20 10:59:29,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=696444.0, ans=0.125 2023-06-20 10:59:31,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=696444.0, ans=0.0 2023-06-20 10:59:39,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=696504.0, ans=0.125 2023-06-20 10:59:40,431 INFO [train.py:996] (0/4) Epoch 4, batch 24600, loss[loss=0.238, simple_loss=0.288, pruned_loss=0.094, over 21229.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3353, pruned_loss=0.1008, over 4263672.42 frames. ], batch size: 548, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 10:59:51,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=696504.0, ans=0.0 2023-06-20 10:59:59,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=696564.0, ans=0.0 2023-06-20 11:00:13,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=696564.0, ans=0.125 2023-06-20 11:00:14,388 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-20 11:00:15,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=696564.0, ans=0.0 2023-06-20 11:00:50,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=696684.0, ans=0.125 2023-06-20 11:01:18,240 INFO [train.py:996] (0/4) Epoch 4, batch 24650, loss[loss=0.2207, simple_loss=0.2703, pruned_loss=0.08553, over 21513.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3266, pruned_loss=0.09873, over 4258284.52 frames. ], batch size: 196, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 11:01:26,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=696804.0, ans=0.1 2023-06-20 11:01:39,823 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 3.151e+02 3.751e+02 4.902e+02 9.106e+02, threshold=7.501e+02, percent-clipped=6.0 2023-06-20 11:03:01,133 INFO [train.py:996] (0/4) Epoch 4, batch 24700, loss[loss=0.2763, simple_loss=0.361, pruned_loss=0.09576, over 21681.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3245, pruned_loss=0.09633, over 4252810.74 frames. 
], batch size: 414, lr: 7.64e-03, grad_scale: 16.0 2023-06-20 11:03:22,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=697164.0, ans=0.125 2023-06-20 11:03:47,172 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-20 11:03:54,014 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=22.5 2023-06-20 11:04:13,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=697284.0, ans=0.1 2023-06-20 11:04:34,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=697344.0, ans=0.125 2023-06-20 11:04:38,966 INFO [train.py:996] (0/4) Epoch 4, batch 24750, loss[loss=0.2129, simple_loss=0.2781, pruned_loss=0.07387, over 21363.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3171, pruned_loss=0.09312, over 4253324.57 frames. ], batch size: 131, lr: 7.64e-03, grad_scale: 16.0 2023-06-20 11:04:40,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=697404.0, ans=0.0 2023-06-20 11:04:50,025 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-20 11:05:00,213 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.625e+02 3.082e+02 3.571e+02 6.291e+02, threshold=6.165e+02, percent-clipped=0.0 2023-06-20 11:05:35,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=697524.0, ans=0.0 2023-06-20 11:05:53,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-20 11:06:22,061 INFO [train.py:996] (0/4) Epoch 4, batch 24800, loss[loss=0.2933, simple_loss=0.3287, pruned_loss=0.1289, over 21433.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3119, pruned_loss=0.0927, over 4257856.14 frames. ], batch size: 473, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:06:54,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=697764.0, ans=0.125 2023-06-20 11:07:17,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=697824.0, ans=0.2 2023-06-20 11:07:28,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=697824.0, ans=0.1 2023-06-20 11:08:05,517 INFO [train.py:996] (0/4) Epoch 4, batch 24850, loss[loss=0.3398, simple_loss=0.3956, pruned_loss=0.142, over 21618.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3143, pruned_loss=0.09546, over 4273409.93 frames. ], batch size: 508, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:08:22,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.34 vs. 
limit=22.5 2023-06-20 11:08:27,070 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 2.980e+02 3.585e+02 4.162e+02 8.983e+02, threshold=7.171e+02, percent-clipped=3.0 2023-06-20 11:08:58,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=698124.0, ans=0.2 2023-06-20 11:09:01,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=698124.0, ans=0.125 2023-06-20 11:09:10,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=698124.0, ans=0.0 2023-06-20 11:09:11,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=698124.0, ans=0.125 2023-06-20 11:09:31,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=698244.0, ans=15.0 2023-06-20 11:09:44,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=698244.0, ans=0.2 2023-06-20 11:09:49,557 INFO [train.py:996] (0/4) Epoch 4, batch 24900, loss[loss=0.2987, simple_loss=0.3606, pruned_loss=0.1184, over 21880.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3198, pruned_loss=0.0974, over 4275931.37 frames. ], batch size: 371, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:09:58,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=698304.0, ans=0.05 2023-06-20 11:10:32,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=698424.0, ans=0.125 2023-06-20 11:10:49,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=698424.0, ans=0.07 2023-06-20 11:11:04,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=698484.0, ans=0.0 2023-06-20 11:11:29,347 INFO [train.py:996] (0/4) Epoch 4, batch 24950, loss[loss=0.2597, simple_loss=0.3003, pruned_loss=0.1095, over 20345.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3282, pruned_loss=0.1018, over 4273411.15 frames. ], batch size: 703, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:11:48,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=698604.0, ans=0.05 2023-06-20 11:12:12,257 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 3.311e+02 3.992e+02 4.985e+02 7.150e+02, threshold=7.983e+02, percent-clipped=0.0 2023-06-20 11:12:35,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=698724.0, ans=0.0 2023-06-20 11:12:51,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=698784.0, ans=0.2 2023-06-20 11:12:55,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=698784.0, ans=0.2 2023-06-20 11:13:20,097 INFO [train.py:996] (0/4) Epoch 4, batch 25000, loss[loss=0.2504, simple_loss=0.3127, pruned_loss=0.09405, over 21645.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3332, pruned_loss=0.1037, over 4273863.14 frames. 
], batch size: 332, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:13:33,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=698904.0, ans=0.125 2023-06-20 11:13:49,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.59 vs. limit=6.0 2023-06-20 11:15:02,955 INFO [train.py:996] (0/4) Epoch 4, batch 25050, loss[loss=0.2466, simple_loss=0.3029, pruned_loss=0.09509, over 21713.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3257, pruned_loss=0.1012, over 4275447.54 frames. ], batch size: 333, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:15:40,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.775e+02 3.166e+02 3.769e+02 6.146e+02, threshold=6.333e+02, percent-clipped=0.0 2023-06-20 11:15:53,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=699324.0, ans=0.125 2023-06-20 11:16:47,502 INFO [train.py:996] (0/4) Epoch 4, batch 25100, loss[loss=0.2443, simple_loss=0.3089, pruned_loss=0.08984, over 21263.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3185, pruned_loss=0.09923, over 4276106.53 frames. ], batch size: 176, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:17:15,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=699564.0, ans=0.125 2023-06-20 11:17:23,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=699564.0, ans=0.125 2023-06-20 11:17:43,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=699624.0, ans=0.09899494936611666 2023-06-20 11:17:45,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=699624.0, ans=0.125 2023-06-20 11:18:29,743 INFO [train.py:996] (0/4) Epoch 4, batch 25150, loss[loss=0.2507, simple_loss=0.3473, pruned_loss=0.07708, over 21674.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3228, pruned_loss=0.09734, over 4258372.01 frames. ], batch size: 414, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:18:31,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=699804.0, ans=0.1 2023-06-20 11:19:00,279 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.678e+02 3.105e+02 3.619e+02 6.270e+02, threshold=6.210e+02, percent-clipped=0.0 2023-06-20 11:19:07,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=699864.0, ans=0.0 2023-06-20 11:19:21,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=699924.0, ans=0.125 2023-06-20 11:19:38,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.33 vs. 
limit=15.0 2023-06-20 11:19:54,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=700044.0, ans=0.125 2023-06-20 11:20:02,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=700044.0, ans=10.0 2023-06-20 11:20:06,422 INFO [train.py:996] (0/4) Epoch 4, batch 25200, loss[loss=0.2518, simple_loss=0.3414, pruned_loss=0.08104, over 21684.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3216, pruned_loss=0.09461, over 4254283.27 frames. ], batch size: 414, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:20:55,459 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-20 11:21:27,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=700344.0, ans=0.125 2023-06-20 11:21:43,494 INFO [train.py:996] (0/4) Epoch 4, batch 25250, loss[loss=0.2458, simple_loss=0.2908, pruned_loss=0.1004, over 21271.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3199, pruned_loss=0.09341, over 4249975.00 frames. ], batch size: 144, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:22:20,667 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.744e+02 3.112e+02 3.810e+02 6.947e+02, threshold=6.224e+02, percent-clipped=3.0 2023-06-20 11:22:49,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=700524.0, ans=0.0 2023-06-20 11:23:03,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=700584.0, ans=0.95 2023-06-20 11:23:11,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=700644.0, ans=0.0 2023-06-20 11:23:11,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=700644.0, ans=0.125 2023-06-20 11:23:26,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=700644.0, ans=0.125 2023-06-20 11:23:26,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=700644.0, ans=0.05 2023-06-20 11:23:32,843 INFO [train.py:996] (0/4) Epoch 4, batch 25300, loss[loss=0.2744, simple_loss=0.3369, pruned_loss=0.106, over 21617.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3168, pruned_loss=0.09223, over 4259924.14 frames. ], batch size: 230, lr: 7.62e-03, grad_scale: 16.0 2023-06-20 11:23:49,960 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:23:58,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. 
limit=22.5 2023-06-20 11:24:01,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=700764.0, ans=0.05 2023-06-20 11:24:03,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=700764.0, ans=0.1 2023-06-20 11:25:02,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=700944.0, ans=0.2 2023-06-20 11:25:27,718 INFO [train.py:996] (0/4) Epoch 4, batch 25350, loss[loss=0.2345, simple_loss=0.308, pruned_loss=0.08046, over 21766.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.32, pruned_loss=0.09221, over 4253146.28 frames. ], batch size: 371, lr: 7.62e-03, grad_scale: 16.0 2023-06-20 11:25:55,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.721e+02 3.100e+02 3.889e+02 7.002e+02, threshold=6.200e+02, percent-clipped=1.0 2023-06-20 11:26:35,431 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:27:05,590 INFO [train.py:996] (0/4) Epoch 4, batch 25400, loss[loss=0.2174, simple_loss=0.3139, pruned_loss=0.06047, over 19853.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3163, pruned_loss=0.09003, over 4249448.85 frames. ], batch size: 702, lr: 7.62e-03, grad_scale: 16.0 2023-06-20 11:27:17,692 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.47 vs. limit=22.5 2023-06-20 11:27:30,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=701364.0, ans=0.035 2023-06-20 11:27:47,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-20 11:28:03,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=701484.0, ans=0.0 2023-06-20 11:28:38,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=701544.0, ans=0.05 2023-06-20 11:28:42,947 INFO [train.py:996] (0/4) Epoch 4, batch 25450, loss[loss=0.2131, simple_loss=0.3012, pruned_loss=0.06243, over 21559.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3166, pruned_loss=0.09117, over 4248969.54 frames. ], batch size: 230, lr: 7.61e-03, grad_scale: 16.0 2023-06-20 11:29:17,653 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.926e+02 3.539e+02 4.386e+02 7.693e+02, threshold=7.077e+02, percent-clipped=6.0 2023-06-20 11:29:33,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=701724.0, ans=0.2 2023-06-20 11:29:42,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=701724.0, ans=0.125 2023-06-20 11:29:54,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.36 vs. 
limit=15.0 2023-06-20 11:30:20,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=701844.0, ans=0.1 2023-06-20 11:30:29,320 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2023-06-20 11:30:33,832 INFO [train.py:996] (0/4) Epoch 4, batch 25500, loss[loss=0.3613, simple_loss=0.4197, pruned_loss=0.1515, over 21454.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3183, pruned_loss=0.08939, over 4253802.34 frames. ], batch size: 507, lr: 7.61e-03, grad_scale: 16.0 2023-06-20 11:31:18,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=702024.0, ans=0.125 2023-06-20 11:31:24,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=702024.0, ans=0.125 2023-06-20 11:31:29,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=702084.0, ans=0.125 2023-06-20 11:32:07,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=702144.0, ans=0.0 2023-06-20 11:32:13,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=702144.0, ans=0.0 2023-06-20 11:32:16,678 INFO [train.py:996] (0/4) Epoch 4, batch 25550, loss[loss=0.2857, simple_loss=0.3819, pruned_loss=0.09478, over 21585.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3256, pruned_loss=0.09054, over 4252943.90 frames. ], batch size: 471, lr: 7.61e-03, grad_scale: 16.0 2023-06-20 11:32:17,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=702204.0, ans=0.0 2023-06-20 11:32:27,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-20 11:32:35,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=702204.0, ans=0.2 2023-06-20 11:32:44,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=702264.0, ans=0.0 2023-06-20 11:32:45,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.927e+02 3.531e+02 4.668e+02 7.861e+02, threshold=7.061e+02, percent-clipped=1.0 2023-06-20 11:33:16,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=702384.0, ans=0.1 2023-06-20 11:33:21,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=702384.0, ans=0.5 2023-06-20 11:33:54,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.20 vs. limit=10.0 2023-06-20 11:34:05,319 INFO [train.py:996] (0/4) Epoch 4, batch 25600, loss[loss=0.2716, simple_loss=0.34, pruned_loss=0.1015, over 21399.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3302, pruned_loss=0.09141, over 4264244.19 frames. 
], batch size: 211, lr: 7.61e-03, grad_scale: 32.0 2023-06-20 11:34:10,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=702504.0, ans=0.1 2023-06-20 11:34:14,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=702504.0, ans=0.0 2023-06-20 11:34:48,636 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. limit=10.0 2023-06-20 11:35:24,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=702744.0, ans=0.125 2023-06-20 11:35:47,374 INFO [train.py:996] (0/4) Epoch 4, batch 25650, loss[loss=0.2699, simple_loss=0.323, pruned_loss=0.1084, over 21886.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3312, pruned_loss=0.09453, over 4258816.37 frames. ], batch size: 107, lr: 7.61e-03, grad_scale: 32.0 2023-06-20 11:36:10,587 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 2.974e+02 3.726e+02 4.769e+02 9.123e+02, threshold=7.452e+02, percent-clipped=4.0 2023-06-20 11:36:27,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=702924.0, ans=0.035 2023-06-20 11:36:32,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=702924.0, ans=0.2 2023-06-20 11:36:46,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=702984.0, ans=0.1 2023-06-20 11:37:06,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=703044.0, ans=0.125 2023-06-20 11:37:31,119 INFO [train.py:996] (0/4) Epoch 4, batch 25700, loss[loss=0.2488, simple_loss=0.3157, pruned_loss=0.09098, over 21875.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3292, pruned_loss=0.09534, over 4248182.59 frames. ], batch size: 118, lr: 7.61e-03, grad_scale: 32.0 2023-06-20 11:37:36,753 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-20 11:39:00,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=703344.0, ans=0.0 2023-06-20 11:39:12,216 INFO [train.py:996] (0/4) Epoch 4, batch 25750, loss[loss=0.2716, simple_loss=0.341, pruned_loss=0.1011, over 21615.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3339, pruned_loss=0.09841, over 4256644.40 frames. ], batch size: 389, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:39:15,145 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.11 vs. limit=10.0 2023-06-20 11:39:35,974 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 3.258e+02 3.853e+02 4.721e+02 7.384e+02, threshold=7.705e+02, percent-clipped=0.0 2023-06-20 11:40:36,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.75 vs. 
limit=6.0 2023-06-20 11:40:39,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=703644.0, ans=0.1 2023-06-20 11:40:57,306 INFO [train.py:996] (0/4) Epoch 4, batch 25800, loss[loss=0.3666, simple_loss=0.431, pruned_loss=0.1511, over 21817.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3448, pruned_loss=0.1034, over 4259496.30 frames. ], batch size: 118, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:41:13,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=703764.0, ans=0.125 2023-06-20 11:41:55,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-20 11:42:39,611 INFO [train.py:996] (0/4) Epoch 4, batch 25850, loss[loss=0.2715, simple_loss=0.3225, pruned_loss=0.1102, over 21439.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3487, pruned_loss=0.1033, over 4268889.97 frames. ], batch size: 177, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:43:23,437 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.030e+02 3.694e+02 4.405e+02 6.989e+02, threshold=7.387e+02, percent-clipped=0.0 2023-06-20 11:43:47,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=704124.0, ans=0.125 2023-06-20 11:44:19,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=704244.0, ans=0.125 2023-06-20 11:44:28,587 INFO [train.py:996] (0/4) Epoch 4, batch 25900, loss[loss=0.3376, simple_loss=0.4212, pruned_loss=0.127, over 21657.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3501, pruned_loss=0.1039, over 4271275.36 frames. ], batch size: 441, lr: 7.60e-03, grad_scale: 16.0 2023-06-20 11:44:39,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=704304.0, ans=0.2 2023-06-20 11:45:07,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=704364.0, ans=0.04949747468305833 2023-06-20 11:45:14,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=704364.0, ans=0.1 2023-06-20 11:45:28,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=704424.0, ans=0.1 2023-06-20 11:45:52,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=704544.0, ans=0.1 2023-06-20 11:46:02,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=704544.0, ans=0.125 2023-06-20 11:46:17,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-20 11:46:17,925 INFO [train.py:996] (0/4) Epoch 4, batch 25950, loss[loss=0.2275, simple_loss=0.2764, pruned_loss=0.08929, over 20397.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3542, pruned_loss=0.1067, over 4268116.39 frames. 
], batch size: 702, lr: 7.60e-03, grad_scale: 16.0 2023-06-20 11:46:45,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5 2023-06-20 11:46:51,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=704664.0, ans=10.0 2023-06-20 11:46:52,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.258e+02 4.006e+02 4.613e+02 7.769e+02, threshold=8.011e+02, percent-clipped=1.0 2023-06-20 11:47:22,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.74 vs. limit=15.0 2023-06-20 11:47:39,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=704844.0, ans=0.125 2023-06-20 11:47:52,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=22.5 2023-06-20 11:48:11,751 INFO [train.py:996] (0/4) Epoch 4, batch 26000, loss[loss=0.3057, simple_loss=0.3748, pruned_loss=0.1183, over 21965.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3543, pruned_loss=0.1056, over 4271576.54 frames. ], batch size: 372, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:48:18,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=704904.0, ans=0.02 2023-06-20 11:48:39,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=704964.0, ans=0.125 2023-06-20 11:48:50,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=705024.0, ans=0.2 2023-06-20 11:48:51,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=705024.0, ans=0.125 2023-06-20 11:49:53,033 INFO [train.py:996] (0/4) Epoch 4, batch 26050, loss[loss=0.2612, simple_loss=0.3168, pruned_loss=0.1028, over 21717.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3539, pruned_loss=0.1049, over 4269131.29 frames. ], batch size: 230, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:49:53,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=705204.0, ans=0.0 2023-06-20 11:49:54,078 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-06-20 11:49:54,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.77 vs. limit=15.0 2023-06-20 11:50:17,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.846e+02 3.300e+02 3.976e+02 7.984e+02, threshold=6.600e+02, percent-clipped=0.0 2023-06-20 11:50:28,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=705324.0, ans=0.0 2023-06-20 11:51:35,561 INFO [train.py:996] (0/4) Epoch 4, batch 26100, loss[loss=0.2764, simple_loss=0.3437, pruned_loss=0.1045, over 21877.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3475, pruned_loss=0.1045, over 4281921.69 frames. 
], batch size: 107, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:51:37,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=705504.0, ans=0.1 2023-06-20 11:51:47,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=705504.0, ans=0.5 2023-06-20 11:52:48,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=705684.0, ans=0.0 2023-06-20 11:53:03,685 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-20 11:53:19,681 INFO [train.py:996] (0/4) Epoch 4, batch 26150, loss[loss=0.2802, simple_loss=0.3415, pruned_loss=0.1095, over 21811.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3437, pruned_loss=0.1047, over 4287390.29 frames. ], batch size: 282, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:53:31,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=705804.0, ans=0.0 2023-06-20 11:53:45,114 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.363e+02 3.006e+02 3.444e+02 4.248e+02 6.303e+02, threshold=6.888e+02, percent-clipped=0.0 2023-06-20 11:54:12,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=705924.0, ans=0.125 2023-06-20 11:54:59,956 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-20 11:55:05,073 INFO [train.py:996] (0/4) Epoch 4, batch 26200, loss[loss=0.2668, simple_loss=0.366, pruned_loss=0.08376, over 21652.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3442, pruned_loss=0.1024, over 4286883.85 frames. ], batch size: 389, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:55:33,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=706164.0, ans=0.1 2023-06-20 11:55:45,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=706224.0, ans=0.125 2023-06-20 11:56:46,826 INFO [train.py:996] (0/4) Epoch 4, batch 26250, loss[loss=0.3038, simple_loss=0.3672, pruned_loss=0.1202, over 21889.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3483, pruned_loss=0.1004, over 4285148.59 frames. ], batch size: 124, lr: 7.59e-03, grad_scale: 16.0 2023-06-20 11:56:47,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=706404.0, ans=0.0 2023-06-20 11:56:49,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=706404.0, ans=0.0 2023-06-20 11:57:12,722 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 2.833e+02 3.243e+02 4.065e+02 7.438e+02, threshold=6.486e+02, percent-clipped=1.0 2023-06-20 11:57:47,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=706584.0, ans=0.125 2023-06-20 11:58:28,950 INFO [train.py:996] (0/4) Epoch 4, batch 26300, loss[loss=0.2386, simple_loss=0.306, pruned_loss=0.08564, over 21654.00 frames. 
], tot_loss[loss=0.2728, simple_loss=0.3442, pruned_loss=0.1007, over 4292096.80 frames. ], batch size: 263, lr: 7.59e-03, grad_scale: 16.0 2023-06-20 11:59:06,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.91 vs. limit=15.0 2023-06-20 11:59:54,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=706884.0, ans=0.0 2023-06-20 12:00:03,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=706944.0, ans=0.025 2023-06-20 12:00:14,449 INFO [train.py:996] (0/4) Epoch 4, batch 26350, loss[loss=0.2604, simple_loss=0.3331, pruned_loss=0.09382, over 21862.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3422, pruned_loss=0.1014, over 4291814.01 frames. ], batch size: 371, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:00:15,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=707004.0, ans=0.0 2023-06-20 12:00:50,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.098e+02 3.455e+02 4.050e+02 6.767e+02, threshold=6.909e+02, percent-clipped=5.0 2023-06-20 12:01:01,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-20 12:01:03,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=707124.0, ans=0.0 2023-06-20 12:01:22,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=707184.0, ans=0.125 2023-06-20 12:01:25,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=707184.0, ans=0.0 2023-06-20 12:01:57,039 INFO [train.py:996] (0/4) Epoch 4, batch 26400, loss[loss=0.2747, simple_loss=0.3076, pruned_loss=0.1209, over 21455.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3362, pruned_loss=0.1017, over 4292572.69 frames. ], batch size: 510, lr: 7.58e-03, grad_scale: 32.0 2023-06-20 12:02:12,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=707304.0, ans=0.125 2023-06-20 12:03:46,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=707544.0, ans=0.05 2023-06-20 12:03:49,550 INFO [train.py:996] (0/4) Epoch 4, batch 26450, loss[loss=0.2615, simple_loss=0.312, pruned_loss=0.1055, over 21267.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3336, pruned_loss=0.1001, over 4286860.34 frames. 
], batch size: 159, lr: 7.58e-03, grad_scale: 32.0 2023-06-20 12:04:07,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=707604.0, ans=0.125 2023-06-20 12:04:19,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=707664.0, ans=0.04949747468305833 2023-06-20 12:04:21,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 2.976e+02 3.665e+02 4.778e+02 9.045e+02, threshold=7.330e+02, percent-clipped=3.0 2023-06-20 12:04:58,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=707784.0, ans=0.125 2023-06-20 12:05:35,249 INFO [train.py:996] (0/4) Epoch 4, batch 26500, loss[loss=0.2525, simple_loss=0.3206, pruned_loss=0.09218, over 21657.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3382, pruned_loss=0.09995, over 4282432.71 frames. ], batch size: 263, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:06:47,333 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-20 12:07:09,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=708144.0, ans=0.125 2023-06-20 12:07:12,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=708144.0, ans=0.125 2023-06-20 12:07:32,538 INFO [train.py:996] (0/4) Epoch 4, batch 26550, loss[loss=0.2461, simple_loss=0.3391, pruned_loss=0.07654, over 21705.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3352, pruned_loss=0.0964, over 4281388.78 frames. ], batch size: 391, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:07:58,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=708264.0, ans=0.07 2023-06-20 12:08:00,824 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.832e+02 3.306e+02 3.943e+02 6.835e+02, threshold=6.613e+02, percent-clipped=0.0 2023-06-20 12:08:15,972 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-20 12:08:33,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=708384.0, ans=0.1 2023-06-20 12:09:10,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=708444.0, ans=0.125 2023-06-20 12:09:10,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=708444.0, ans=0.2 2023-06-20 12:09:16,268 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-20 12:09:16,676 INFO [train.py:996] (0/4) Epoch 4, batch 26600, loss[loss=0.2445, simple_loss=0.3229, pruned_loss=0.08307, over 21724.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.334, pruned_loss=0.09309, over 4274493.89 frames. 
], batch size: 351, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:09:39,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=708564.0, ans=0.1 2023-06-20 12:10:20,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=708684.0, ans=0.04949747468305833 2023-06-20 12:10:51,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=708744.0, ans=0.125 2023-06-20 12:10:54,717 INFO [train.py:996] (0/4) Epoch 4, batch 26650, loss[loss=0.1881, simple_loss=0.2733, pruned_loss=0.05142, over 21662.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3257, pruned_loss=0.09176, over 4250165.70 frames. ], batch size: 415, lr: 7.57e-03, grad_scale: 16.0 2023-06-20 12:11:24,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=708864.0, ans=0.2 2023-06-20 12:11:27,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.914e+02 3.392e+02 4.054e+02 7.182e+02, threshold=6.783e+02, percent-clipped=2.0 2023-06-20 12:11:37,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=708924.0, ans=0.0 2023-06-20 12:12:32,368 INFO [train.py:996] (0/4) Epoch 4, batch 26700, loss[loss=0.3376, simple_loss=0.3671, pruned_loss=0.1541, over 21811.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3188, pruned_loss=0.08866, over 4256300.17 frames. ], batch size: 508, lr: 7.57e-03, grad_scale: 16.0 2023-06-20 12:13:35,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=709284.0, ans=0.2 2023-06-20 12:13:48,005 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.94 vs. limit=15.0 2023-06-20 12:14:10,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=709404.0, ans=0.1 2023-06-20 12:14:11,950 INFO [train.py:996] (0/4) Epoch 4, batch 26750, loss[loss=0.2723, simple_loss=0.3503, pruned_loss=0.09714, over 21559.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3193, pruned_loss=0.08786, over 4264754.36 frames. ], batch size: 131, lr: 7.57e-03, grad_scale: 16.0 2023-06-20 12:14:33,024 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.28 vs. limit=15.0 2023-06-20 12:14:35,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=709464.0, ans=0.0 2023-06-20 12:14:50,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.700e+02 3.315e+02 4.013e+02 5.519e+02, threshold=6.631e+02, percent-clipped=0.0 2023-06-20 12:15:56,472 INFO [train.py:996] (0/4) Epoch 4, batch 26800, loss[loss=0.2966, simple_loss=0.3667, pruned_loss=0.1133, over 21457.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3272, pruned_loss=0.09314, over 4265035.61 frames. 
], batch size: 131, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:16:27,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=709764.0, ans=0.0 2023-06-20 12:16:43,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=709824.0, ans=0.0 2023-06-20 12:16:59,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=709824.0, ans=0.125 2023-06-20 12:17:10,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=709884.0, ans=0.95 2023-06-20 12:17:43,560 INFO [train.py:996] (0/4) Epoch 4, batch 26850, loss[loss=0.2406, simple_loss=0.2921, pruned_loss=0.09453, over 21254.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3288, pruned_loss=0.0963, over 4265765.97 frames. ], batch size: 159, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:17:51,220 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-20 12:18:21,735 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.334e+02 2.953e+02 3.297e+02 3.985e+02 6.841e+02, threshold=6.593e+02, percent-clipped=1.0 2023-06-20 12:18:42,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=710124.0, ans=0.125 2023-06-20 12:18:43,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=710124.0, ans=0.0 2023-06-20 12:18:43,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=710124.0, ans=0.2 2023-06-20 12:18:45,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-20 12:19:20,907 INFO [train.py:996] (0/4) Epoch 4, batch 26900, loss[loss=0.2545, simple_loss=0.3043, pruned_loss=0.1023, over 21637.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3211, pruned_loss=0.09624, over 4265296.58 frames. ], batch size: 333, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:19:32,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=710304.0, ans=0.125 2023-06-20 12:20:10,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=710424.0, ans=0.125 2023-06-20 12:20:19,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=710424.0, ans=0.0 2023-06-20 12:20:35,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=710484.0, ans=0.2 2023-06-20 12:20:42,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=710484.0, ans=0.1 2023-06-20 12:21:01,686 INFO [train.py:996] (0/4) Epoch 4, batch 26950, loss[loss=0.3099, simple_loss=0.3945, pruned_loss=0.1127, over 21624.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3205, pruned_loss=0.09626, over 4267471.42 frames. 
], batch size: 389, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:21:12,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-20 12:21:18,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=710604.0, ans=0.125 2023-06-20 12:21:20,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-20 12:21:38,726 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.998e+02 3.332e+02 4.075e+02 6.086e+02, threshold=6.663e+02, percent-clipped=0.0 2023-06-20 12:21:39,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=710664.0, ans=0.125 2023-06-20 12:22:02,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.18 vs. limit=10.0 2023-06-20 12:22:17,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=710784.0, ans=0.125 2023-06-20 12:22:22,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=710844.0, ans=0.0 2023-06-20 12:22:23,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=710844.0, ans=0.125 2023-06-20 12:22:29,291 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=22.5 2023-06-20 12:22:31,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-20 12:22:44,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=22.5 2023-06-20 12:22:45,202 INFO [train.py:996] (0/4) Epoch 4, batch 27000, loss[loss=0.2139, simple_loss=0.3052, pruned_loss=0.06128, over 21751.00 frames. ], tot_loss[loss=0.255, simple_loss=0.322, pruned_loss=0.09399, over 4268081.58 frames. ], batch size: 282, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:22:45,203 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 12:23:07,057 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2473, simple_loss=0.3466, pruned_loss=0.07399, over 1796401.00 frames. 2023-06-20 12:23:07,058 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-20 12:23:29,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=710964.0, ans=0.125 2023-06-20 12:23:38,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. 
limit=6.0 2023-06-20 12:23:53,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=711024.0, ans=0.2 2023-06-20 12:24:39,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=711144.0, ans=0.1 2023-06-20 12:24:50,446 INFO [train.py:996] (0/4) Epoch 4, batch 27050, loss[loss=0.1912, simple_loss=0.2934, pruned_loss=0.04448, over 21398.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.324, pruned_loss=0.09038, over 4268694.80 frames. ], batch size: 211, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:25:11,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=15.0 2023-06-20 12:25:23,367 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.528e+02 2.843e+02 3.432e+02 6.081e+02, threshold=5.686e+02, percent-clipped=0.0 2023-06-20 12:25:57,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=711384.0, ans=0.125 2023-06-20 12:26:23,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=711444.0, ans=0.1 2023-06-20 12:26:28,433 INFO [train.py:996] (0/4) Epoch 4, batch 27100, loss[loss=0.242, simple_loss=0.3224, pruned_loss=0.08077, over 21456.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3257, pruned_loss=0.09209, over 4276324.49 frames. ], batch size: 131, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:26:38,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=711504.0, ans=0.125 2023-06-20 12:26:42,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=711504.0, ans=0.1 2023-06-20 12:26:49,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.76 vs. limit=22.5 2023-06-20 12:26:51,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=711564.0, ans=0.1 2023-06-20 12:26:55,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=12.0 2023-06-20 12:28:14,878 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:28:19,209 INFO [train.py:996] (0/4) Epoch 4, batch 27150, loss[loss=0.2442, simple_loss=0.3215, pruned_loss=0.08344, over 21145.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3375, pruned_loss=0.09627, over 4285506.92 frames. 
], batch size: 143, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:28:29,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=711804.0, ans=0.125 2023-06-20 12:28:43,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=711864.0, ans=0.2 2023-06-20 12:28:47,388 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.195e+02 3.755e+02 4.562e+02 7.359e+02, threshold=7.509e+02, percent-clipped=7.0 2023-06-20 12:28:54,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=711924.0, ans=0.1 2023-06-20 12:29:57,455 INFO [train.py:996] (0/4) Epoch 4, batch 27200, loss[loss=0.2848, simple_loss=0.3392, pruned_loss=0.1151, over 20013.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3452, pruned_loss=0.09868, over 4277167.72 frames. ], batch size: 703, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:30:18,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=712164.0, ans=0.125 2023-06-20 12:30:29,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=712164.0, ans=0.125 2023-06-20 12:30:44,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=15.0 2023-06-20 12:31:42,702 INFO [train.py:996] (0/4) Epoch 4, batch 27250, loss[loss=0.2944, simple_loss=0.3514, pruned_loss=0.1187, over 21948.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3486, pruned_loss=0.1037, over 4277228.31 frames. ], batch size: 372, lr: 7.56e-03, grad_scale: 16.0 2023-06-20 12:31:45,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-20 12:32:18,770 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 3.455e+02 4.039e+02 4.883e+02 8.665e+02, threshold=8.078e+02, percent-clipped=1.0 2023-06-20 12:32:19,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=712464.0, ans=0.125 2023-06-20 12:32:19,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=712464.0, ans=0.125 2023-06-20 12:32:49,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=712524.0, ans=0.125 2023-06-20 12:32:56,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=712584.0, ans=0.0 2023-06-20 12:32:58,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=712584.0, ans=0.0 2023-06-20 12:33:30,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=712644.0, ans=0.0 2023-06-20 12:33:32,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=712704.0, ans=0.125 2023-06-20 12:33:33,308 INFO [train.py:996] (0/4) Epoch 4, batch 27300, loss[loss=0.312, simple_loss=0.3827, pruned_loss=0.1206, over 21793.00 frames. 
], tot_loss[loss=0.2808, simple_loss=0.351, pruned_loss=0.1052, over 4268531.27 frames. ], batch size: 124, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:33:33,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=712704.0, ans=0.2 2023-06-20 12:35:16,735 INFO [train.py:996] (0/4) Epoch 4, batch 27350, loss[loss=0.2879, simple_loss=0.3623, pruned_loss=0.1067, over 21816.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3543, pruned_loss=0.1059, over 4272946.53 frames. ], batch size: 118, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:35:20,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=713004.0, ans=0.125 2023-06-20 12:35:20,974 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-20 12:35:25,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=713004.0, ans=0.0 2023-06-20 12:36:00,964 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.787e+02 3.114e+02 3.820e+02 5.936e+02, threshold=6.228e+02, percent-clipped=0.0 2023-06-20 12:36:04,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=713124.0, ans=0.2 2023-06-20 12:36:13,712 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-20 12:36:18,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=713124.0, ans=0.2 2023-06-20 12:36:21,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=713184.0, ans=0.0 2023-06-20 12:36:56,645 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-20 12:36:58,525 INFO [train.py:996] (0/4) Epoch 4, batch 27400, loss[loss=0.25, simple_loss=0.3118, pruned_loss=0.09413, over 21770.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3491, pruned_loss=0.1049, over 4264717.04 frames. ], batch size: 316, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:37:08,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=713304.0, ans=0.2 2023-06-20 12:37:19,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=15.0 2023-06-20 12:37:59,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=713424.0, ans=0.2 2023-06-20 12:38:41,439 INFO [train.py:996] (0/4) Epoch 4, batch 27450, loss[loss=0.2728, simple_loss=0.3429, pruned_loss=0.1014, over 21639.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3424, pruned_loss=0.1034, over 4261472.51 frames. 
], batch size: 298, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:38:45,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=713604.0, ans=0.0 2023-06-20 12:39:04,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=713604.0, ans=0.125 2023-06-20 12:39:26,436 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.637e+02 2.964e+02 3.334e+02 5.036e+02, threshold=5.928e+02, percent-clipped=0.0 2023-06-20 12:40:07,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=713844.0, ans=0.2 2023-06-20 12:40:23,680 INFO [train.py:996] (0/4) Epoch 4, batch 27500, loss[loss=0.2812, simple_loss=0.3431, pruned_loss=0.1096, over 21886.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3413, pruned_loss=0.1041, over 4261932.82 frames. ], batch size: 124, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:41:00,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=713964.0, ans=0.04949747468305833 2023-06-20 12:41:54,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=714144.0, ans=0.2 2023-06-20 12:42:02,903 INFO [train.py:996] (0/4) Epoch 4, batch 27550, loss[loss=0.2359, simple_loss=0.2974, pruned_loss=0.08725, over 21773.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3371, pruned_loss=0.1008, over 4264248.52 frames. ], batch size: 371, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:42:35,132 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-06-20 12:42:46,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=714264.0, ans=0.125 2023-06-20 12:42:49,325 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.202e+02 3.879e+02 5.081e+02 9.458e+02, threshold=7.759e+02, percent-clipped=14.0 2023-06-20 12:43:00,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=714324.0, ans=0.125 2023-06-20 12:43:23,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-20 12:43:33,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=714444.0, ans=0.125 2023-06-20 12:43:39,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=714444.0, ans=0.125 2023-06-20 12:43:50,135 INFO [train.py:996] (0/4) Epoch 4, batch 27600, loss[loss=0.2448, simple_loss=0.297, pruned_loss=0.09629, over 21321.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3302, pruned_loss=0.09954, over 4260562.21 frames. 
], batch size: 160, lr: 7.54e-03, grad_scale: 32.0 2023-06-20 12:44:03,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=714504.0, ans=0.0 2023-06-20 12:44:05,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=714504.0, ans=0.1 2023-06-20 12:44:51,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=714684.0, ans=0.125 2023-06-20 12:45:06,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=714684.0, ans=0.125 2023-06-20 12:45:26,209 INFO [train.py:996] (0/4) Epoch 4, batch 27650, loss[loss=0.2594, simple_loss=0.3299, pruned_loss=0.09447, over 21453.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3225, pruned_loss=0.09777, over 4259129.90 frames. ], batch size: 194, lr: 7.54e-03, grad_scale: 32.0 2023-06-20 12:45:45,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=714864.0, ans=0.0 2023-06-20 12:46:05,389 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-06-20 12:46:05,564 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 2.761e+02 3.105e+02 3.536e+02 5.675e+02, threshold=6.210e+02, percent-clipped=0.0 2023-06-20 12:46:25,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=714984.0, ans=0.0 2023-06-20 12:46:40,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=714984.0, ans=0.125 2023-06-20 12:46:45,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=715044.0, ans=0.1 2023-06-20 12:47:03,683 INFO [train.py:996] (0/4) Epoch 4, batch 27700, loss[loss=0.2334, simple_loss=0.3102, pruned_loss=0.07831, over 21384.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3205, pruned_loss=0.09531, over 4257583.92 frames. ], batch size: 211, lr: 7.54e-03, grad_scale: 32.0 2023-06-20 12:47:13,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=715104.0, ans=0.1 2023-06-20 12:47:32,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=715104.0, ans=0.125 2023-06-20 12:48:51,182 INFO [train.py:996] (0/4) Epoch 4, batch 27750, loss[loss=0.229, simple_loss=0.3116, pruned_loss=0.07322, over 21802.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3241, pruned_loss=0.09458, over 4262606.82 frames. ], batch size: 351, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:49:33,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.026e+02 3.500e+02 4.221e+02 6.656e+02, threshold=7.000e+02, percent-clipped=2.0 2023-06-20 12:49:41,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=715524.0, ans=0.2 2023-06-20 12:50:29,352 INFO [train.py:996] (0/4) Epoch 4, batch 27800, loss[loss=0.3051, simple_loss=0.3603, pruned_loss=0.1249, over 21869.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.324, pruned_loss=0.09488, over 4275491.30 frames. 
], batch size: 107, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:50:52,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=715704.0, ans=0.1 2023-06-20 12:51:41,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=22.5 2023-06-20 12:51:57,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=715944.0, ans=0.125 2023-06-20 12:52:17,433 INFO [train.py:996] (0/4) Epoch 4, batch 27850, loss[loss=0.2549, simple_loss=0.3101, pruned_loss=0.09982, over 21577.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3228, pruned_loss=0.09581, over 4283444.72 frames. ], batch size: 548, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:53:00,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.961e+02 3.530e+02 4.217e+02 1.068e+03, threshold=7.060e+02, percent-clipped=1.0 2023-06-20 12:53:25,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=716184.0, ans=0.07 2023-06-20 12:54:13,194 INFO [train.py:996] (0/4) Epoch 4, batch 27900, loss[loss=0.238, simple_loss=0.3295, pruned_loss=0.07331, over 21464.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3346, pruned_loss=0.09786, over 4275748.86 frames. ], batch size: 194, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:54:26,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=716304.0, ans=0.1 2023-06-20 12:54:51,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=716424.0, ans=0.0 2023-06-20 12:55:56,727 INFO [train.py:996] (0/4) Epoch 4, batch 27950, loss[loss=0.2943, simple_loss=0.3738, pruned_loss=0.1074, over 21722.00 frames. ], tot_loss[loss=0.261, simple_loss=0.334, pruned_loss=0.09398, over 4272297.19 frames. ], batch size: 441, lr: 7.53e-03, grad_scale: 16.0 2023-06-20 12:56:33,397 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.859e+02 3.547e+02 4.362e+02 7.820e+02, threshold=7.095e+02, percent-clipped=2.0 2023-06-20 12:57:00,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=716784.0, ans=0.125 2023-06-20 12:57:34,420 INFO [train.py:996] (0/4) Epoch 4, batch 28000, loss[loss=0.2818, simple_loss=0.3383, pruned_loss=0.1126, over 21858.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3305, pruned_loss=0.0916, over 4276342.23 frames. ], batch size: 414, lr: 7.53e-03, grad_scale: 32.0 2023-06-20 12:59:17,605 INFO [train.py:996] (0/4) Epoch 4, batch 28050, loss[loss=0.316, simple_loss=0.3784, pruned_loss=0.1268, over 21595.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3285, pruned_loss=0.09311, over 4278077.93 frames. 
], batch size: 508, lr: 7.53e-03, grad_scale: 32.0 2023-06-20 12:59:40,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=717264.0, ans=0.2 2023-06-20 12:59:51,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=717264.0, ans=0.0 2023-06-20 12:59:54,023 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.872e+02 3.334e+02 4.118e+02 8.421e+02, threshold=6.667e+02, percent-clipped=2.0 2023-06-20 13:00:21,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=717384.0, ans=0.0 2023-06-20 13:00:37,871 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:00:41,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=717384.0, ans=0.0 2023-06-20 13:00:53,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=717444.0, ans=0.2 2023-06-20 13:01:00,680 INFO [train.py:996] (0/4) Epoch 4, batch 28100, loss[loss=0.2354, simple_loss=0.3133, pruned_loss=0.07873, over 20721.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.327, pruned_loss=0.09195, over 4274930.81 frames. ], batch size: 608, lr: 7.53e-03, grad_scale: 32.0 2023-06-20 13:01:39,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-20 13:01:50,707 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.44 vs. limit=10.0 2023-06-20 13:01:51,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=717624.0, ans=0.125 2023-06-20 13:01:56,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=717624.0, ans=0.0 2023-06-20 13:02:12,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=717684.0, ans=0.1 2023-06-20 13:02:40,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=717804.0, ans=0.04949747468305833 2023-06-20 13:02:41,119 INFO [train.py:996] (0/4) Epoch 4, batch 28150, loss[loss=0.2299, simple_loss=0.2881, pruned_loss=0.08583, over 21606.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3208, pruned_loss=0.09253, over 4273826.55 frames. 
], batch size: 332, lr: 7.53e-03, grad_scale: 16.0 2023-06-20 13:02:56,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=717804.0, ans=0.125 2023-06-20 13:02:58,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=717804.0, ans=0.05 2023-06-20 13:03:23,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.345e+02 3.192e+02 4.033e+02 4.957e+02 1.192e+03, threshold=8.065e+02, percent-clipped=8.0 2023-06-20 13:03:54,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=717984.0, ans=0.0 2023-06-20 13:04:00,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=717984.0, ans=0.125 2023-06-20 13:04:15,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=718044.0, ans=0.125 2023-06-20 13:04:27,757 INFO [train.py:996] (0/4) Epoch 4, batch 28200, loss[loss=0.2462, simple_loss=0.3125, pruned_loss=0.0899, over 21688.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3161, pruned_loss=0.09345, over 4278603.05 frames. ], batch size: 112, lr: 7.53e-03, grad_scale: 16.0 2023-06-20 13:06:10,221 INFO [train.py:996] (0/4) Epoch 4, batch 28250, loss[loss=0.2193, simple_loss=0.2859, pruned_loss=0.07634, over 21690.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3215, pruned_loss=0.09732, over 4270465.88 frames. ], batch size: 282, lr: 7.52e-03, grad_scale: 16.0 2023-06-20 13:06:47,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=718464.0, ans=0.2 2023-06-20 13:06:49,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=718464.0, ans=0.1 2023-06-20 13:06:53,559 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 3.131e+02 3.683e+02 4.386e+02 7.452e+02, threshold=7.367e+02, percent-clipped=0.0 2023-06-20 13:07:33,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=718644.0, ans=0.025 2023-06-20 13:07:46,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=718644.0, ans=0.1 2023-06-20 13:07:54,532 INFO [train.py:996] (0/4) Epoch 4, batch 28300, loss[loss=0.2141, simple_loss=0.2989, pruned_loss=0.06462, over 21772.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3201, pruned_loss=0.09521, over 4255159.19 frames. ], batch size: 282, lr: 7.52e-03, grad_scale: 16.0 2023-06-20 13:08:24,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=718764.0, ans=0.0 2023-06-20 13:08:30,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=718764.0, ans=0.125 2023-06-20 13:09:17,016 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:09:43,138 INFO [train.py:996] (0/4) Epoch 4, batch 28350, loss[loss=0.2439, simple_loss=0.3062, pruned_loss=0.09078, over 21652.00 frames. 
], tot_loss[loss=0.2467, simple_loss=0.3159, pruned_loss=0.08873, over 4260667.99 frames. ], batch size: 332, lr: 7.52e-03, grad_scale: 16.0 2023-06-20 13:09:53,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=719004.0, ans=0.125 2023-06-20 13:10:26,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.689e+02 3.125e+02 3.923e+02 6.563e+02, threshold=6.250e+02, percent-clipped=0.0 2023-06-20 13:10:57,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=719184.0, ans=0.0 2023-06-20 13:11:00,141 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5 2023-06-20 13:11:30,296 INFO [train.py:996] (0/4) Epoch 4, batch 28400, loss[loss=0.2496, simple_loss=0.3127, pruned_loss=0.09325, over 16284.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3127, pruned_loss=0.0881, over 4250967.65 frames. ], batch size: 61, lr: 7.52e-03, grad_scale: 32.0 2023-06-20 13:11:36,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=719304.0, ans=0.125 2023-06-20 13:12:21,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=719424.0, ans=0.0 2023-06-20 13:12:58,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=719544.0, ans=0.0 2023-06-20 13:13:07,868 INFO [train.py:996] (0/4) Epoch 4, batch 28450, loss[loss=0.2722, simple_loss=0.3375, pruned_loss=0.1035, over 21692.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3178, pruned_loss=0.09195, over 4258816.31 frames. ], batch size: 298, lr: 7.52e-03, grad_scale: 32.0 2023-06-20 13:13:50,965 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.987e+02 3.410e+02 4.090e+02 6.526e+02, threshold=6.821e+02, percent-clipped=2.0 2023-06-20 13:13:59,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=719724.0, ans=0.0 2023-06-20 13:14:23,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.93 vs. limit=15.0 2023-06-20 13:14:25,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=719784.0, ans=0.125 2023-06-20 13:14:28,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=719844.0, ans=0.1 2023-06-20 13:14:30,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=719844.0, ans=0.07 2023-06-20 13:14:50,143 INFO [train.py:996] (0/4) Epoch 4, batch 28500, loss[loss=0.2725, simple_loss=0.3438, pruned_loss=0.1006, over 21413.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3215, pruned_loss=0.09533, over 4267665.82 frames. 
], batch size: 131, lr: 7.52e-03, grad_scale: 32.0 2023-06-20 13:15:17,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=719964.0, ans=0.125 2023-06-20 13:15:20,565 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-120000.pt 2023-06-20 13:15:42,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=720024.0, ans=0.0 2023-06-20 13:15:48,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=720024.0, ans=0.1 2023-06-20 13:15:52,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=720084.0, ans=0.125 2023-06-20 13:15:59,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=720084.0, ans=0.95 2023-06-20 13:16:41,766 INFO [train.py:996] (0/4) Epoch 4, batch 28550, loss[loss=0.3165, simple_loss=0.396, pruned_loss=0.1185, over 21274.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3313, pruned_loss=0.09915, over 4270575.83 frames. ], batch size: 548, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:16:59,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=720264.0, ans=0.0 2023-06-20 13:17:00,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=720264.0, ans=0.2 2023-06-20 13:17:19,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=720324.0, ans=0.0 2023-06-20 13:17:19,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-20 13:17:20,274 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.197e+02 3.692e+02 4.478e+02 6.914e+02, threshold=7.384e+02, percent-clipped=1.0 2023-06-20 13:18:25,253 INFO [train.py:996] (0/4) Epoch 4, batch 28600, loss[loss=0.2573, simple_loss=0.3148, pruned_loss=0.09992, over 20033.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3388, pruned_loss=0.1022, over 4272128.42 frames. ], batch size: 703, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:19:24,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=720624.0, ans=0.1 2023-06-20 13:19:40,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=720684.0, ans=0.0 2023-06-20 13:19:42,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=720684.0, ans=0.0 2023-06-20 13:20:04,507 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:20:07,389 INFO [train.py:996] (0/4) Epoch 4, batch 28650, loss[loss=0.2463, simple_loss=0.2966, pruned_loss=0.09798, over 21641.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3331, pruned_loss=0.1012, over 4279074.24 frames. 
], batch size: 282, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:20:40,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-20 13:20:46,274 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 2.893e+02 3.240e+02 3.665e+02 6.143e+02, threshold=6.480e+02, percent-clipped=0.0 2023-06-20 13:21:01,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=720924.0, ans=0.0 2023-06-20 13:21:05,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=720924.0, ans=0.0 2023-06-20 13:21:09,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-20 13:21:51,603 INFO [train.py:996] (0/4) Epoch 4, batch 28700, loss[loss=0.2333, simple_loss=0.3001, pruned_loss=0.08326, over 21145.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3328, pruned_loss=0.1027, over 4271602.21 frames. ], batch size: 143, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:22:08,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.19 vs. limit=12.0 2023-06-20 13:22:21,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=721164.0, ans=0.0 2023-06-20 13:22:25,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=721164.0, ans=0.1 2023-06-20 13:23:14,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=721344.0, ans=0.125 2023-06-20 13:23:32,191 INFO [train.py:996] (0/4) Epoch 4, batch 28750, loss[loss=0.3326, simple_loss=0.3846, pruned_loss=0.1403, over 21819.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.331, pruned_loss=0.1026, over 4274918.10 frames. ], batch size: 112, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:24:15,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.193e+02 3.841e+02 4.687e+02 9.363e+02, threshold=7.682e+02, percent-clipped=10.0 2023-06-20 13:24:26,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=721524.0, ans=0.1 2023-06-20 13:24:44,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=721584.0, ans=0.1 2023-06-20 13:25:15,130 INFO [train.py:996] (0/4) Epoch 4, batch 28800, loss[loss=0.2602, simple_loss=0.3342, pruned_loss=0.09315, over 21758.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3351, pruned_loss=0.1021, over 4263612.82 frames. ], batch size: 332, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:26:43,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=15.0 2023-06-20 13:26:50,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=721944.0, ans=0.0 2023-06-20 13:26:52,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.14 vs. 
limit=22.5 2023-06-20 13:26:58,354 INFO [train.py:996] (0/4) Epoch 4, batch 28850, loss[loss=0.2842, simple_loss=0.343, pruned_loss=0.1127, over 21882.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3381, pruned_loss=0.1046, over 4269821.97 frames. ], batch size: 371, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:27:36,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 3.137e+02 3.550e+02 4.295e+02 6.856e+02, threshold=7.100e+02, percent-clipped=0.0 2023-06-20 13:27:40,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=722124.0, ans=0.125 2023-06-20 13:27:43,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=722124.0, ans=0.125 2023-06-20 13:28:00,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=722184.0, ans=0.125 2023-06-20 13:28:07,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=722184.0, ans=0.125 2023-06-20 13:28:13,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-20 13:28:23,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=722244.0, ans=0.125 2023-06-20 13:28:42,266 INFO [train.py:996] (0/4) Epoch 4, batch 28900, loss[loss=0.2844, simple_loss=0.3519, pruned_loss=0.1085, over 21357.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3405, pruned_loss=0.1064, over 4279576.25 frames. ], batch size: 548, lr: 7.50e-03, grad_scale: 32.0 2023-06-20 13:29:21,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=722424.0, ans=0.125 2023-06-20 13:30:06,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=722544.0, ans=0.1 2023-06-20 13:30:19,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=722544.0, ans=0.125 2023-06-20 13:30:28,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=722544.0, ans=0.125 2023-06-20 13:30:30,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=22.5 2023-06-20 13:30:30,854 INFO [train.py:996] (0/4) Epoch 4, batch 28950, loss[loss=0.2244, simple_loss=0.3051, pruned_loss=0.07189, over 21695.00 frames. ], tot_loss[loss=0.2763, simple_loss=0.3422, pruned_loss=0.1052, over 4283088.97 frames. ], batch size: 247, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:30:46,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=722664.0, ans=0.125 2023-06-20 13:31:11,281 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.092e+02 3.646e+02 4.374e+02 7.156e+02, threshold=7.293e+02, percent-clipped=1.0 2023-06-20 13:31:13,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.35 vs. 
limit=22.5 2023-06-20 13:31:33,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=722784.0, ans=0.1 2023-06-20 13:32:13,706 INFO [train.py:996] (0/4) Epoch 4, batch 29000, loss[loss=0.2818, simple_loss=0.3445, pruned_loss=0.1096, over 19982.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3469, pruned_loss=0.105, over 4279784.76 frames. ], batch size: 702, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:33:56,567 INFO [train.py:996] (0/4) Epoch 4, batch 29050, loss[loss=0.2738, simple_loss=0.3435, pruned_loss=0.102, over 21787.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3436, pruned_loss=0.1052, over 4283421.48 frames. ], batch size: 112, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:34:04,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=723204.0, ans=0.125 2023-06-20 13:34:40,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.777e+02 3.114e+02 3.614e+02 7.723e+02, threshold=6.228e+02, percent-clipped=0.0 2023-06-20 13:34:43,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=723324.0, ans=0.2 2023-06-20 13:34:57,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=723324.0, ans=0.125 2023-06-20 13:35:24,879 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:35:36,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=723504.0, ans=0.0 2023-06-20 13:35:37,379 INFO [train.py:996] (0/4) Epoch 4, batch 29100, loss[loss=0.2275, simple_loss=0.2788, pruned_loss=0.08813, over 21261.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3342, pruned_loss=0.1024, over 4287042.81 frames. ], batch size: 160, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:35:46,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=723504.0, ans=0.2 2023-06-20 13:35:56,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=723504.0, ans=0.0 2023-06-20 13:36:05,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2023-06-20 13:36:06,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=723564.0, ans=0.05 2023-06-20 13:36:33,210 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.40 vs. limit=15.0 2023-06-20 13:37:00,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=723684.0, ans=0.125 2023-06-20 13:37:13,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=723744.0, ans=0.1 2023-06-20 13:37:19,168 INFO [train.py:996] (0/4) Epoch 4, batch 29150, loss[loss=0.286, simple_loss=0.375, pruned_loss=0.09854, over 21844.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3314, pruned_loss=0.1001, over 4279744.07 frames. 
], batch size: 371, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:37:53,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.39 vs. limit=15.0 2023-06-20 13:38:03,275 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.983e+02 3.477e+02 4.696e+02 8.127e+02, threshold=6.954e+02, percent-clipped=11.0 2023-06-20 13:38:09,646 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-20 13:38:48,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=724044.0, ans=0.0 2023-06-20 13:39:00,562 INFO [train.py:996] (0/4) Epoch 4, batch 29200, loss[loss=0.214, simple_loss=0.2725, pruned_loss=0.07772, over 20750.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3262, pruned_loss=0.0986, over 4271318.54 frames. ], batch size: 607, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:39:09,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=724104.0, ans=0.125 2023-06-20 13:39:16,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=724104.0, ans=0.0 2023-06-20 13:39:23,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=724164.0, ans=0.125 2023-06-20 13:39:54,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-06-20 13:40:07,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=724224.0, ans=0.125 2023-06-20 13:40:47,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=724404.0, ans=0.1 2023-06-20 13:40:48,583 INFO [train.py:996] (0/4) Epoch 4, batch 29250, loss[loss=0.2826, simple_loss=0.362, pruned_loss=0.1016, over 21858.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3237, pruned_loss=0.09547, over 4266739.35 frames. ], batch size: 373, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:40:49,580 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=22.5 2023-06-20 13:41:11,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=724464.0, ans=0.125 2023-06-20 13:41:13,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.11 vs. limit=10.0 2023-06-20 13:41:32,106 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.828e+02 3.224e+02 4.215e+02 8.591e+02, threshold=6.449e+02, percent-clipped=2.0 2023-06-20 13:42:27,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=724644.0, ans=22.5 2023-06-20 13:42:29,517 INFO [train.py:996] (0/4) Epoch 4, batch 29300, loss[loss=0.2178, simple_loss=0.2812, pruned_loss=0.07725, over 21408.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3259, pruned_loss=0.09491, over 4272885.74 frames. 
], batch size: 131, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:42:31,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=724704.0, ans=0.125 2023-06-20 13:42:47,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=724704.0, ans=0.125 2023-06-20 13:43:32,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=724884.0, ans=0.125 2023-06-20 13:43:36,168 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:43:48,440 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-20 13:44:01,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=724944.0, ans=0.125 2023-06-20 13:44:07,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=724944.0, ans=0.1 2023-06-20 13:44:16,747 INFO [train.py:996] (0/4) Epoch 4, batch 29350, loss[loss=0.2613, simple_loss=0.3452, pruned_loss=0.08867, over 21856.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3232, pruned_loss=0.09495, over 4277059.21 frames. ], batch size: 373, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:44:17,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=725004.0, ans=0.125 2023-06-20 13:45:02,538 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 2.825e+02 3.145e+02 3.740e+02 7.269e+02, threshold=6.289e+02, percent-clipped=1.0 2023-06-20 13:45:19,719 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:45:44,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=725244.0, ans=10.0 2023-06-20 13:46:00,223 INFO [train.py:996] (0/4) Epoch 4, batch 29400, loss[loss=0.2718, simple_loss=0.349, pruned_loss=0.09728, over 21454.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.322, pruned_loss=0.09243, over 4272231.66 frames. ], batch size: 471, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:46:35,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=725364.0, ans=0.05 2023-06-20 13:46:59,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=725484.0, ans=0.1 2023-06-20 13:47:18,964 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-20 13:47:42,424 INFO [train.py:996] (0/4) Epoch 4, batch 29450, loss[loss=0.2843, simple_loss=0.3561, pruned_loss=0.1062, over 21329.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3192, pruned_loss=0.09066, over 4275053.33 frames. 
], batch size: 549, lr: 7.49e-03, grad_scale: 16.0 2023-06-20 13:48:21,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=725664.0, ans=0.125 2023-06-20 13:48:28,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.191e+02 3.113e+02 3.518e+02 4.347e+02 7.926e+02, threshold=7.036e+02, percent-clipped=6.0 2023-06-20 13:48:32,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=725724.0, ans=0.125 2023-06-20 13:49:05,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=725784.0, ans=0.0 2023-06-20 13:49:24,094 INFO [train.py:996] (0/4) Epoch 4, batch 29500, loss[loss=0.2767, simple_loss=0.3307, pruned_loss=0.1113, over 21809.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3256, pruned_loss=0.0951, over 4279773.42 frames. ], batch size: 124, lr: 7.49e-03, grad_scale: 16.0 2023-06-20 13:50:37,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=726084.0, ans=0.125 2023-06-20 13:51:04,353 INFO [train.py:996] (0/4) Epoch 4, batch 29550, loss[loss=0.2529, simple_loss=0.3156, pruned_loss=0.09508, over 21887.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3259, pruned_loss=0.09796, over 4292023.23 frames. ], batch size: 414, lr: 7.48e-03, grad_scale: 16.0 2023-06-20 13:51:44,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=726264.0, ans=0.1 2023-06-20 13:51:48,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=726324.0, ans=0.2 2023-06-20 13:51:50,790 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.392e+02 2.813e+02 3.221e+02 3.823e+02 6.296e+02, threshold=6.442e+02, percent-clipped=0.0 2023-06-20 13:52:16,543 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2023-06-20 13:52:26,252 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.34 vs. limit=15.0 2023-06-20 13:52:42,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=726444.0, ans=0.0 2023-06-20 13:52:51,284 INFO [train.py:996] (0/4) Epoch 4, batch 29600, loss[loss=0.3046, simple_loss=0.3837, pruned_loss=0.1127, over 21724.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3314, pruned_loss=0.09985, over 4290805.52 frames. ], batch size: 351, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:53:53,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=726684.0, ans=0.04949747468305833 2023-06-20 13:53:58,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=726684.0, ans=0.125 2023-06-20 13:54:18,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-20 13:54:33,052 INFO [train.py:996] (0/4) Epoch 4, batch 29650, loss[loss=0.3018, simple_loss=0.4082, pruned_loss=0.09773, over 19841.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3316, pruned_loss=0.0968, over 4288204.21 frames. 
], batch size: 702, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:55:05,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=726864.0, ans=0.07 2023-06-20 13:55:13,477 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 3.008e+02 3.849e+02 5.213e+02 1.335e+03, threshold=7.697e+02, percent-clipped=14.0 2023-06-20 13:55:19,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=726924.0, ans=15.0 2023-06-20 13:56:14,510 INFO [train.py:996] (0/4) Epoch 4, batch 29700, loss[loss=0.2797, simple_loss=0.3664, pruned_loss=0.09645, over 19912.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3312, pruned_loss=0.09642, over 4287599.98 frames. ], batch size: 702, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:57:18,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-20 13:57:56,356 INFO [train.py:996] (0/4) Epoch 4, batch 29750, loss[loss=0.2641, simple_loss=0.3462, pruned_loss=0.09094, over 21613.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3363, pruned_loss=0.09589, over 4291604.99 frames. ], batch size: 230, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:58:07,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=727404.0, ans=0.125 2023-06-20 13:58:10,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=727404.0, ans=0.1 2023-06-20 13:58:13,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=727404.0, ans=0.0 2023-06-20 13:58:16,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=727464.0, ans=0.125 2023-06-20 13:58:36,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.282e+02 2.793e+02 3.232e+02 3.849e+02 7.208e+02, threshold=6.464e+02, percent-clipped=0.0 2023-06-20 13:58:50,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=727524.0, ans=0.0 2023-06-20 13:59:08,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=727584.0, ans=0.1 2023-06-20 13:59:15,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.17 vs. limit=6.0 2023-06-20 13:59:23,598 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-20 13:59:36,828 INFO [train.py:996] (0/4) Epoch 4, batch 29800, loss[loss=0.2843, simple_loss=0.3392, pruned_loss=0.1147, over 21787.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3376, pruned_loss=0.09692, over 4293212.91 frames. ], batch size: 441, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 14:00:34,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=727824.0, ans=0.0 2023-06-20 14:01:10,626 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. 
limit=22.5 2023-06-20 14:01:23,952 INFO [train.py:996] (0/4) Epoch 4, batch 29850, loss[loss=0.2244, simple_loss=0.2991, pruned_loss=0.07486, over 21622.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3335, pruned_loss=0.09501, over 4288916.55 frames. ], batch size: 263, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:01:42,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=728064.0, ans=0.0 2023-06-20 14:01:46,664 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=12.0 2023-06-20 14:01:49,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=728064.0, ans=0.125 2023-06-20 14:02:05,838 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.765e+02 3.288e+02 3.804e+02 6.956e+02, threshold=6.577e+02, percent-clipped=2.0 2023-06-20 14:02:20,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=728124.0, ans=0.2 2023-06-20 14:02:31,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-20 14:03:06,620 INFO [train.py:996] (0/4) Epoch 4, batch 29900, loss[loss=0.3278, simple_loss=0.3771, pruned_loss=0.1392, over 21582.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.331, pruned_loss=0.09654, over 4292200.65 frames. ], batch size: 389, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:04:12,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=728484.0, ans=0.125 2023-06-20 14:04:49,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=728604.0, ans=0.125 2023-06-20 14:04:50,341 INFO [train.py:996] (0/4) Epoch 4, batch 29950, loss[loss=0.2276, simple_loss=0.276, pruned_loss=0.08956, over 20164.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3355, pruned_loss=0.1007, over 4289582.31 frames. ], batch size: 704, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:05:35,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=728724.0, ans=0.125 2023-06-20 14:05:36,642 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.386e+02 3.024e+02 3.410e+02 4.010e+02 6.604e+02, threshold=6.821e+02, percent-clipped=1.0 2023-06-20 14:06:33,133 INFO [train.py:996] (0/4) Epoch 4, batch 30000, loss[loss=0.2185, simple_loss=0.3146, pruned_loss=0.06123, over 21665.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3382, pruned_loss=0.1015, over 4289747.79 frames. ], batch size: 263, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:06:33,134 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 14:06:48,293 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.5487, 4.5874, 4.2499, 4.3096], device='cuda:0') 2023-06-20 14:06:48,894 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.8154, 4.0003, 2.1282, 2.1952], device='cuda:0') 2023-06-20 14:06:55,182 INFO [train.py:1028] (0/4) Epoch 4, validation: loss=0.2513, simple_loss=0.3514, pruned_loss=0.07557, over 1796401.00 frames. 
2023-06-20 14:06:55,183 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-20 14:07:55,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=729024.0, ans=0.1 2023-06-20 14:08:03,161 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-20 14:08:04,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=729084.0, ans=0.1 2023-06-20 14:08:48,656 INFO [train.py:996] (0/4) Epoch 4, batch 30050, loss[loss=0.2746, simple_loss=0.3705, pruned_loss=0.08939, over 21700.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3402, pruned_loss=0.09775, over 4283634.55 frames. ], batch size: 298, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:09:20,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-20 14:09:34,081 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.668e+02 3.143e+02 3.876e+02 8.051e+02, threshold=6.286e+02, percent-clipped=2.0 2023-06-20 14:09:43,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=729324.0, ans=0.125 2023-06-20 14:10:30,492 INFO [train.py:996] (0/4) Epoch 4, batch 30100, loss[loss=0.259, simple_loss=0.3041, pruned_loss=0.1069, over 21769.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.339, pruned_loss=0.09777, over 4281118.54 frames. ], batch size: 112, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:10:38,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-06-20 14:11:00,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=729564.0, ans=0.0 2023-06-20 14:11:10,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=729564.0, ans=0.0 2023-06-20 14:11:25,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=729624.0, ans=0.125 2023-06-20 14:11:51,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=729684.0, ans=0.1 2023-06-20 14:12:04,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=729744.0, ans=0.1 2023-06-20 14:12:04,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=729744.0, ans=0.125 2023-06-20 14:12:08,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=729744.0, ans=0.2 2023-06-20 14:12:13,046 INFO [train.py:996] (0/4) Epoch 4, batch 30150, loss[loss=0.3435, simple_loss=0.3893, pruned_loss=0.1488, over 21855.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3353, pruned_loss=0.09895, over 4275731.68 frames. 
], batch size: 441, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:13:07,784 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.395e+02 3.286e+02 3.695e+02 4.286e+02 6.850e+02, threshold=7.389e+02, percent-clipped=1.0 2023-06-20 14:13:40,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=729984.0, ans=0.125 2023-06-20 14:13:59,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=730044.0, ans=0.125 2023-06-20 14:14:05,646 INFO [train.py:996] (0/4) Epoch 4, batch 30200, loss[loss=0.3222, simple_loss=0.3874, pruned_loss=0.1285, over 19958.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3393, pruned_loss=0.09852, over 4266229.46 frames. ], batch size: 702, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:14:12,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=730104.0, ans=0.125 2023-06-20 14:14:53,318 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.84 vs. limit=10.0 2023-06-20 14:14:59,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=730224.0, ans=0.125 2023-06-20 14:15:04,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=730224.0, ans=0.125 2023-06-20 14:15:55,851 INFO [train.py:996] (0/4) Epoch 4, batch 30250, loss[loss=0.305, simple_loss=0.3856, pruned_loss=0.1122, over 21255.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3443, pruned_loss=0.1003, over 4269268.24 frames. ], batch size: 159, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:15:59,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=730404.0, ans=0.0 2023-06-20 14:16:09,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=730404.0, ans=0.2 2023-06-20 14:16:29,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=730464.0, ans=0.125 2023-06-20 14:16:34,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=730524.0, ans=0.04949747468305833 2023-06-20 14:16:35,805 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.958e+02 3.586e+02 4.420e+02 6.930e+02, threshold=7.173e+02, percent-clipped=0.0 2023-06-20 14:17:08,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=730584.0, ans=0.125 2023-06-20 14:17:37,303 INFO [train.py:996] (0/4) Epoch 4, batch 30300, loss[loss=0.2282, simple_loss=0.2814, pruned_loss=0.08746, over 21300.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3426, pruned_loss=0.09998, over 4269794.24 frames. 
], batch size: 160, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:17:38,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=730704.0, ans=0.07 2023-06-20 14:18:06,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=730764.0, ans=0.125 2023-06-20 14:18:43,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=730884.0, ans=0.0 2023-06-20 14:19:01,124 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.36 vs. limit=12.0 2023-06-20 14:19:05,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=730944.0, ans=0.025 2023-06-20 14:19:16,753 INFO [train.py:996] (0/4) Epoch 4, batch 30350, loss[loss=0.2344, simple_loss=0.2956, pruned_loss=0.08655, over 21451.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3419, pruned_loss=0.1008, over 4266929.33 frames. ], batch size: 194, lr: 7.46e-03, grad_scale: 16.0 2023-06-20 14:19:44,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=731064.0, ans=0.125 2023-06-20 14:19:46,132 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:19:49,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=731064.0, ans=0.5 2023-06-20 14:19:55,303 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. limit=6.0 2023-06-20 14:19:55,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.242e+02 3.707e+02 4.467e+02 6.879e+02, threshold=7.414e+02, percent-clipped=0.0 2023-06-20 14:20:07,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=731184.0, ans=0.1 2023-06-20 14:20:43,909 INFO [train.py:996] (0/4) Epoch 4, batch 30400, loss[loss=0.2786, simple_loss=0.3351, pruned_loss=0.1111, over 20165.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3357, pruned_loss=0.09873, over 4254726.29 frames. ], batch size: 702, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:20:59,701 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=22.5 2023-06-20 14:21:04,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=731364.0, ans=0.125 2023-06-20 14:21:10,437 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-20 14:21:26,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=731424.0, ans=0.0 2023-06-20 14:22:05,350 INFO [train.py:996] (0/4) Epoch 4, batch 30450, loss[loss=0.3047, simple_loss=0.4094, pruned_loss=0.09999, over 19904.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3368, pruned_loss=0.09818, over 4198047.87 frames. 
], batch size: 702, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:22:25,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=731664.0, ans=0.0 2023-06-20 14:22:43,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.654e+02 3.992e+02 5.785e+02 8.199e+02 3.035e+03, threshold=1.157e+03, percent-clipped=30.0 2023-06-20 14:22:53,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=731784.0, ans=0.125 2023-06-20 14:23:08,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=731844.0, ans=0.125 2023-06-20 14:23:12,432 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/epoch-4.pt 2023-06-20 14:25:02,477 INFO [train.py:996] (0/4) Epoch 5, batch 0, loss[loss=0.2855, simple_loss=0.3329, pruned_loss=0.1191, over 21544.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3329, pruned_loss=0.1191, over 21544.00 frames. ], batch size: 391, lr: 6.61e-03, grad_scale: 32.0 2023-06-20 14:25:02,479 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 14:25:18,276 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2519, simple_loss=0.3587, pruned_loss=0.07257, over 1796401.00 frames. 2023-06-20 14:25:18,277 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-20 14:25:20,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-20 14:25:51,006 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. limit=10.0 2023-06-20 14:26:30,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-20 14:26:33,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-20 14:26:45,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=732114.0, ans=0.125 2023-06-20 14:26:54,778 INFO [train.py:996] (0/4) Epoch 5, batch 50, loss[loss=0.2548, simple_loss=0.3278, pruned_loss=0.0909, over 21412.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.333, pruned_loss=0.09762, over 955888.49 frames. ], batch size: 211, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:27:53,568 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.548e+02 3.294e+02 4.063e+02 6.432e+02 1.595e+03, threshold=8.127e+02, percent-clipped=6.0 2023-06-20 14:28:32,496 INFO [train.py:996] (0/4) Epoch 5, batch 100, loss[loss=0.2702, simple_loss=0.3737, pruned_loss=0.08338, over 21851.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3556, pruned_loss=0.101, over 1687860.39 frames. 
], batch size: 316, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:28:45,720 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:28:47,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=732474.0, ans=0.2 2023-06-20 14:29:06,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=732534.0, ans=0.125 2023-06-20 14:29:11,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=732534.0, ans=0.0 2023-06-20 14:29:57,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=732714.0, ans=0.2 2023-06-20 14:30:09,621 INFO [train.py:996] (0/4) Epoch 5, batch 150, loss[loss=0.2761, simple_loss=0.3879, pruned_loss=0.08219, over 19799.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3561, pruned_loss=0.1006, over 2260810.68 frames. ], batch size: 702, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:30:23,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=732774.0, ans=0.0 2023-06-20 14:30:43,995 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-06-20 14:30:52,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=732894.0, ans=0.0 2023-06-20 14:31:07,039 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.823e+02 3.207e+02 3.913e+02 7.422e+02, threshold=6.414e+02, percent-clipped=0.0 2023-06-20 14:31:49,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=733074.0, ans=0.2 2023-06-20 14:31:50,647 INFO [train.py:996] (0/4) Epoch 5, batch 200, loss[loss=0.2972, simple_loss=0.3696, pruned_loss=0.1124, over 21442.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3533, pruned_loss=0.09925, over 2695750.31 frames. ], batch size: 131, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:32:11,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=733134.0, ans=0.125 2023-06-20 14:32:18,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=733134.0, ans=0.05 2023-06-20 14:32:36,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=733194.0, ans=0.125 2023-06-20 14:32:39,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=733194.0, ans=0.1 2023-06-20 14:32:49,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=733254.0, ans=0.125 2023-06-20 14:32:50,063 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.50 vs. 
limit=12.0 2023-06-20 14:33:19,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=733314.0, ans=0.125 2023-06-20 14:33:24,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=733314.0, ans=0.0 2023-06-20 14:33:27,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=733314.0, ans=0.1 2023-06-20 14:33:27,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=733314.0, ans=0.2 2023-06-20 14:33:32,171 INFO [train.py:996] (0/4) Epoch 5, batch 250, loss[loss=0.2975, simple_loss=0.3565, pruned_loss=0.1192, over 21825.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3486, pruned_loss=0.1004, over 3043844.58 frames. ], batch size: 282, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:34:27,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-20 14:34:30,981 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.953e+02 3.443e+02 4.116e+02 7.444e+02, threshold=6.886e+02, percent-clipped=2.0 2023-06-20 14:34:40,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=733554.0, ans=0.1 2023-06-20 14:35:14,593 INFO [train.py:996] (0/4) Epoch 5, batch 300, loss[loss=0.2285, simple_loss=0.3091, pruned_loss=0.07398, over 21309.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3434, pruned_loss=0.09903, over 3319628.12 frames. ], batch size: 159, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:35:15,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=733674.0, ans=10.0 2023-06-20 14:35:23,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=733674.0, ans=0.125 2023-06-20 14:35:43,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2023-06-20 14:36:06,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=733794.0, ans=0.015 2023-06-20 14:36:37,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=733914.0, ans=0.0 2023-06-20 14:36:51,862 INFO [train.py:996] (0/4) Epoch 5, batch 350, loss[loss=0.2378, simple_loss=0.3064, pruned_loss=0.08465, over 21723.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3368, pruned_loss=0.09793, over 3528680.67 frames. 
], batch size: 282, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:37:51,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.253e+02 2.828e+02 3.218e+02 3.899e+02 6.662e+02, threshold=6.437e+02, percent-clipped=0.0 2023-06-20 14:38:12,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=734154.0, ans=0.2 2023-06-20 14:38:12,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=734154.0, ans=0.125 2023-06-20 14:38:18,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=734214.0, ans=0.0 2023-06-20 14:38:33,751 INFO [train.py:996] (0/4) Epoch 5, batch 400, loss[loss=0.1977, simple_loss=0.3011, pruned_loss=0.0471, over 21803.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3301, pruned_loss=0.09484, over 3702922.40 frames. ], batch size: 316, lr: 6.59e-03, grad_scale: 32.0 2023-06-20 14:38:38,224 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-20 14:39:01,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=734334.0, ans=0.2 2023-06-20 14:39:14,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=734394.0, ans=0.0 2023-06-20 14:39:55,099 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-20 14:40:15,225 INFO [train.py:996] (0/4) Epoch 5, batch 450, loss[loss=0.245, simple_loss=0.2893, pruned_loss=0.1003, over 21298.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3302, pruned_loss=0.09518, over 3831655.68 frames. ], batch size: 160, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:40:17,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=734574.0, ans=0.0 2023-06-20 14:40:39,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=734634.0, ans=0.1 2023-06-20 14:41:20,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.914e+02 3.934e+02 5.626e+02 1.302e+03, threshold=7.868e+02, percent-clipped=18.0 2023-06-20 14:41:32,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=734754.0, ans=0.2 2023-06-20 14:41:46,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-20 14:41:55,844 INFO [train.py:996] (0/4) Epoch 5, batch 500, loss[loss=0.2157, simple_loss=0.2749, pruned_loss=0.07821, over 20732.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3277, pruned_loss=0.09398, over 3933660.71 frames. ], batch size: 608, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:42:06,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.05 vs. 
limit=10.0 2023-06-20 14:42:14,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=734874.0, ans=0.2 2023-06-20 14:42:23,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=734934.0, ans=0.015 2023-06-20 14:42:25,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=734934.0, ans=0.125 2023-06-20 14:43:37,126 INFO [train.py:996] (0/4) Epoch 5, batch 550, loss[loss=0.3598, simple_loss=0.4254, pruned_loss=0.1471, over 21576.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3285, pruned_loss=0.09342, over 4013554.35 frames. ], batch size: 441, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:43:54,123 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=12.0 2023-06-20 14:44:00,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=735234.0, ans=0.125 2023-06-20 14:44:27,780 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-20 14:44:36,269 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.996e+02 3.563e+02 4.226e+02 6.619e+02, threshold=7.127e+02, percent-clipped=0.0 2023-06-20 14:44:49,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=735354.0, ans=0.125 2023-06-20 14:45:01,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. limit=12.0 2023-06-20 14:45:14,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=735414.0, ans=0.2 2023-06-20 14:45:16,874 INFO [train.py:996] (0/4) Epoch 5, batch 600, loss[loss=0.2994, simple_loss=0.3722, pruned_loss=0.1133, over 21675.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3308, pruned_loss=0.09363, over 4081761.94 frames. ], batch size: 230, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:45:19,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=735474.0, ans=0.125 2023-06-20 14:45:25,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=735474.0, ans=0.2 2023-06-20 14:45:32,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=735474.0, ans=0.0 2023-06-20 14:45:50,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=735534.0, ans=0.1 2023-06-20 14:46:48,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=735714.0, ans=0.125 2023-06-20 14:46:51,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=735714.0, ans=0.1 2023-06-20 14:46:59,135 INFO [train.py:996] (0/4) Epoch 5, batch 650, loss[loss=0.2385, simple_loss=0.3288, pruned_loss=0.07409, over 19950.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3332, pruned_loss=0.09265, over 4125160.88 frames. 
], batch size: 703, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:47:01,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=735774.0, ans=0.125 2023-06-20 14:47:19,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=735834.0, ans=0.07 2023-06-20 14:47:35,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=735894.0, ans=0.0 2023-06-20 14:47:43,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=735894.0, ans=0.2 2023-06-20 14:47:48,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=735894.0, ans=0.0 2023-06-20 14:47:58,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.943e+02 3.470e+02 4.276e+02 7.197e+02, threshold=6.941e+02, percent-clipped=1.0 2023-06-20 14:48:12,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=735954.0, ans=0.125 2023-06-20 14:48:40,867 INFO [train.py:996] (0/4) Epoch 5, batch 700, loss[loss=0.302, simple_loss=0.364, pruned_loss=0.12, over 21886.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.334, pruned_loss=0.09354, over 4167270.67 frames. ], batch size: 124, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:48:41,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=736074.0, ans=0.07 2023-06-20 14:49:02,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=736134.0, ans=0.0 2023-06-20 14:49:23,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=736194.0, ans=0.125 2023-06-20 14:49:36,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-20 14:50:21,193 INFO [train.py:996] (0/4) Epoch 5, batch 750, loss[loss=0.2525, simple_loss=0.3153, pruned_loss=0.09488, over 21945.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3314, pruned_loss=0.09336, over 4193472.09 frames. ], batch size: 316, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:50:30,365 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.00 vs. limit=10.0 2023-06-20 14:51:22,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 2.979e+02 3.412e+02 4.334e+02 7.194e+02, threshold=6.824e+02, percent-clipped=1.0 2023-06-20 14:52:04,155 INFO [train.py:996] (0/4) Epoch 5, batch 800, loss[loss=0.2601, simple_loss=0.3086, pruned_loss=0.1058, over 21347.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3284, pruned_loss=0.09377, over 4203818.82 frames. ], batch size: 471, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:53:46,652 INFO [train.py:996] (0/4) Epoch 5, batch 850, loss[loss=0.2033, simple_loss=0.2599, pruned_loss=0.07331, over 17053.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3251, pruned_loss=0.09309, over 4212070.33 frames. 
], batch size: 60, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:54:06,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=737034.0, ans=0.2 2023-06-20 14:54:09,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=737034.0, ans=0.2 2023-06-20 14:54:43,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-20 14:54:57,018 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.235e+02 2.960e+02 3.459e+02 4.425e+02 7.988e+02, threshold=6.917e+02, percent-clipped=3.0 2023-06-20 14:55:32,816 INFO [train.py:996] (0/4) Epoch 5, batch 900, loss[loss=0.2531, simple_loss=0.3075, pruned_loss=0.09938, over 21695.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3241, pruned_loss=0.09233, over 4230852.25 frames. ], batch size: 230, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:55:41,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=737274.0, ans=0.125 2023-06-20 14:55:46,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=737274.0, ans=0.125 2023-06-20 14:56:08,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=737394.0, ans=0.0 2023-06-20 14:57:13,171 INFO [train.py:996] (0/4) Epoch 5, batch 950, loss[loss=0.2457, simple_loss=0.3324, pruned_loss=0.07948, over 21763.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3233, pruned_loss=0.09217, over 4243085.19 frames. ], batch size: 351, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:58:18,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.716e+02 3.192e+02 3.705e+02 5.586e+02, threshold=6.385e+02, percent-clipped=0.0 2023-06-20 14:58:30,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=737754.0, ans=0.125 2023-06-20 14:58:54,050 INFO [train.py:996] (0/4) Epoch 5, batch 1000, loss[loss=0.2991, simple_loss=0.3637, pruned_loss=0.1173, over 21339.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3248, pruned_loss=0.09318, over 4258672.48 frames. ], batch size: 176, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:59:53,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=737994.0, ans=0.125 2023-06-20 15:00:37,276 INFO [train.py:996] (0/4) Epoch 5, batch 1050, loss[loss=0.2249, simple_loss=0.2915, pruned_loss=0.07914, over 21667.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3244, pruned_loss=0.09373, over 4260585.96 frames. ], batch size: 263, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 15:00:53,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=738174.0, ans=0.125 2023-06-20 15:01:24,912 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.03 vs. 
limit=15.0 2023-06-20 15:01:45,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.880e+02 3.329e+02 4.012e+02 6.640e+02, threshold=6.657e+02, percent-clipped=1.0 2023-06-20 15:02:06,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=738414.0, ans=0.125 2023-06-20 15:02:21,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=738474.0, ans=0.5 2023-06-20 15:02:22,439 INFO [train.py:996] (0/4) Epoch 5, batch 1100, loss[loss=0.2511, simple_loss=0.3028, pruned_loss=0.09964, over 20334.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3245, pruned_loss=0.09327, over 4262645.69 frames. ], batch size: 703, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 15:02:50,262 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.78 vs. limit=6.0 2023-06-20 15:03:30,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=738654.0, ans=0.125 2023-06-20 15:03:56,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=738714.0, ans=0.0 2023-06-20 15:04:10,625 INFO [train.py:996] (0/4) Epoch 5, batch 1150, loss[loss=0.3219, simple_loss=0.3698, pruned_loss=0.137, over 21717.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3255, pruned_loss=0.09343, over 4269701.69 frames. ], batch size: 507, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:04:52,604 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-20 15:05:15,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=738954.0, ans=0.0 2023-06-20 15:05:18,295 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.853e+02 3.547e+02 4.502e+02 9.164e+02, threshold=7.095e+02, percent-clipped=7.0 2023-06-20 15:06:05,928 INFO [train.py:996] (0/4) Epoch 5, batch 1200, loss[loss=0.2317, simple_loss=0.2765, pruned_loss=0.09347, over 20807.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3267, pruned_loss=0.09384, over 4271685.69 frames. ], batch size: 608, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:06:42,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=739194.0, ans=0.2 2023-06-20 15:07:02,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=739254.0, ans=0.125 2023-06-20 15:07:13,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=739254.0, ans=10.0 2023-06-20 15:07:37,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=739314.0, ans=0.09899494936611666 2023-06-20 15:07:48,975 INFO [train.py:996] (0/4) Epoch 5, batch 1250, loss[loss=0.2677, simple_loss=0.3366, pruned_loss=0.09938, over 21186.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3289, pruned_loss=0.09432, over 4269779.83 frames. 
], batch size: 143, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:08:13,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.29 vs. limit=10.0 2023-06-20 15:08:44,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=739554.0, ans=0.125 2023-06-20 15:08:45,309 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.897e+02 3.475e+02 4.049e+02 7.365e+02, threshold=6.950e+02, percent-clipped=1.0 2023-06-20 15:09:33,277 INFO [train.py:996] (0/4) Epoch 5, batch 1300, loss[loss=0.262, simple_loss=0.3318, pruned_loss=0.09613, over 21727.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3303, pruned_loss=0.09532, over 4278131.63 frames. ], batch size: 298, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:09:43,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=739674.0, ans=0.1 2023-06-20 15:09:51,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=739734.0, ans=0.95 2023-06-20 15:11:16,833 INFO [train.py:996] (0/4) Epoch 5, batch 1350, loss[loss=0.2809, simple_loss=0.3385, pruned_loss=0.1117, over 21850.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3314, pruned_loss=0.09573, over 4287323.34 frames. ], batch size: 107, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:11:18,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=739974.0, ans=0.2 2023-06-20 15:11:20,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=739974.0, ans=0.125 2023-06-20 15:11:27,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=739974.0, ans=0.1 2023-06-20 15:11:52,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=740094.0, ans=0.1 2023-06-20 15:12:03,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=740094.0, ans=0.2 2023-06-20 15:12:05,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=740094.0, ans=0.0 2023-06-20 15:12:12,800 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 2.977e+02 3.601e+02 4.425e+02 6.870e+02, threshold=7.202e+02, percent-clipped=0.0 2023-06-20 15:12:49,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=740214.0, ans=0.2 2023-06-20 15:13:00,166 INFO [train.py:996] (0/4) Epoch 5, batch 1400, loss[loss=0.2544, simple_loss=0.3158, pruned_loss=0.09649, over 21691.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3305, pruned_loss=0.09675, over 4282990.16 frames. ], batch size: 282, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:13:54,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=740454.0, ans=0.125 2023-06-20 15:14:43,282 INFO [train.py:996] (0/4) Epoch 5, batch 1450, loss[loss=0.2675, simple_loss=0.3192, pruned_loss=0.1079, over 21663.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3293, pruned_loss=0.09727, over 4289616.16 frames. 
], batch size: 414, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:14:55,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=740574.0, ans=0.2 2023-06-20 15:15:33,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=740694.0, ans=0.04949747468305833 2023-06-20 15:15:37,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 2.878e+02 3.343e+02 3.968e+02 7.161e+02, threshold=6.685e+02, percent-clipped=0.0 2023-06-20 15:15:45,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.79 vs. limit=5.0 2023-06-20 15:16:23,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=740874.0, ans=0.0 2023-06-20 15:16:24,299 INFO [train.py:996] (0/4) Epoch 5, batch 1500, loss[loss=0.2226, simple_loss=0.317, pruned_loss=0.06413, over 21621.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3307, pruned_loss=0.09851, over 4294938.18 frames. ], batch size: 263, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:18:10,016 INFO [train.py:996] (0/4) Epoch 5, batch 1550, loss[loss=0.2251, simple_loss=0.3121, pruned_loss=0.06902, over 21765.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3291, pruned_loss=0.0974, over 4288917.44 frames. ], batch size: 298, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:18:14,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=741174.0, ans=0.125 2023-06-20 15:18:19,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=741174.0, ans=0.5 2023-06-20 15:18:24,948 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-20 15:18:28,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.78 vs. limit=22.5 2023-06-20 15:19:09,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.929e+02 3.428e+02 4.021e+02 6.196e+02, threshold=6.855e+02, percent-clipped=0.0 2023-06-20 15:19:33,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=741414.0, ans=0.125 2023-06-20 15:19:50,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=741474.0, ans=0.125 2023-06-20 15:19:51,940 INFO [train.py:996] (0/4) Epoch 5, batch 1600, loss[loss=0.3844, simple_loss=0.4408, pruned_loss=0.1639, over 21438.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3272, pruned_loss=0.09583, over 4288849.78 frames. ], batch size: 507, lr: 6.56e-03, grad_scale: 32.0 2023-06-20 15:21:17,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-06-20 15:21:36,268 INFO [train.py:996] (0/4) Epoch 5, batch 1650, loss[loss=0.2304, simple_loss=0.2984, pruned_loss=0.08114, over 21719.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3287, pruned_loss=0.09611, over 4280230.25 frames. 
], batch size: 263, lr: 6.56e-03, grad_scale: 32.0 2023-06-20 15:22:03,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=741834.0, ans=0.125 2023-06-20 15:22:51,506 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.044e+02 3.519e+02 4.349e+02 7.461e+02, threshold=7.039e+02, percent-clipped=1.0 2023-06-20 15:22:52,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=741954.0, ans=0.125 2023-06-20 15:22:58,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=741954.0, ans=0.125 2023-06-20 15:23:19,953 INFO [train.py:996] (0/4) Epoch 5, batch 1700, loss[loss=0.2317, simple_loss=0.3064, pruned_loss=0.07847, over 21665.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3314, pruned_loss=0.09601, over 4285663.88 frames. ], batch size: 263, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:23:27,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=742074.0, ans=0.125 2023-06-20 15:24:21,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=742194.0, ans=15.0 2023-06-20 15:24:27,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=742254.0, ans=0.1 2023-06-20 15:24:37,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=742254.0, ans=0.125 2023-06-20 15:25:00,575 INFO [train.py:996] (0/4) Epoch 5, batch 1750, loss[loss=0.2879, simple_loss=0.37, pruned_loss=0.103, over 21472.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.331, pruned_loss=0.09437, over 4277799.59 frames. ], batch size: 471, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:25:11,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=742374.0, ans=0.2 2023-06-20 15:26:13,033 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.877e+02 3.732e+02 4.371e+02 8.077e+02, threshold=7.464e+02, percent-clipped=3.0 2023-06-20 15:26:46,884 INFO [train.py:996] (0/4) Epoch 5, batch 1800, loss[loss=0.1922, simple_loss=0.2448, pruned_loss=0.06981, over 21834.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3273, pruned_loss=0.09133, over 4272760.36 frames. ], batch size: 118, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:27:02,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=742674.0, ans=0.0 2023-06-20 15:27:41,086 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. 
limit=10.0 2023-06-20 15:27:46,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=742794.0, ans=0.125 2023-06-20 15:28:14,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=742914.0, ans=0.0 2023-06-20 15:28:18,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=742914.0, ans=0.0 2023-06-20 15:28:19,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=742914.0, ans=0.125 2023-06-20 15:28:30,596 INFO [train.py:996] (0/4) Epoch 5, batch 1850, loss[loss=0.2453, simple_loss=0.3262, pruned_loss=0.0822, over 21811.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.33, pruned_loss=0.08931, over 4276669.91 frames. ], batch size: 282, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:29:07,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=743034.0, ans=0.2 2023-06-20 15:29:12,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=743034.0, ans=0.125 2023-06-20 15:29:19,133 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:29:25,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=743094.0, ans=0.125 2023-06-20 15:29:40,615 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.730e+02 3.304e+02 4.070e+02 7.005e+02, threshold=6.608e+02, percent-clipped=0.0 2023-06-20 15:30:03,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=743214.0, ans=0.125 2023-06-20 15:30:14,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-20 15:30:15,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=743214.0, ans=0.0 2023-06-20 15:30:18,508 INFO [train.py:996] (0/4) Epoch 5, batch 1900, loss[loss=0.2385, simple_loss=0.3185, pruned_loss=0.07919, over 21806.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3287, pruned_loss=0.08887, over 4280445.89 frames. ], batch size: 351, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:32:02,712 INFO [train.py:996] (0/4) Epoch 5, batch 1950, loss[loss=0.2729, simple_loss=0.3711, pruned_loss=0.08734, over 19775.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3263, pruned_loss=0.08997, over 4284495.36 frames. ], batch size: 703, lr: 6.55e-03, grad_scale: 16.0 2023-06-20 15:32:20,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=743574.0, ans=0.125 2023-06-20 15:32:31,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=743634.0, ans=0.1 2023-06-20 15:32:44,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=743634.0, ans=0.0 2023-06-20 15:32:46,536 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.33 vs. 
limit=22.5 2023-06-20 15:33:10,456 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 3.018e+02 3.404e+02 4.237e+02 8.100e+02, threshold=6.807e+02, percent-clipped=3.0 2023-06-20 15:33:38,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=743814.0, ans=0.0 2023-06-20 15:33:49,096 INFO [train.py:996] (0/4) Epoch 5, batch 2000, loss[loss=0.2956, simple_loss=0.3791, pruned_loss=0.1061, over 21841.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3213, pruned_loss=0.0894, over 4274437.94 frames. ], batch size: 372, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:34:33,376 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-124000.pt 2023-06-20 15:35:24,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=744114.0, ans=0.0 2023-06-20 15:35:33,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=12.0 2023-06-20 15:35:34,276 INFO [train.py:996] (0/4) Epoch 5, batch 2050, loss[loss=0.2337, simple_loss=0.2972, pruned_loss=0.08511, over 21510.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.322, pruned_loss=0.08886, over 4273528.63 frames. ], batch size: 389, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:35:38,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=744174.0, ans=22.5 2023-06-20 15:36:00,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=744234.0, ans=0.125 2023-06-20 15:36:20,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=744294.0, ans=0.0 2023-06-20 15:36:39,227 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.193e+02 2.809e+02 3.320e+02 4.171e+02 6.443e+02, threshold=6.640e+02, percent-clipped=0.0 2023-06-20 15:36:51,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=744414.0, ans=0.09899494936611666 2023-06-20 15:36:59,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=744414.0, ans=0.0 2023-06-20 15:37:12,253 INFO [train.py:996] (0/4) Epoch 5, batch 2100, loss[loss=0.2541, simple_loss=0.324, pruned_loss=0.09216, over 21690.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.325, pruned_loss=0.0909, over 4279473.43 frames. ], batch size: 112, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:38:12,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=744594.0, ans=0.09899494936611666 2023-06-20 15:39:06,022 INFO [train.py:996] (0/4) Epoch 5, batch 2150, loss[loss=0.2251, simple_loss=0.3325, pruned_loss=0.0589, over 21200.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3258, pruned_loss=0.09185, over 4278782.00 frames. 
], batch size: 548, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:39:15,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=744774.0, ans=0.125 2023-06-20 15:39:58,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=744894.0, ans=0.025 2023-06-20 15:40:11,003 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 3.145e+02 3.806e+02 5.141e+02 9.299e+02, threshold=7.611e+02, percent-clipped=10.0 2023-06-20 15:40:22,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0 2023-06-20 15:40:28,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=745014.0, ans=0.125 2023-06-20 15:40:44,306 INFO [train.py:996] (0/4) Epoch 5, batch 2200, loss[loss=0.2234, simple_loss=0.289, pruned_loss=0.07892, over 16868.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3301, pruned_loss=0.09254, over 4274351.12 frames. ], batch size: 64, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:41:23,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=745134.0, ans=0.0 2023-06-20 15:41:26,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=745134.0, ans=0.125 2023-06-20 15:41:27,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2023-06-20 15:41:31,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=745194.0, ans=0.125 2023-06-20 15:42:00,443 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-20 15:42:02,334 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0 2023-06-20 15:42:15,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=745314.0, ans=0.0 2023-06-20 15:42:21,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=745314.0, ans=0.2 2023-06-20 15:42:40,442 INFO [train.py:996] (0/4) Epoch 5, batch 2250, loss[loss=0.2693, simple_loss=0.332, pruned_loss=0.1033, over 21542.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.327, pruned_loss=0.09088, over 4273604.26 frames. 
], batch size: 441, lr: 6.55e-03, grad_scale: 16.0 2023-06-20 15:42:52,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=745374.0, ans=0.5 2023-06-20 15:43:09,432 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:43:21,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=745494.0, ans=0.07 2023-06-20 15:43:27,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=745494.0, ans=0.125 2023-06-20 15:43:28,479 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-20 15:43:29,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=745494.0, ans=0.125 2023-06-20 15:43:46,946 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.695e+02 3.113e+02 3.722e+02 7.366e+02, threshold=6.226e+02, percent-clipped=0.0 2023-06-20 15:44:23,186 INFO [train.py:996] (0/4) Epoch 5, batch 2300, loss[loss=0.2419, simple_loss=0.2952, pruned_loss=0.09433, over 21385.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3208, pruned_loss=0.09083, over 4271947.09 frames. ], batch size: 211, lr: 6.54e-03, grad_scale: 16.0 2023-06-20 15:44:26,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=745674.0, ans=0.1 2023-06-20 15:45:00,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=745794.0, ans=0.125 2023-06-20 15:45:09,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=745794.0, ans=0.125 2023-06-20 15:46:06,143 INFO [train.py:996] (0/4) Epoch 5, batch 2350, loss[loss=0.2382, simple_loss=0.2925, pruned_loss=0.09194, over 21213.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3168, pruned_loss=0.09068, over 4269246.60 frames. ], batch size: 159, lr: 6.54e-03, grad_scale: 16.0 2023-06-20 15:46:58,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=746094.0, ans=0.0 2023-06-20 15:47:06,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=746154.0, ans=0.0 2023-06-20 15:47:12,565 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.470e+02 3.194e+02 3.678e+02 4.653e+02 7.153e+02, threshold=7.356e+02, percent-clipped=3.0 2023-06-20 15:47:36,301 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-20 15:47:50,578 INFO [train.py:996] (0/4) Epoch 5, batch 2400, loss[loss=0.2698, simple_loss=0.3324, pruned_loss=0.1036, over 21809.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3186, pruned_loss=0.09222, over 4262182.24 frames. ], batch size: 282, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:48:29,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.26 vs. 
limit=22.5 2023-06-20 15:48:31,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=746394.0, ans=0.125 2023-06-20 15:49:36,798 INFO [train.py:996] (0/4) Epoch 5, batch 2450, loss[loss=0.2454, simple_loss=0.308, pruned_loss=0.09141, over 21600.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3267, pruned_loss=0.09502, over 4250834.43 frames. ], batch size: 247, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:50:47,931 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 3.072e+02 3.519e+02 4.096e+02 7.474e+02, threshold=7.039e+02, percent-clipped=1.0 2023-06-20 15:50:49,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=746754.0, ans=0.1 2023-06-20 15:51:17,874 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.12 vs. limit=15.0 2023-06-20 15:51:19,735 INFO [train.py:996] (0/4) Epoch 5, batch 2500, loss[loss=0.2658, simple_loss=0.3302, pruned_loss=0.1007, over 21994.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3243, pruned_loss=0.09474, over 4255271.95 frames. ], batch size: 103, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:51:29,488 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:51:37,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=746934.0, ans=0.125 2023-06-20 15:52:02,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-20 15:52:25,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=747054.0, ans=0.125 2023-06-20 15:52:38,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2023-06-20 15:52:49,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=747114.0, ans=0.04949747468305833 2023-06-20 15:52:58,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=747114.0, ans=0.2 2023-06-20 15:53:02,362 INFO [train.py:996] (0/4) Epoch 5, batch 2550, loss[loss=0.2469, simple_loss=0.31, pruned_loss=0.0919, over 21721.00 frames. ], tot_loss[loss=0.255, simple_loss=0.322, pruned_loss=0.09404, over 4265662.82 frames. ], batch size: 282, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:53:20,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=747234.0, ans=0.125 2023-06-20 15:54:06,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 2.818e+02 3.202e+02 3.816e+02 6.010e+02, threshold=6.403e+02, percent-clipped=0.0 2023-06-20 15:54:24,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=747414.0, ans=0.125 2023-06-20 15:54:44,539 INFO [train.py:996] (0/4) Epoch 5, batch 2600, loss[loss=0.2868, simple_loss=0.3473, pruned_loss=0.1132, over 21737.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3244, pruned_loss=0.09468, over 4262119.37 frames. 
], batch size: 247, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:55:29,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=747594.0, ans=0.125 2023-06-20 15:55:34,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=747594.0, ans=0.5 2023-06-20 15:55:44,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=747654.0, ans=0.125 2023-06-20 15:56:04,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=747714.0, ans=0.0 2023-06-20 15:56:27,548 INFO [train.py:996] (0/4) Epoch 5, batch 2650, loss[loss=0.2667, simple_loss=0.3314, pruned_loss=0.101, over 21452.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3254, pruned_loss=0.09642, over 4264510.83 frames. ], batch size: 131, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:57:39,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.025e+02 3.669e+02 4.481e+02 6.938e+02, threshold=7.338e+02, percent-clipped=2.0 2023-06-20 15:58:10,860 INFO [train.py:996] (0/4) Epoch 5, batch 2700, loss[loss=0.2916, simple_loss=0.3621, pruned_loss=0.1106, over 21576.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3241, pruned_loss=0.09549, over 4264685.60 frames. ], batch size: 473, lr: 6.53e-03, grad_scale: 32.0 2023-06-20 15:58:12,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=748074.0, ans=0.125 2023-06-20 15:58:35,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=748134.0, ans=0.0 2023-06-20 15:59:05,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-20 15:59:52,554 INFO [train.py:996] (0/4) Epoch 5, batch 2750, loss[loss=0.3025, simple_loss=0.343, pruned_loss=0.131, over 21792.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.322, pruned_loss=0.09515, over 4271522.83 frames. 
], batch size: 508, lr: 6.53e-03, grad_scale: 32.0 2023-06-20 15:59:52,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=748374.0, ans=0.0 2023-06-20 16:00:22,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=748434.0, ans=0.2 2023-06-20 16:00:41,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=748494.0, ans=0.0 2023-06-20 16:00:44,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=748494.0, ans=0.1 2023-06-20 16:00:44,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=748494.0, ans=0.125 2023-06-20 16:01:05,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=748554.0, ans=0.125 2023-06-20 16:01:06,316 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 3.120e+02 3.768e+02 4.829e+02 8.745e+02, threshold=7.536e+02, percent-clipped=5.0 2023-06-20 16:01:25,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=748614.0, ans=0.1 2023-06-20 16:01:36,410 INFO [train.py:996] (0/4) Epoch 5, batch 2800, loss[loss=0.2455, simple_loss=0.3014, pruned_loss=0.09477, over 21315.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3279, pruned_loss=0.09682, over 4273115.84 frames. ], batch size: 131, lr: 6.53e-03, grad_scale: 32.0 2023-06-20 16:01:50,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=748674.0, ans=0.2 2023-06-20 16:01:52,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=748674.0, ans=0.125 2023-06-20 16:01:56,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=22.5 2023-06-20 16:03:02,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=748854.0, ans=0.125 2023-06-20 16:03:27,354 INFO [train.py:996] (0/4) Epoch 5, batch 2850, loss[loss=0.1878, simple_loss=0.2435, pruned_loss=0.0661, over 21430.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3304, pruned_loss=0.09663, over 4266693.04 frames. ], batch size: 194, lr: 6.53e-03, grad_scale: 16.0 2023-06-20 16:03:36,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=748974.0, ans=0.125 2023-06-20 16:04:42,073 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 3.197e+02 3.952e+02 5.010e+02 9.652e+02, threshold=7.904e+02, percent-clipped=7.0 2023-06-20 16:05:10,333 INFO [train.py:996] (0/4) Epoch 5, batch 2900, loss[loss=0.1902, simple_loss=0.2449, pruned_loss=0.06773, over 20732.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3263, pruned_loss=0.09526, over 4261366.24 frames. 
], batch size: 608, lr: 6.53e-03, grad_scale: 16.0 2023-06-20 16:05:19,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=749274.0, ans=0.1 2023-06-20 16:05:37,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=749334.0, ans=0.125 2023-06-20 16:06:21,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.32 vs. limit=15.0 2023-06-20 16:06:52,779 INFO [train.py:996] (0/4) Epoch 5, batch 2950, loss[loss=0.1878, simple_loss=0.2443, pruned_loss=0.06568, over 21905.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3271, pruned_loss=0.09571, over 4270782.08 frames. ], batch size: 98, lr: 6.53e-03, grad_scale: 16.0 2023-06-20 16:06:58,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=749574.0, ans=6.0 2023-06-20 16:07:21,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-20 16:07:23,557 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. limit=6.0 2023-06-20 16:07:29,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=749694.0, ans=0.1 2023-06-20 16:08:05,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=749754.0, ans=0.125 2023-06-20 16:08:10,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.876e+02 3.268e+02 4.025e+02 7.097e+02, threshold=6.536e+02, percent-clipped=0.0 2023-06-20 16:08:22,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=749814.0, ans=0.125 2023-06-20 16:08:36,216 INFO [train.py:996] (0/4) Epoch 5, batch 3000, loss[loss=0.2141, simple_loss=0.2817, pruned_loss=0.07329, over 21628.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3304, pruned_loss=0.09639, over 4277624.28 frames. ], batch size: 263, lr: 6.53e-03, grad_scale: 8.0 2023-06-20 16:08:36,218 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 16:08:55,134 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2579, simple_loss=0.3533, pruned_loss=0.08129, over 1796401.00 frames. 2023-06-20 16:08:55,134 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-20 16:10:12,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=750054.0, ans=0.2 2023-06-20 16:10:22,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=750114.0, ans=0.0 2023-06-20 16:10:31,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=750114.0, ans=0.5 2023-06-20 16:10:39,780 INFO [train.py:996] (0/4) Epoch 5, batch 3050, loss[loss=0.3165, simple_loss=0.3772, pruned_loss=0.1279, over 21543.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3307, pruned_loss=0.09488, over 4272219.95 frames. 
], batch size: 508, lr: 6.52e-03, grad_scale: 8.0 2023-06-20 16:11:58,462 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-20 16:11:58,929 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.749e+02 3.160e+02 3.983e+02 6.617e+02, threshold=6.320e+02, percent-clipped=1.0 2023-06-20 16:12:25,614 INFO [train.py:996] (0/4) Epoch 5, batch 3100, loss[loss=0.217, simple_loss=0.3068, pruned_loss=0.06356, over 21577.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3309, pruned_loss=0.09392, over 4273824.44 frames. ], batch size: 230, lr: 6.52e-03, grad_scale: 8.0 2023-06-20 16:14:15,808 INFO [train.py:996] (0/4) Epoch 5, batch 3150, loss[loss=0.2954, simple_loss=0.3646, pruned_loss=0.1131, over 21571.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3326, pruned_loss=0.0946, over 4273326.52 frames. ], batch size: 389, lr: 6.52e-03, grad_scale: 8.0 2023-06-20 16:14:21,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=750774.0, ans=0.0 2023-06-20 16:14:23,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0 2023-06-20 16:14:32,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=750834.0, ans=0.0 2023-06-20 16:14:52,959 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:15:09,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=750954.0, ans=0.1 2023-06-20 16:15:28,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.671e+02 3.239e+02 3.868e+02 6.706e+02, threshold=6.479e+02, percent-clipped=2.0 2023-06-20 16:15:30,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=750954.0, ans=0.2 2023-06-20 16:15:55,788 INFO [train.py:996] (0/4) Epoch 5, batch 3200, loss[loss=0.339, simple_loss=0.4014, pruned_loss=0.1383, over 21429.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3329, pruned_loss=0.09395, over 4277007.69 frames. ], batch size: 507, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:16:56,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=751254.0, ans=0.125 2023-06-20 16:17:20,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=751254.0, ans=0.0 2023-06-20 16:17:29,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=751314.0, ans=0.125 2023-06-20 16:17:40,160 INFO [train.py:996] (0/4) Epoch 5, batch 3250, loss[loss=0.2455, simple_loss=0.3064, pruned_loss=0.09231, over 21855.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3337, pruned_loss=0.09567, over 4276703.94 frames. ], batch size: 98, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:18:15,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.29 vs. 
limit=15.0 2023-06-20 16:19:02,279 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.071e+02 3.454e+02 4.020e+02 6.852e+02, threshold=6.907e+02, percent-clipped=1.0 2023-06-20 16:19:21,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=751614.0, ans=0.0 2023-06-20 16:19:23,823 INFO [train.py:996] (0/4) Epoch 5, batch 3300, loss[loss=0.2557, simple_loss=0.3094, pruned_loss=0.1011, over 15188.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3305, pruned_loss=0.09637, over 4274059.30 frames. ], batch size: 61, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:20:07,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=751794.0, ans=0.125 2023-06-20 16:20:08,042 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-20 16:20:20,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=751794.0, ans=0.0 2023-06-20 16:20:39,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=751854.0, ans=0.0 2023-06-20 16:20:45,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=751854.0, ans=0.125 2023-06-20 16:21:09,025 INFO [train.py:996] (0/4) Epoch 5, batch 3350, loss[loss=0.2513, simple_loss=0.3021, pruned_loss=0.1003, over 20058.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3328, pruned_loss=0.09663, over 4276969.51 frames. ], batch size: 704, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:21:55,661 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=12.0 2023-06-20 16:22:31,293 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.245e+02 3.953e+02 4.970e+02 1.057e+03, threshold=7.906e+02, percent-clipped=6.0 2023-06-20 16:22:57,027 INFO [train.py:996] (0/4) Epoch 5, batch 3400, loss[loss=0.3017, simple_loss=0.4093, pruned_loss=0.09706, over 20758.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3349, pruned_loss=0.09807, over 4275430.53 frames. ], batch size: 607, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:22:57,519 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:23:10,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=752274.0, ans=0.0 2023-06-20 16:23:30,312 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:23:35,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=752394.0, ans=0.0 2023-06-20 16:24:09,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=752454.0, ans=0.125 2023-06-20 16:24:16,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.85 vs. limit=10.0 2023-06-20 16:24:41,925 INFO [train.py:996] (0/4) Epoch 5, batch 3450, loss[loss=0.247, simple_loss=0.3052, pruned_loss=0.0944, over 21503.00 frames. 
], tot_loss[loss=0.2616, simple_loss=0.3297, pruned_loss=0.09679, over 4276389.32 frames. ], batch size: 230, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:25:11,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=752634.0, ans=0.1 2023-06-20 16:25:49,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=752754.0, ans=0.125 2023-06-20 16:26:05,744 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 3.288e+02 3.795e+02 4.947e+02 8.128e+02, threshold=7.589e+02, percent-clipped=1.0 2023-06-20 16:26:26,694 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=12.0 2023-06-20 16:26:27,183 INFO [train.py:996] (0/4) Epoch 5, batch 3500, loss[loss=0.3022, simple_loss=0.371, pruned_loss=0.1166, over 21707.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3386, pruned_loss=0.101, over 4281932.51 frames. ], batch size: 351, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:26:39,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=752874.0, ans=0.125 2023-06-20 16:26:41,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=752874.0, ans=0.0 2023-06-20 16:27:43,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=753054.0, ans=0.125 2023-06-20 16:27:47,194 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=15.0 2023-06-20 16:28:10,895 INFO [train.py:996] (0/4) Epoch 5, batch 3550, loss[loss=0.2199, simple_loss=0.2836, pruned_loss=0.0781, over 21616.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3412, pruned_loss=0.1026, over 4284372.04 frames. ], batch size: 247, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:29:07,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=753294.0, ans=0.0 2023-06-20 16:29:15,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=753294.0, ans=0.125 2023-06-20 16:29:17,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=753354.0, ans=0.2 2023-06-20 16:29:20,836 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-20 16:29:30,316 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:29:35,091 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 2.975e+02 3.456e+02 4.245e+02 7.529e+02, threshold=6.912e+02, percent-clipped=0.0 2023-06-20 16:29:37,986 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. 
limit=15.0 2023-06-20 16:29:42,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=753414.0, ans=0.125 2023-06-20 16:29:52,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=753414.0, ans=0.125 2023-06-20 16:30:01,839 INFO [train.py:996] (0/4) Epoch 5, batch 3600, loss[loss=0.2885, simple_loss=0.3393, pruned_loss=0.1189, over 21477.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3364, pruned_loss=0.1019, over 4275434.02 frames. ], batch size: 211, lr: 6.51e-03, grad_scale: 32.0 2023-06-20 16:30:02,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=753474.0, ans=0.125 2023-06-20 16:30:11,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=753474.0, ans=0.0 2023-06-20 16:30:25,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=753534.0, ans=0.2 2023-06-20 16:30:28,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=753534.0, ans=0.125 2023-06-20 16:30:54,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-20 16:31:18,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=22.5 2023-06-20 16:31:46,501 INFO [train.py:996] (0/4) Epoch 5, batch 3650, loss[loss=0.2523, simple_loss=0.3364, pruned_loss=0.08409, over 21665.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3373, pruned_loss=0.1015, over 4280415.04 frames. ], batch size: 389, lr: 6.51e-03, grad_scale: 32.0 2023-06-20 16:32:02,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=753774.0, ans=0.1 2023-06-20 16:32:41,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-20 16:32:41,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=753894.0, ans=0.125 2023-06-20 16:32:53,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=753954.0, ans=0.0 2023-06-20 16:33:02,489 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 3.073e+02 3.466e+02 4.352e+02 7.872e+02, threshold=6.931e+02, percent-clipped=1.0 2023-06-20 16:33:29,249 INFO [train.py:996] (0/4) Epoch 5, batch 3700, loss[loss=0.2646, simple_loss=0.3196, pruned_loss=0.1048, over 21818.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3354, pruned_loss=0.1002, over 4275787.27 frames. 
], batch size: 102, lr: 6.51e-03, grad_scale: 32.0 2023-06-20 16:33:31,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=754074.0, ans=0.125 2023-06-20 16:33:43,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=754074.0, ans=0.125 2023-06-20 16:34:33,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=754254.0, ans=0.125 2023-06-20 16:34:35,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=754254.0, ans=0.2 2023-06-20 16:34:39,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=22.5 2023-06-20 16:34:39,015 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-20 16:34:41,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=754254.0, ans=0.125 2023-06-20 16:35:18,249 INFO [train.py:996] (0/4) Epoch 5, batch 3750, loss[loss=0.197, simple_loss=0.2688, pruned_loss=0.06265, over 21501.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.333, pruned_loss=0.09987, over 4272136.51 frames. ], batch size: 212, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:35:40,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=754434.0, ans=0.0 2023-06-20 16:35:55,681 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-06-20 16:36:36,443 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.776e+02 3.273e+02 3.853e+02 7.611e+02, threshold=6.547e+02, percent-clipped=1.0 2023-06-20 16:36:52,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=754614.0, ans=0.2 2023-06-20 16:37:07,432 INFO [train.py:996] (0/4) Epoch 5, batch 3800, loss[loss=0.3377, simple_loss=0.391, pruned_loss=0.1422, over 21488.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3313, pruned_loss=0.09745, over 4273888.29 frames. ], batch size: 509, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:37:32,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=754734.0, ans=0.05 2023-06-20 16:38:42,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=754914.0, ans=0.1 2023-06-20 16:38:45,000 INFO [train.py:996] (0/4) Epoch 5, batch 3850, loss[loss=0.2264, simple_loss=0.2826, pruned_loss=0.08509, over 21901.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3287, pruned_loss=0.0977, over 4256325.12 frames. 
], batch size: 107, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:39:36,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=755094.0, ans=0.125 2023-06-20 16:39:36,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=755094.0, ans=0.125 2023-06-20 16:39:51,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=755154.0, ans=0.125 2023-06-20 16:40:01,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.004e+02 3.563e+02 4.477e+02 7.369e+02, threshold=7.126e+02, percent-clipped=2.0 2023-06-20 16:40:21,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=755214.0, ans=0.125 2023-06-20 16:40:24,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=755214.0, ans=0.125 2023-06-20 16:40:27,658 INFO [train.py:996] (0/4) Epoch 5, batch 3900, loss[loss=0.2364, simple_loss=0.2975, pruned_loss=0.08764, over 21828.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3233, pruned_loss=0.09691, over 4263948.08 frames. ], batch size: 332, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:41:02,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=755334.0, ans=0.2 2023-06-20 16:41:04,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=755334.0, ans=0.125 2023-06-20 16:42:07,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.70 vs. limit=15.0 2023-06-20 16:42:10,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=755574.0, ans=0.0 2023-06-20 16:42:11,403 INFO [train.py:996] (0/4) Epoch 5, batch 3950, loss[loss=0.2747, simple_loss=0.359, pruned_loss=0.0952, over 21517.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3259, pruned_loss=0.09613, over 4271563.98 frames. ], batch size: 471, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:42:14,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=15.0 2023-06-20 16:43:33,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.896e+02 3.603e+02 4.962e+02 8.484e+02, threshold=7.206e+02, percent-clipped=4.0 2023-06-20 16:43:52,926 INFO [train.py:996] (0/4) Epoch 5, batch 4000, loss[loss=0.224, simple_loss=0.2859, pruned_loss=0.08107, over 21638.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3183, pruned_loss=0.09172, over 4273080.34 frames. ], batch size: 282, lr: 6.50e-03, grad_scale: 32.0 2023-06-20 16:45:11,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=756054.0, ans=0.0 2023-06-20 16:45:15,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=756054.0, ans=0.125 2023-06-20 16:45:41,090 INFO [train.py:996] (0/4) Epoch 5, batch 4050, loss[loss=0.2891, simple_loss=0.3533, pruned_loss=0.1125, over 21498.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3197, pruned_loss=0.091, over 4271306.49 frames. 
], batch size: 507, lr: 6.50e-03, grad_scale: 32.0 2023-06-20 16:46:57,471 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.627e+02 3.094e+02 3.740e+02 6.411e+02, threshold=6.189e+02, percent-clipped=0.0 2023-06-20 16:47:23,001 INFO [train.py:996] (0/4) Epoch 5, batch 4100, loss[loss=0.2338, simple_loss=0.315, pruned_loss=0.07629, over 16817.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3194, pruned_loss=0.0911, over 4269335.45 frames. ], batch size: 60, lr: 6.50e-03, grad_scale: 32.0 2023-06-20 16:47:28,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=756474.0, ans=0.1 2023-06-20 16:48:10,760 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=15.0 2023-06-20 16:48:30,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=756654.0, ans=0.125 2023-06-20 16:49:06,428 INFO [train.py:996] (0/4) Epoch 5, batch 4150, loss[loss=0.2192, simple_loss=0.2985, pruned_loss=0.06999, over 21637.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3185, pruned_loss=0.08701, over 4275300.62 frames. ], batch size: 263, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:49:21,856 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0 2023-06-20 16:49:31,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=756834.0, ans=0.125 2023-06-20 16:50:01,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=756894.0, ans=0.125 2023-06-20 16:50:15,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=756954.0, ans=0.0 2023-06-20 16:50:26,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.812e+02 3.283e+02 4.437e+02 7.520e+02, threshold=6.566e+02, percent-clipped=5.0 2023-06-20 16:50:55,602 INFO [train.py:996] (0/4) Epoch 5, batch 4200, loss[loss=0.2791, simple_loss=0.3606, pruned_loss=0.09882, over 21537.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3206, pruned_loss=0.08812, over 4268411.73 frames. ], batch size: 389, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:51:17,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=757134.0, ans=0.1 2023-06-20 16:51:23,872 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-20 16:51:34,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-20 16:52:39,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=757374.0, ans=0.1 2023-06-20 16:52:40,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. 
limit=10.0 2023-06-20 16:52:40,377 INFO [train.py:996] (0/4) Epoch 5, batch 4250, loss[loss=0.2975, simple_loss=0.3752, pruned_loss=0.1099, over 21738.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3278, pruned_loss=0.09067, over 4266992.30 frames. ], batch size: 118, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 16:52:54,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=757374.0, ans=0.0 2023-06-20 16:53:22,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=757494.0, ans=0.0 2023-06-20 16:53:25,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=757494.0, ans=0.0 2023-06-20 16:54:03,359 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 2.999e+02 3.547e+02 4.279e+02 1.014e+03, threshold=7.094e+02, percent-clipped=7.0 2023-06-20 16:54:22,774 INFO [train.py:996] (0/4) Epoch 5, batch 4300, loss[loss=0.2133, simple_loss=0.3005, pruned_loss=0.06302, over 21403.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3329, pruned_loss=0.09227, over 4258996.68 frames. ], batch size: 211, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 16:55:51,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=757914.0, ans=0.125 2023-06-20 16:56:01,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=757914.0, ans=0.0 2023-06-20 16:56:17,268 INFO [train.py:996] (0/4) Epoch 5, batch 4350, loss[loss=0.244, simple_loss=0.3054, pruned_loss=0.09133, over 21450.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3296, pruned_loss=0.09141, over 4250349.96 frames. ], batch size: 212, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 16:57:12,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=758094.0, ans=0.125 2023-06-20 16:57:12,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-20 16:57:16,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-20 16:57:22,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=758154.0, ans=0.0 2023-06-20 16:57:23,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=758154.0, ans=0.1 2023-06-20 16:57:31,422 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.864e+02 3.148e+02 3.712e+02 7.836e+02, threshold=6.297e+02, percent-clipped=1.0 2023-06-20 16:57:39,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=758214.0, ans=0.1 2023-06-20 16:57:54,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=758214.0, ans=0.125 2023-06-20 16:57:57,344 INFO [train.py:996] (0/4) Epoch 5, batch 4400, loss[loss=0.2689, simple_loss=0.3573, pruned_loss=0.09023, over 21790.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3263, pruned_loss=0.09043, over 4251971.38 frames. 
], batch size: 282, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 16:57:58,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=1.98 vs. limit=12.0 2023-06-20 16:58:05,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-20 16:59:16,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-20 16:59:27,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=758514.0, ans=0.125 2023-06-20 16:59:34,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=758514.0, ans=0.125 2023-06-20 16:59:41,604 INFO [train.py:996] (0/4) Epoch 5, batch 4450, loss[loss=0.2144, simple_loss=0.2757, pruned_loss=0.07661, over 20756.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3325, pruned_loss=0.09183, over 4253017.01 frames. ], batch size: 609, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 16:59:42,851 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=22.5 2023-06-20 17:00:04,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=758574.0, ans=0.2 2023-06-20 17:00:05,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=758574.0, ans=0.0 2023-06-20 17:00:35,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=758694.0, ans=0.2 2023-06-20 17:00:52,464 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-20 17:01:08,067 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 2.910e+02 3.386e+02 4.171e+02 6.417e+02, threshold=6.772e+02, percent-clipped=2.0 2023-06-20 17:01:17,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=758814.0, ans=0.125 2023-06-20 17:01:22,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=758814.0, ans=0.0 2023-06-20 17:01:32,543 INFO [train.py:996] (0/4) Epoch 5, batch 4500, loss[loss=0.2405, simple_loss=0.3184, pruned_loss=0.08125, over 21179.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3349, pruned_loss=0.09415, over 4265288.00 frames. ], batch size: 143, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 17:01:55,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.58 vs. limit=15.0 2023-06-20 17:02:58,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=759114.0, ans=0.125 2023-06-20 17:03:09,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=759114.0, ans=0.125 2023-06-20 17:03:17,253 INFO [train.py:996] (0/4) Epoch 5, batch 4550, loss[loss=0.2183, simple_loss=0.2799, pruned_loss=0.07841, over 21184.00 frames. 
], tot_loss[loss=0.2628, simple_loss=0.3371, pruned_loss=0.09426, over 4268553.31 frames. ], batch size: 548, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 17:03:31,941 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-20 17:04:20,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-06-20 17:04:21,552 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:04:34,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.512e+02 3.034e+02 3.831e+02 5.015e+02 1.154e+03, threshold=7.663e+02, percent-clipped=6.0 2023-06-20 17:05:00,194 INFO [train.py:996] (0/4) Epoch 5, batch 4600, loss[loss=0.29, simple_loss=0.3628, pruned_loss=0.1086, over 21475.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3388, pruned_loss=0.09498, over 4273675.41 frames. ], batch size: 211, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 17:05:10,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=759474.0, ans=0.125 2023-06-20 17:05:20,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=759534.0, ans=0.04949747468305833 2023-06-20 17:05:21,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=759534.0, ans=0.0 2023-06-20 17:05:22,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=759534.0, ans=0.125 2023-06-20 17:05:36,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=759594.0, ans=0.125 2023-06-20 17:05:48,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=759594.0, ans=10.0 2023-06-20 17:05:59,891 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0 2023-06-20 17:06:32,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=759714.0, ans=0.125 2023-06-20 17:06:34,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=759714.0, ans=0.0 2023-06-20 17:06:37,014 INFO [train.py:996] (0/4) Epoch 5, batch 4650, loss[loss=0.1827, simple_loss=0.2592, pruned_loss=0.05314, over 21758.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.333, pruned_loss=0.09358, over 4278567.66 frames. 
], batch size: 298, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:06:37,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=759774.0, ans=0.0 2023-06-20 17:07:14,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=759894.0, ans=0.0 2023-06-20 17:07:57,679 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.721e+02 3.118e+02 3.617e+02 7.093e+02, threshold=6.237e+02, percent-clipped=0.0 2023-06-20 17:08:12,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=760014.0, ans=0.0 2023-06-20 17:08:14,598 INFO [train.py:996] (0/4) Epoch 5, batch 4700, loss[loss=0.226, simple_loss=0.2905, pruned_loss=0.08076, over 21401.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.324, pruned_loss=0.09145, over 4274878.46 frames. ], batch size: 131, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:09:02,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=760194.0, ans=0.125 2023-06-20 17:09:08,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.76 vs. limit=15.0 2023-06-20 17:09:11,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=760194.0, ans=0.125 2023-06-20 17:09:56,849 INFO [train.py:996] (0/4) Epoch 5, batch 4750, loss[loss=0.2071, simple_loss=0.2715, pruned_loss=0.07131, over 21594.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3198, pruned_loss=0.09099, over 4269574.24 frames. ], batch size: 231, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:10:08,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=760374.0, ans=0.125 2023-06-20 17:10:27,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=760434.0, ans=0.125 2023-06-20 17:11:18,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.439e+02 2.865e+02 3.322e+02 3.733e+02 5.818e+02, threshold=6.645e+02, percent-clipped=0.0 2023-06-20 17:11:34,499 INFO [train.py:996] (0/4) Epoch 5, batch 4800, loss[loss=0.2886, simple_loss=0.3446, pruned_loss=0.1162, over 21920.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3209, pruned_loss=0.09189, over 4278133.54 frames. ], batch size: 316, lr: 6.48e-03, grad_scale: 32.0 2023-06-20 17:11:40,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2023-06-20 17:12:29,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=760794.0, ans=0.0 2023-06-20 17:12:34,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=760854.0, ans=0.125 2023-06-20 17:12:38,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.98 vs. 
limit=22.5 2023-06-20 17:13:07,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=760914.0, ans=0.125 2023-06-20 17:13:09,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.88 vs. limit=6.0 2023-06-20 17:13:15,581 INFO [train.py:996] (0/4) Epoch 5, batch 4850, loss[loss=0.2753, simple_loss=0.3439, pruned_loss=0.1034, over 21839.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3211, pruned_loss=0.09198, over 4277388.31 frames. ], batch size: 332, lr: 6.48e-03, grad_scale: 32.0 2023-06-20 17:13:59,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=761094.0, ans=0.0 2023-06-20 17:14:41,980 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.733e+02 3.099e+02 3.561e+02 5.577e+02, threshold=6.198e+02, percent-clipped=0.0 2023-06-20 17:14:58,692 INFO [train.py:996] (0/4) Epoch 5, batch 4900, loss[loss=0.2941, simple_loss=0.3516, pruned_loss=0.1183, over 21270.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3241, pruned_loss=0.09295, over 4279956.62 frames. ], batch size: 143, lr: 6.48e-03, grad_scale: 32.0 2023-06-20 17:15:00,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=761274.0, ans=0.2 2023-06-20 17:15:38,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=761334.0, ans=0.125 2023-06-20 17:15:45,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=761394.0, ans=0.0 2023-06-20 17:16:33,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=761514.0, ans=0.05 2023-06-20 17:16:41,598 INFO [train.py:996] (0/4) Epoch 5, batch 4950, loss[loss=0.2094, simple_loss=0.301, pruned_loss=0.05894, over 21737.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3257, pruned_loss=0.08998, over 4279248.59 frames. ], batch size: 298, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:17:54,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=761754.0, ans=0.1 2023-06-20 17:18:08,222 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.799e+02 3.225e+02 3.689e+02 6.231e+02, threshold=6.450e+02, percent-clipped=1.0 2023-06-20 17:18:22,791 INFO [train.py:996] (0/4) Epoch 5, batch 5000, loss[loss=0.2587, simple_loss=0.3276, pruned_loss=0.09492, over 21517.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3242, pruned_loss=0.08778, over 4283207.58 frames. 
], batch size: 194, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:18:40,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=761934.0, ans=0.125 2023-06-20 17:19:05,861 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:19:07,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=761994.0, ans=0.125 2023-06-20 17:19:35,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=762054.0, ans=0.125 2023-06-20 17:20:03,405 INFO [train.py:996] (0/4) Epoch 5, batch 5050, loss[loss=0.2608, simple_loss=0.3219, pruned_loss=0.09988, over 21687.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3247, pruned_loss=0.08961, over 4292099.76 frames. ], batch size: 230, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:21:00,782 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=15.0 2023-06-20 17:21:14,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-06-20 17:21:31,170 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.896e+02 3.588e+02 4.285e+02 7.263e+02, threshold=7.176e+02, percent-clipped=2.0 2023-06-20 17:21:36,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-06-20 17:21:39,307 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:21:45,607 INFO [train.py:996] (0/4) Epoch 5, batch 5100, loss[loss=0.2536, simple_loss=0.3183, pruned_loss=0.09449, over 21300.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3238, pruned_loss=0.09039, over 4292854.22 frames. ], batch size: 176, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:22:12,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0 2023-06-20 17:22:40,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=762594.0, ans=0.0 2023-06-20 17:22:44,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=762654.0, ans=0.0 2023-06-20 17:23:29,430 INFO [train.py:996] (0/4) Epoch 5, batch 5150, loss[loss=0.2595, simple_loss=0.3145, pruned_loss=0.1023, over 21607.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3227, pruned_loss=0.09207, over 4297018.05 frames. 
], batch size: 263, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:23:35,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=762774.0, ans=0.0 2023-06-20 17:24:51,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=762954.0, ans=0.0 2023-06-20 17:24:57,421 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.960e+02 3.348e+02 3.858e+02 5.752e+02, threshold=6.696e+02, percent-clipped=0.0 2023-06-20 17:25:13,105 INFO [train.py:996] (0/4) Epoch 5, batch 5200, loss[loss=0.2511, simple_loss=0.3312, pruned_loss=0.08546, over 21275.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3242, pruned_loss=0.09267, over 4288132.54 frames. ], batch size: 159, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:25:26,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=763074.0, ans=0.125 2023-06-20 17:25:46,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=763134.0, ans=0.125 2023-06-20 17:26:31,394 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:26:39,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=763314.0, ans=0.125 2023-06-20 17:26:45,259 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.91 vs. limit=22.5 2023-06-20 17:26:46,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=763314.0, ans=0.5 2023-06-20 17:26:54,747 INFO [train.py:996] (0/4) Epoch 5, batch 5250, loss[loss=0.2561, simple_loss=0.3362, pruned_loss=0.08804, over 21734.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3262, pruned_loss=0.09007, over 4287986.38 frames. ], batch size: 298, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:27:31,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=763434.0, ans=0.0 2023-06-20 17:28:21,659 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.952e+02 3.364e+02 4.524e+02 6.907e+02, threshold=6.729e+02, percent-clipped=2.0 2023-06-20 17:28:36,620 INFO [train.py:996] (0/4) Epoch 5, batch 5300, loss[loss=0.2784, simple_loss=0.3453, pruned_loss=0.1058, over 21900.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3253, pruned_loss=0.09011, over 4291621.81 frames. 
], batch size: 333, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:28:56,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=763734.0, ans=0.0 2023-06-20 17:29:15,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=763734.0, ans=0.125 2023-06-20 17:29:28,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=763794.0, ans=0.125 2023-06-20 17:30:09,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=763914.0, ans=0.2 2023-06-20 17:30:11,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=763914.0, ans=0.0 2023-06-20 17:30:22,067 INFO [train.py:996] (0/4) Epoch 5, batch 5350, loss[loss=0.2464, simple_loss=0.3052, pruned_loss=0.0938, over 21298.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3232, pruned_loss=0.0915, over 4297289.61 frames. ], batch size: 159, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:30:24,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=763974.0, ans=0.0 2023-06-20 17:30:37,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=764034.0, ans=0.0 2023-06-20 17:31:28,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-20 17:31:36,784 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:31:44,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.105e+02 3.554e+02 4.280e+02 7.043e+02, threshold=7.109e+02, percent-clipped=1.0 2023-06-20 17:31:44,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=764214.0, ans=0.2 2023-06-20 17:32:01,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=764214.0, ans=0.125 2023-06-20 17:32:03,840 INFO [train.py:996] (0/4) Epoch 5, batch 5400, loss[loss=0.2766, simple_loss=0.3408, pruned_loss=0.1062, over 21545.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3249, pruned_loss=0.09328, over 4288593.45 frames. ], batch size: 471, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:32:37,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-20 17:33:45,602 INFO [train.py:996] (0/4) Epoch 5, batch 5450, loss[loss=0.2729, simple_loss=0.3563, pruned_loss=0.09473, over 19917.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3266, pruned_loss=0.09136, over 4290067.36 frames. 
], batch size: 702, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:34:31,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=764634.0, ans=0.1 2023-06-20 17:34:34,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=764694.0, ans=0.125 2023-06-20 17:34:44,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=764694.0, ans=0.125 2023-06-20 17:34:50,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=764754.0, ans=0.125 2023-06-20 17:34:54,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=764754.0, ans=0.2 2023-06-20 17:35:13,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.778e+02 2.553e+02 3.012e+02 3.713e+02 8.478e+02, threshold=6.025e+02, percent-clipped=4.0 2023-06-20 17:35:19,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=764814.0, ans=0.125 2023-06-20 17:35:21,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=764814.0, ans=0.125 2023-06-20 17:35:34,666 INFO [train.py:996] (0/4) Epoch 5, batch 5500, loss[loss=0.2652, simple_loss=0.3523, pruned_loss=0.08904, over 21750.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3301, pruned_loss=0.0884, over 4283678.45 frames. ], batch size: 332, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:35:54,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=764934.0, ans=0.125 2023-06-20 17:36:55,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=765114.0, ans=0.1 2023-06-20 17:37:11,799 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-06-20 17:37:17,278 INFO [train.py:996] (0/4) Epoch 5, batch 5550, loss[loss=0.202, simple_loss=0.2889, pruned_loss=0.05757, over 21644.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.329, pruned_loss=0.08562, over 4283162.57 frames. ], batch size: 263, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:37:44,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=765234.0, ans=0.1 2023-06-20 17:38:27,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.41 vs. limit=15.0 2023-06-20 17:38:48,570 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.754e+02 3.445e+02 4.644e+02 7.344e+02, threshold=6.889e+02, percent-clipped=6.0 2023-06-20 17:39:13,783 INFO [train.py:996] (0/4) Epoch 5, batch 5600, loss[loss=0.2745, simple_loss=0.3303, pruned_loss=0.1093, over 19935.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3255, pruned_loss=0.08271, over 4280896.89 frames. 
], batch size: 703, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:40:44,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=765714.0, ans=0.2 2023-06-20 17:40:55,626 INFO [train.py:996] (0/4) Epoch 5, batch 5650, loss[loss=0.2425, simple_loss=0.3075, pruned_loss=0.0887, over 21218.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3295, pruned_loss=0.0856, over 4273441.99 frames. ], batch size: 143, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:40:59,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=765774.0, ans=0.125 2023-06-20 17:41:27,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=765834.0, ans=0.125 2023-06-20 17:41:43,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=765894.0, ans=0.125 2023-06-20 17:42:17,850 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 3.200e+02 3.767e+02 5.001e+02 8.912e+02, threshold=7.534e+02, percent-clipped=5.0 2023-06-20 17:42:38,884 INFO [train.py:996] (0/4) Epoch 5, batch 5700, loss[loss=0.2083, simple_loss=0.2941, pruned_loss=0.06123, over 21706.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3305, pruned_loss=0.08814, over 4277538.81 frames. ], batch size: 298, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:42:39,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=766074.0, ans=0.125 2023-06-20 17:42:43,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=766074.0, ans=0.0 2023-06-20 17:42:46,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=766074.0, ans=0.05 2023-06-20 17:42:54,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=766074.0, ans=0.0 2023-06-20 17:43:11,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=766134.0, ans=0.125 2023-06-20 17:43:29,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=766194.0, ans=0.125 2023-06-20 17:43:43,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=766254.0, ans=0.2 2023-06-20 17:44:02,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=766254.0, ans=0.125 2023-06-20 17:44:28,613 INFO [train.py:996] (0/4) Epoch 5, batch 5750, loss[loss=0.1892, simple_loss=0.2723, pruned_loss=0.05304, over 21453.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3268, pruned_loss=0.08564, over 4273647.48 frames. 
], batch size: 195, lr: 6.46e-03, grad_scale: 16.0 2023-06-20 17:45:16,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=766494.0, ans=0.125 2023-06-20 17:45:42,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=766554.0, ans=0.125 2023-06-20 17:45:53,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.737e+02 3.307e+02 4.353e+02 7.537e+02, threshold=6.613e+02, percent-clipped=1.0 2023-06-20 17:46:11,492 INFO [train.py:996] (0/4) Epoch 5, batch 5800, loss[loss=0.2538, simple_loss=0.3271, pruned_loss=0.09022, over 21394.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3255, pruned_loss=0.08364, over 4280685.47 frames. ], batch size: 548, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:47:48,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=766914.0, ans=0.0 2023-06-20 17:47:54,122 INFO [train.py:996] (0/4) Epoch 5, batch 5850, loss[loss=0.2051, simple_loss=0.2917, pruned_loss=0.05923, over 21421.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3226, pruned_loss=0.07921, over 4282190.58 frames. ], batch size: 211, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:48:40,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.47 vs. limit=22.5 2023-06-20 17:48:40,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=767094.0, ans=0.125 2023-06-20 17:48:56,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=767154.0, ans=0.125 2023-06-20 17:49:21,580 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 2.196e+02 2.438e+02 2.861e+02 4.189e+02, threshold=4.877e+02, percent-clipped=0.0 2023-06-20 17:49:34,144 INFO [train.py:996] (0/4) Epoch 5, batch 5900, loss[loss=0.2471, simple_loss=0.3219, pruned_loss=0.08615, over 21887.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.315, pruned_loss=0.07414, over 4272902.43 frames. ], batch size: 124, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:49:58,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=767334.0, ans=0.125 2023-06-20 17:50:41,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=767454.0, ans=0.125 2023-06-20 17:50:51,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=767454.0, ans=0.09899494936611666 2023-06-20 17:51:11,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=767514.0, ans=0.1 2023-06-20 17:51:14,333 INFO [train.py:996] (0/4) Epoch 5, batch 5950, loss[loss=0.2396, simple_loss=0.3069, pruned_loss=0.08611, over 21834.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3148, pruned_loss=0.07707, over 4277323.43 frames. ], batch size: 124, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:51:15,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. 
limit=22.5 2023-06-20 17:51:24,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=767574.0, ans=0.125 2023-06-20 17:52:07,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=767694.0, ans=0.125 2023-06-20 17:52:07,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=767694.0, ans=0.125 2023-06-20 17:52:42,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 3.088e+02 3.712e+02 4.428e+02 7.411e+02, threshold=7.424e+02, percent-clipped=12.0 2023-06-20 17:53:01,230 INFO [train.py:996] (0/4) Epoch 5, batch 6000, loss[loss=0.2402, simple_loss=0.2969, pruned_loss=0.09172, over 21888.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.312, pruned_loss=0.08073, over 4281571.56 frames. ], batch size: 373, lr: 6.45e-03, grad_scale: 32.0 2023-06-20 17:53:01,231 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 17:53:19,509 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2687, simple_loss=0.3621, pruned_loss=0.08766, over 1796401.00 frames. 2023-06-20 17:53:19,509 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-20 17:53:19,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=767874.0, ans=0.025 2023-06-20 17:54:04,151 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-128000.pt 2023-06-20 17:55:11,110 INFO [train.py:996] (0/4) Epoch 5, batch 6050, loss[loss=0.2251, simple_loss=0.2822, pruned_loss=0.08401, over 21907.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3072, pruned_loss=0.08252, over 4287482.72 frames. ], batch size: 113, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:55:33,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=768234.0, ans=0.0 2023-06-20 17:56:30,091 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.566e+02 3.006e+02 3.910e+02 6.691e+02, threshold=6.013e+02, percent-clipped=0.0 2023-06-20 17:56:46,433 INFO [train.py:996] (0/4) Epoch 5, batch 6100, loss[loss=0.2466, simple_loss=0.3049, pruned_loss=0.09417, over 21205.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3038, pruned_loss=0.08155, over 4276401.69 frames. ], batch size: 608, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:57:04,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=768534.0, ans=0.07 2023-06-20 17:58:20,833 INFO [train.py:996] (0/4) Epoch 5, batch 6150, loss[loss=0.2114, simple_loss=0.2824, pruned_loss=0.07019, over 21191.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3074, pruned_loss=0.08421, over 4287332.15 frames. ], batch size: 176, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:59:07,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=768894.0, ans=0.0 2023-06-20 17:59:24,025 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. 
limit=15.0 2023-06-20 17:59:51,302 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.776e+02 3.188e+02 3.842e+02 5.972e+02, threshold=6.377e+02, percent-clipped=0.0 2023-06-20 18:00:02,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=769014.0, ans=0.1 2023-06-20 18:00:08,593 INFO [train.py:996] (0/4) Epoch 5, batch 6200, loss[loss=0.247, simple_loss=0.3316, pruned_loss=0.08114, over 21837.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.313, pruned_loss=0.08472, over 4280054.35 frames. ], batch size: 351, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:00:15,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=769074.0, ans=0.2 2023-06-20 18:01:12,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=769254.0, ans=0.125 2023-06-20 18:01:12,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=769254.0, ans=0.125 2023-06-20 18:01:47,614 INFO [train.py:996] (0/4) Epoch 5, batch 6250, loss[loss=0.2849, simple_loss=0.3785, pruned_loss=0.09567, over 21737.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3178, pruned_loss=0.08417, over 4279116.82 frames. ], batch size: 298, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:02:05,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=769434.0, ans=0.125 2023-06-20 18:03:10,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=769614.0, ans=0.0 2023-06-20 18:03:11,656 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.664e+02 3.120e+02 3.845e+02 7.013e+02, threshold=6.240e+02, percent-clipped=3.0 2023-06-20 18:03:28,157 INFO [train.py:996] (0/4) Epoch 5, batch 6300, loss[loss=0.2298, simple_loss=0.3074, pruned_loss=0.07613, over 20020.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3212, pruned_loss=0.08348, over 4286552.08 frames. ], batch size: 703, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:03:36,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=769674.0, ans=0.125 2023-06-20 18:03:36,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=769674.0, ans=0.1 2023-06-20 18:03:42,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.98 vs. limit=10.0 2023-06-20 18:04:39,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=769854.0, ans=0.2 2023-06-20 18:04:53,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=769914.0, ans=0.125 2023-06-20 18:05:06,353 INFO [train.py:996] (0/4) Epoch 5, batch 6350, loss[loss=0.2716, simple_loss=0.3394, pruned_loss=0.1019, over 21480.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3261, pruned_loss=0.08832, over 4292649.90 frames. ], batch size: 211, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:05:49,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.77 vs. 
limit=6.0 2023-06-20 18:06:34,766 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 2.998e+02 3.510e+02 4.011e+02 7.678e+02, threshold=7.020e+02, percent-clipped=3.0 2023-06-20 18:06:40,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=770214.0, ans=0.125 2023-06-20 18:06:51,036 INFO [train.py:996] (0/4) Epoch 5, batch 6400, loss[loss=0.3143, simple_loss=0.3724, pruned_loss=0.1281, over 21940.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3335, pruned_loss=0.09348, over 4283289.71 frames. ], batch size: 372, lr: 6.44e-03, grad_scale: 32.0 2023-06-20 18:06:53,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=770274.0, ans=0.125 2023-06-20 18:07:17,642 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=15.0 2023-06-20 18:07:35,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-06-20 18:07:38,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-20 18:08:00,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=770454.0, ans=0.125 2023-06-20 18:08:03,948 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-06-20 18:08:33,240 INFO [train.py:996] (0/4) Epoch 5, batch 6450, loss[loss=0.2382, simple_loss=0.3311, pruned_loss=0.07266, over 21673.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3359, pruned_loss=0.09331, over 4284533.12 frames. ], batch size: 298, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:08:35,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=770574.0, ans=0.125 2023-06-20 18:08:42,843 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2023-06-20 18:08:58,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=770634.0, ans=0.0 2023-06-20 18:09:04,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=770634.0, ans=0.0 2023-06-20 18:09:36,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=770754.0, ans=0.125 2023-06-20 18:09:44,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=770754.0, ans=0.035 2023-06-20 18:10:05,745 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.819e+02 3.379e+02 4.009e+02 7.496e+02, threshold=6.759e+02, percent-clipped=3.0 2023-06-20 18:10:15,982 INFO [train.py:996] (0/4) Epoch 5, batch 6500, loss[loss=0.243, simple_loss=0.2992, pruned_loss=0.0934, over 21675.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3278, pruned_loss=0.09115, over 4274910.74 frames. 
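
The checkpoint.py line above (Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-128000.pt) writes a file that can be inspected offline with plain PyTorch. This is only a minimal sketch; the key names it prints are whatever train.py stored, which this log does not document.

import torch

# Sketch: open the checkpoint saved above and list its top-level contents.
# Assumes the file holds a plain dict; key names such as "model" or "optimizer"
# are not confirmed by the log, so we simply print whatever is there.
ckpt = torch.load("zipformer/exp_L_small_causal/checkpoint-128000.pt",
                  map_location="cpu")
for key, value in ckpt.items():
    if isinstance(value, dict):
        print(f"{key}: dict with {len(value)} entries")
    else:
        print(f"{key}: {type(value).__name__}")
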
], batch size: 282, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:10:16,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-20 18:10:27,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=770874.0, ans=0.125 2023-06-20 18:10:53,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=770934.0, ans=0.125 2023-06-20 18:11:27,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=771054.0, ans=0.1 2023-06-20 18:11:28,039 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.79 vs. limit=15.0 2023-06-20 18:11:37,509 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-20 18:12:02,384 INFO [train.py:996] (0/4) Epoch 5, batch 6550, loss[loss=0.2693, simple_loss=0.342, pruned_loss=0.09828, over 21743.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3267, pruned_loss=0.09015, over 4272953.70 frames. ], batch size: 414, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:12:03,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=771174.0, ans=0.07 2023-06-20 18:12:08,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=771174.0, ans=0.125 2023-06-20 18:12:27,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=771234.0, ans=0.125 2023-06-20 18:12:40,447 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=12.0 2023-06-20 18:13:06,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=771354.0, ans=0.04949747468305833 2023-06-20 18:13:28,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=771414.0, ans=0.125 2023-06-20 18:13:31,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.772e+02 3.430e+02 4.140e+02 7.576e+02, threshold=6.860e+02, percent-clipped=2.0 2023-06-20 18:13:49,248 INFO [train.py:996] (0/4) Epoch 5, batch 6600, loss[loss=0.2415, simple_loss=0.3004, pruned_loss=0.09128, over 21867.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3204, pruned_loss=0.08954, over 4265644.98 frames. 
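
Throughout this stretch the three logged figures are mutually consistent with the total being the pruned loss plus half of the simple loss; for the batch 6600 summary just above, 0.5 x 0.3204 + 0.08954 is about 0.2498. A quick check in Python, with the 0.5 weight inferred from the logged numbers rather than read from the recipe:

# Values copied from the "Epoch 5, batch 6600" tot_loss entry above.
simple_loss = 0.3204
pruned_loss = 0.08954
combined = 0.5 * simple_loss + pruned_loss   # assumed weighting, matches the log
print(round(combined, 4))                    # 0.2497, vs. the logged loss=0.2498
assert abs(combined - 0.2498) < 1e-3
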
], batch size: 373, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:13:49,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=771474.0, ans=0.125 2023-06-20 18:14:24,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=771534.0, ans=0.125 2023-06-20 18:14:24,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=771534.0, ans=0.125 2023-06-20 18:15:31,873 INFO [train.py:996] (0/4) Epoch 5, batch 6650, loss[loss=0.2222, simple_loss=0.277, pruned_loss=0.08369, over 21543.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3138, pruned_loss=0.08713, over 4268091.62 frames. ], batch size: 132, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:15:44,141 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-20 18:16:21,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-20 18:16:22,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=771894.0, ans=0.125 2023-06-20 18:16:39,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-06-20 18:16:54,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=22.54 vs. limit=15.0 2023-06-20 18:17:04,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.627e+02 3.141e+02 4.437e+02 8.167e+02, threshold=6.282e+02, percent-clipped=6.0 2023-06-20 18:17:12,621 INFO [train.py:996] (0/4) Epoch 5, batch 6700, loss[loss=0.2236, simple_loss=0.2814, pruned_loss=0.08291, over 21289.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3081, pruned_loss=0.08637, over 4272460.01 frames. ], batch size: 159, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:17:13,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=772074.0, ans=0.5 2023-06-20 18:17:13,654 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=22.5 2023-06-20 18:17:39,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-20 18:17:40,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=772134.0, ans=0.1 2023-06-20 18:17:52,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=772194.0, ans=0.015 2023-06-20 18:18:09,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=772254.0, ans=0.0 2023-06-20 18:18:52,308 INFO [train.py:996] (0/4) Epoch 5, batch 6750, loss[loss=0.2145, simple_loss=0.2848, pruned_loss=0.07211, over 21287.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.306, pruned_loss=0.08687, over 4262856.88 frames. 
], batch size: 194, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:19:05,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=772374.0, ans=0.125 2023-06-20 18:19:07,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-20 18:19:08,901 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:20:15,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 2.981e+02 3.524e+02 4.393e+02 7.808e+02, threshold=7.048e+02, percent-clipped=4.0 2023-06-20 18:20:33,916 INFO [train.py:996] (0/4) Epoch 5, batch 6800, loss[loss=0.2293, simple_loss=0.29, pruned_loss=0.08429, over 21599.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3076, pruned_loss=0.08889, over 4261939.48 frames. ], batch size: 263, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:21:41,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=772914.0, ans=0.125 2023-06-20 18:22:04,525 INFO [train.py:996] (0/4) Epoch 5, batch 6850, loss[loss=0.2761, simple_loss=0.3359, pruned_loss=0.1081, over 21852.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3066, pruned_loss=0.08972, over 4264221.99 frames. ], batch size: 118, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:22:31,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=773034.0, ans=0.125 2023-06-20 18:23:25,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=773214.0, ans=0.1 2023-06-20 18:23:34,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=773214.0, ans=0.125 2023-06-20 18:23:35,269 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 2.843e+02 3.245e+02 3.952e+02 6.473e+02, threshold=6.489e+02, percent-clipped=0.0 2023-06-20 18:23:53,377 INFO [train.py:996] (0/4) Epoch 5, batch 6900, loss[loss=0.2526, simple_loss=0.3529, pruned_loss=0.0762, over 21608.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3088, pruned_loss=0.08959, over 4264568.65 frames. ], batch size: 471, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:24:50,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=773454.0, ans=0.0 2023-06-20 18:24:56,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=773454.0, ans=0.125 2023-06-20 18:25:21,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=773514.0, ans=0.125 2023-06-20 18:25:28,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=773514.0, ans=0.0 2023-06-20 18:25:37,587 INFO [train.py:996] (0/4) Epoch 5, batch 6950, loss[loss=0.2738, simple_loss=0.3382, pruned_loss=0.1046, over 21659.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3109, pruned_loss=0.08705, over 4264373.56 frames. 
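
In the [optim.py:471] lines of this run the reported threshold consistently equals Clipping_scale times the median grad-norm quartile; for the most recent entry above, 2.0 x 3.245e+02 is about 6.489e+02. Below is a minimal sketch of such a scheme in plain PyTorch, written as an assumption about the behaviour rather than a copy of optim.py.

from collections import deque

import torch

# Sketch: track recent gradient norms, clip when the current norm exceeds
# clipping_scale times their median, and report quartile / threshold /
# percent-clipped figures like the ones the log prints.  Illustrative only.
class MedianGradClipper:
    def __init__(self, clipping_scale: float = 2.0, history: int = 1000):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)
        self.steps = 0
        self.clipped = 0

    def __call__(self, model: torch.nn.Module) -> None:
        params = [p for p in model.parameters() if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.detach().norm() for p in params]))
        self.norms.append(norm.item())
        quartiles = torch.quantile(torch.tensor(list(self.norms)),
                                   torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * quartiles[2]   # 2.0 x median, as logged
        self.steps += 1
        if norm > threshold:
            self.clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)
        print("grad-norm quartiles", [round(q, 3) for q in quartiles.tolist()],
              "threshold", round(threshold.item(), 3),
              "percent-clipped", round(100.0 * self.clipped / self.steps, 1))
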
], batch size: 351, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:26:12,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=773634.0, ans=15.0 2023-06-20 18:26:27,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=773694.0, ans=0.125 2023-06-20 18:26:56,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=773814.0, ans=0.2 2023-06-20 18:27:07,323 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.00 vs. limit=15.0 2023-06-20 18:27:11,063 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.901e+02 2.952e+02 3.293e+02 4.286e+02 8.056e+02, threshold=6.585e+02, percent-clipped=5.0 2023-06-20 18:27:11,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=773814.0, ans=0.125 2023-06-20 18:27:19,076 INFO [train.py:996] (0/4) Epoch 5, batch 7000, loss[loss=0.2325, simple_loss=0.2993, pruned_loss=0.08282, over 21344.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3147, pruned_loss=0.08971, over 4269465.20 frames. ], batch size: 131, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:27:46,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=773934.0, ans=0.125 2023-06-20 18:28:00,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-20 18:29:08,844 INFO [train.py:996] (0/4) Epoch 5, batch 7050, loss[loss=0.2366, simple_loss=0.3507, pruned_loss=0.06129, over 19753.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.313, pruned_loss=0.08795, over 4259316.42 frames. ], batch size: 703, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:29:21,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=774174.0, ans=0.125 2023-06-20 18:29:37,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=774234.0, ans=0.04949747468305833 2023-06-20 18:30:44,367 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.864e+02 3.382e+02 4.312e+02 8.915e+02, threshold=6.764e+02, percent-clipped=3.0 2023-06-20 18:30:51,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=774474.0, ans=0.125 2023-06-20 18:30:51,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=774474.0, ans=0.0 2023-06-20 18:30:52,693 INFO [train.py:996] (0/4) Epoch 5, batch 7100, loss[loss=0.2112, simple_loss=0.2867, pruned_loss=0.06784, over 21671.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.318, pruned_loss=0.08916, over 4262363.84 frames. 
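
The ScheduledFloat lines (name=..., batch_count=..., ans=...) read as float hyperparameters, dropout probabilities and skip rates among them, whose value is looked up from the current batch_count. A minimal sketch of one way to express that as piecewise-linear breakpoints follows; the breakpoints are invented for illustration and are not the recipe's actual schedules, though the 0.1 endpoint echoes the *.dropout_p values printed earlier in the log.

import bisect

# Sketch: a float hyperparameter interpolated from (batch_count, value)
# breakpoints.  The breakpoints below are hypothetical.
class PiecewiseLinearFloat:
    def __init__(self, *points):
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

dropout_p = PiecewiseLinearFloat((0.0, 0.3), (20000.0, 0.1))  # hypothetical schedule
print(dropout_p(774474.0))   # 0.1: past the last breakpoint the value stays at its floor
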
], batch size: 247, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:31:32,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=774594.0, ans=0.125 2023-06-20 18:32:00,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=774654.0, ans=0.125 2023-06-20 18:32:28,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=774714.0, ans=0.125 2023-06-20 18:32:31,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=774714.0, ans=0.125 2023-06-20 18:32:33,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=774774.0, ans=0.125 2023-06-20 18:32:34,564 INFO [train.py:996] (0/4) Epoch 5, batch 7150, loss[loss=0.2872, simple_loss=0.3433, pruned_loss=0.1155, over 21290.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3124, pruned_loss=0.08573, over 4258544.25 frames. ], batch size: 176, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:33:56,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=774954.0, ans=0.125 2023-06-20 18:33:58,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=775014.0, ans=0.0 2023-06-20 18:34:08,294 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.836e+02 3.355e+02 3.927e+02 6.037e+02, threshold=6.711e+02, percent-clipped=0.0 2023-06-20 18:34:16,168 INFO [train.py:996] (0/4) Epoch 5, batch 7200, loss[loss=0.2443, simple_loss=0.2991, pruned_loss=0.09474, over 21161.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3164, pruned_loss=0.08905, over 4264227.73 frames. ], batch size: 176, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:34:16,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=775074.0, ans=0.125 2023-06-20 18:34:37,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=775134.0, ans=0.1 2023-06-20 18:34:40,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=775134.0, ans=0.2 2023-06-20 18:35:13,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=775194.0, ans=0.1 2023-06-20 18:35:33,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=775254.0, ans=0.0 2023-06-20 18:35:52,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=775314.0, ans=0.0 2023-06-20 18:35:58,261 INFO [train.py:996] (0/4) Epoch 5, batch 7250, loss[loss=0.2396, simple_loss=0.2904, pruned_loss=0.09438, over 21295.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.313, pruned_loss=0.08925, over 4270038.22 frames. ], batch size: 177, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:36:27,143 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.81 vs. 
limit=15.0 2023-06-20 18:36:30,266 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-06-20 18:37:32,036 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 2.761e+02 3.389e+02 4.055e+02 6.932e+02, threshold=6.778e+02, percent-clipped=1.0 2023-06-20 18:37:32,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=775614.0, ans=0.125 2023-06-20 18:37:40,460 INFO [train.py:996] (0/4) Epoch 5, batch 7300, loss[loss=0.2418, simple_loss=0.2871, pruned_loss=0.09827, over 21383.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3081, pruned_loss=0.08844, over 4269948.62 frames. ], batch size: 160, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:38:35,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=775794.0, ans=0.0 2023-06-20 18:38:47,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=775794.0, ans=0.125 2023-06-20 18:39:01,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=775854.0, ans=0.125 2023-06-20 18:39:19,185 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:39:25,178 INFO [train.py:996] (0/4) Epoch 5, batch 7350, loss[loss=0.3018, simple_loss=0.357, pruned_loss=0.1233, over 21342.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3053, pruned_loss=0.08943, over 4269385.49 frames. ], batch size: 159, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:39:29,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=775974.0, ans=0.125 2023-06-20 18:39:37,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=775974.0, ans=0.05 2023-06-20 18:39:44,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=12.0 2023-06-20 18:40:37,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=776154.0, ans=0.125 2023-06-20 18:41:01,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 3.040e+02 3.793e+02 4.434e+02 6.655e+02, threshold=7.586e+02, percent-clipped=0.0 2023-06-20 18:41:09,292 INFO [train.py:996] (0/4) Epoch 5, batch 7400, loss[loss=0.2529, simple_loss=0.3143, pruned_loss=0.09577, over 21354.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3104, pruned_loss=0.0918, over 4274829.94 frames. ], batch size: 176, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:42:28,571 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=22.5 2023-06-20 18:42:31,016 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:42:41,014 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.15 vs. 
limit=22.5 2023-06-20 18:42:41,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=776514.0, ans=0.1 2023-06-20 18:42:58,415 INFO [train.py:996] (0/4) Epoch 5, batch 7450, loss[loss=0.2485, simple_loss=0.3031, pruned_loss=0.09696, over 21106.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3087, pruned_loss=0.09073, over 4275056.18 frames. ], batch size: 176, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:43:22,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=776574.0, ans=0.0 2023-06-20 18:43:31,680 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.31 vs. limit=6.0 2023-06-20 18:43:59,036 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:44:10,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=776754.0, ans=0.125 2023-06-20 18:44:12,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=776754.0, ans=0.0 2023-06-20 18:44:12,880 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-20 18:44:34,829 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 2.956e+02 3.297e+02 4.311e+02 7.109e+02, threshold=6.593e+02, percent-clipped=0.0 2023-06-20 18:44:48,800 INFO [train.py:996] (0/4) Epoch 5, batch 7500, loss[loss=0.2797, simple_loss=0.354, pruned_loss=0.1027, over 21515.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3148, pruned_loss=0.09221, over 4274397.45 frames. ], batch size: 389, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:45:21,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=776934.0, ans=0.1 2023-06-20 18:46:22,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=777114.0, ans=0.125 2023-06-20 18:46:33,807 INFO [train.py:996] (0/4) Epoch 5, batch 7550, loss[loss=0.2213, simple_loss=0.3132, pruned_loss=0.06467, over 21579.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3255, pruned_loss=0.09243, over 4280520.71 frames. 
], batch size: 230, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:47:12,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=777234.0, ans=0.0 2023-06-20 18:47:19,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=777294.0, ans=0.0 2023-06-20 18:47:24,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=777294.0, ans=0.125 2023-06-20 18:47:25,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=777294.0, ans=0.1 2023-06-20 18:48:01,294 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 2.804e+02 3.228e+02 4.067e+02 8.299e+02, threshold=6.455e+02, percent-clipped=1.0 2023-06-20 18:48:02,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-20 18:48:13,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=777474.0, ans=0.2 2023-06-20 18:48:14,639 INFO [train.py:996] (0/4) Epoch 5, batch 7600, loss[loss=0.265, simple_loss=0.3244, pruned_loss=0.1028, over 21901.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3257, pruned_loss=0.09179, over 4291659.32 frames. ], batch size: 316, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:48:28,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=777474.0, ans=0.1 2023-06-20 18:48:30,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=777474.0, ans=0.0 2023-06-20 18:48:49,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=777534.0, ans=0.125 2023-06-20 18:49:11,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=777594.0, ans=0.125 2023-06-20 18:50:01,096 INFO [train.py:996] (0/4) Epoch 5, batch 7650, loss[loss=0.2694, simple_loss=0.3196, pruned_loss=0.1096, over 21569.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3234, pruned_loss=0.09299, over 4290815.95 frames. ], batch size: 195, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:50:25,422 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:50:42,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.44 vs. limit=15.0 2023-06-20 18:51:30,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=778014.0, ans=0.125 2023-06-20 18:51:36,503 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.282e+02 3.076e+02 3.625e+02 4.419e+02 8.627e+02, threshold=7.249e+02, percent-clipped=2.0 2023-06-20 18:51:44,656 INFO [train.py:996] (0/4) Epoch 5, batch 7700, loss[loss=0.3323, simple_loss=0.3797, pruned_loss=0.1424, over 21350.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.327, pruned_loss=0.09679, over 4295857.15 frames. 
], batch size: 507, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:52:21,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=778194.0, ans=0.0 2023-06-20 18:52:25,139 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:53:03,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=778254.0, ans=0.125 2023-06-20 18:53:35,136 INFO [train.py:996] (0/4) Epoch 5, batch 7750, loss[loss=0.3016, simple_loss=0.4016, pruned_loss=0.1008, over 21645.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3306, pruned_loss=0.09482, over 4293632.92 frames. ], batch size: 414, lr: 6.41e-03, grad_scale: 16.0 2023-06-20 18:54:13,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=778494.0, ans=0.125 2023-06-20 18:54:59,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=778614.0, ans=0.015 2023-06-20 18:55:11,516 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.09 vs. limit=12.0 2023-06-20 18:55:13,716 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 2.997e+02 3.414e+02 4.185e+02 6.372e+02, threshold=6.827e+02, percent-clipped=0.0 2023-06-20 18:55:19,861 INFO [train.py:996] (0/4) Epoch 5, batch 7800, loss[loss=0.2323, simple_loss=0.3077, pruned_loss=0.07847, over 21769.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.333, pruned_loss=0.09565, over 4282646.72 frames. ], batch size: 333, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 18:55:22,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=778674.0, ans=0.0 2023-06-20 18:55:44,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=778734.0, ans=0.2 2023-06-20 18:56:40,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=778914.0, ans=0.125 2023-06-20 18:57:03,301 INFO [train.py:996] (0/4) Epoch 5, batch 7850, loss[loss=0.2471, simple_loss=0.2991, pruned_loss=0.09755, over 21563.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3247, pruned_loss=0.09383, over 4281983.51 frames. ], batch size: 263, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 18:57:43,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=779094.0, ans=0.125 2023-06-20 18:58:00,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=779094.0, ans=0.125 2023-06-20 18:58:21,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=779154.0, ans=0.04949747468305833 2023-06-20 18:58:41,650 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.807e+02 3.191e+02 4.000e+02 6.084e+02, threshold=6.382e+02, percent-clipped=0.0 2023-06-20 18:58:48,757 INFO [train.py:996] (0/4) Epoch 5, batch 7900, loss[loss=0.2901, simple_loss=0.3801, pruned_loss=0.1001, over 21622.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3208, pruned_loss=0.09297, over 4272361.10 frames. 
], batch size: 414, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 18:59:55,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=779454.0, ans=0.125 2023-06-20 19:00:02,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=779454.0, ans=0.125 2023-06-20 19:00:14,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=779454.0, ans=0.0 2023-06-20 19:00:21,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=779514.0, ans=0.0 2023-06-20 19:00:34,075 INFO [train.py:996] (0/4) Epoch 5, batch 7950, loss[loss=0.2761, simple_loss=0.3404, pruned_loss=0.1059, over 21243.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3267, pruned_loss=0.09235, over 4272734.53 frames. ], batch size: 143, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:00:34,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=779574.0, ans=0.0 2023-06-20 19:00:44,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=779574.0, ans=0.0 2023-06-20 19:01:10,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=779634.0, ans=0.125 2023-06-20 19:01:58,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=779814.0, ans=0.04949747468305833 2023-06-20 19:02:03,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=779814.0, ans=0.1 2023-06-20 19:02:08,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 3.095e+02 3.551e+02 4.583e+02 8.567e+02, threshold=7.102e+02, percent-clipped=4.0 2023-06-20 19:02:14,723 INFO [train.py:996] (0/4) Epoch 5, batch 8000, loss[loss=0.2713, simple_loss=0.3324, pruned_loss=0.1052, over 21246.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3303, pruned_loss=0.09565, over 4267634.19 frames. ], batch size: 176, lr: 6.40e-03, grad_scale: 32.0 2023-06-20 19:03:41,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=780114.0, ans=0.125 2023-06-20 19:03:48,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=780114.0, ans=0.125 2023-06-20 19:04:08,256 INFO [train.py:996] (0/4) Epoch 5, batch 8050, loss[loss=0.2773, simple_loss=0.3643, pruned_loss=0.09513, over 21666.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3331, pruned_loss=0.09602, over 4270852.73 frames. ], batch size: 389, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:04:44,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=780234.0, ans=0.0 2023-06-20 19:05:02,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.62 vs. 
limit=15.0 2023-06-20 19:05:45,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=780414.0, ans=0.0 2023-06-20 19:05:46,302 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2023-06-20 19:05:46,954 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 3.429e+02 3.996e+02 5.275e+02 1.132e+03, threshold=7.992e+02, percent-clipped=3.0 2023-06-20 19:05:52,196 INFO [train.py:996] (0/4) Epoch 5, batch 8100, loss[loss=0.315, simple_loss=0.3598, pruned_loss=0.135, over 21635.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3318, pruned_loss=0.09617, over 4273024.79 frames. ], batch size: 471, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:06:03,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.24 vs. limit=15.0 2023-06-20 19:06:19,192 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.46 vs. limit=10.0 2023-06-20 19:06:26,039 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:07:49,541 INFO [train.py:996] (0/4) Epoch 5, batch 8150, loss[loss=0.2747, simple_loss=0.3913, pruned_loss=0.07905, over 21175.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3374, pruned_loss=0.09633, over 4269016.14 frames. ], batch size: 548, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:07:56,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=780774.0, ans=0.125 2023-06-20 19:08:01,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=780774.0, ans=0.0 2023-06-20 19:08:01,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=780774.0, ans=0.125 2023-06-20 19:08:32,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-20 19:08:43,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.23 vs. limit=15.0 2023-06-20 19:09:27,399 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 3.029e+02 3.506e+02 4.177e+02 7.466e+02, threshold=7.011e+02, percent-clipped=0.0 2023-06-20 19:09:32,535 INFO [train.py:996] (0/4) Epoch 5, batch 8200, loss[loss=0.259, simple_loss=0.3139, pruned_loss=0.102, over 21549.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3311, pruned_loss=0.0937, over 4261599.83 frames. ], batch size: 391, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:09:36,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=781074.0, ans=0.0 2023-06-20 19:09:56,602 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. 
limit=15.0 2023-06-20 19:10:14,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=781194.0, ans=0.0 2023-06-20 19:10:36,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=781254.0, ans=0.1 2023-06-20 19:10:59,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=781314.0, ans=0.125 2023-06-20 19:11:15,887 INFO [train.py:996] (0/4) Epoch 5, batch 8250, loss[loss=0.2895, simple_loss=0.3751, pruned_loss=0.102, over 21809.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3289, pruned_loss=0.09338, over 4252754.78 frames. ], batch size: 371, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:11:31,971 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-20 19:12:12,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=781494.0, ans=0.0 2023-06-20 19:12:19,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=781554.0, ans=0.0 2023-06-20 19:12:56,441 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.737e+02 3.159e+02 4.146e+02 7.904e+02, threshold=6.318e+02, percent-clipped=1.0 2023-06-20 19:12:59,869 INFO [train.py:996] (0/4) Epoch 5, batch 8300, loss[loss=0.1912, simple_loss=0.2683, pruned_loss=0.05705, over 21385.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3261, pruned_loss=0.09126, over 4247276.29 frames. ], batch size: 194, lr: 6.39e-03, grad_scale: 8.0 2023-06-20 19:14:34,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=781914.0, ans=0.125 2023-06-20 19:14:43,852 INFO [train.py:996] (0/4) Epoch 5, batch 8350, loss[loss=0.2347, simple_loss=0.3147, pruned_loss=0.07731, over 21588.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3248, pruned_loss=0.08913, over 4256856.83 frames. ], batch size: 391, lr: 6.39e-03, grad_scale: 8.0 2023-06-20 19:14:46,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-20 19:14:49,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=781974.0, ans=0.0 2023-06-20 19:15:29,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=782094.0, ans=0.125 2023-06-20 19:15:48,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=782154.0, ans=0.125 2023-06-20 19:16:18,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.671e+02 3.158e+02 4.298e+02 8.367e+02, threshold=6.316e+02, percent-clipped=9.0 2023-06-20 19:16:21,599 INFO [train.py:996] (0/4) Epoch 5, batch 8400, loss[loss=0.2182, simple_loss=0.2967, pruned_loss=0.06984, over 21781.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3226, pruned_loss=0.08626, over 4261685.40 frames. 
], batch size: 317, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:17:07,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=782394.0, ans=0.125 2023-06-20 19:17:46,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=782454.0, ans=0.125 2023-06-20 19:18:06,497 INFO [train.py:996] (0/4) Epoch 5, batch 8450, loss[loss=0.2282, simple_loss=0.2883, pruned_loss=0.084, over 20763.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3194, pruned_loss=0.08536, over 4261644.56 frames. ], batch size: 609, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:18:21,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=782574.0, ans=0.2 2023-06-20 19:18:45,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=782634.0, ans=0.2 2023-06-20 19:19:10,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=782694.0, ans=0.0 2023-06-20 19:19:10,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=782694.0, ans=0.0 2023-06-20 19:19:20,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=782754.0, ans=0.0 2023-06-20 19:19:28,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=782754.0, ans=0.04949747468305833 2023-06-20 19:19:32,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=782814.0, ans=0.2 2023-06-20 19:19:40,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=782814.0, ans=0.1 2023-06-20 19:19:46,760 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 2.787e+02 3.267e+02 4.084e+02 6.258e+02, threshold=6.534e+02, percent-clipped=0.0 2023-06-20 19:19:49,907 INFO [train.py:996] (0/4) Epoch 5, batch 8500, loss[loss=0.2039, simple_loss=0.2646, pruned_loss=0.07163, over 21473.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3157, pruned_loss=0.0866, over 4262623.79 frames. ], batch size: 212, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:19:52,018 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:20:04,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=782934.0, ans=0.07 2023-06-20 19:20:25,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=782934.0, ans=0.0 2023-06-20 19:21:24,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=783114.0, ans=0.0 2023-06-20 19:21:35,056 INFO [train.py:996] (0/4) Epoch 5, batch 8550, loss[loss=0.31, simple_loss=0.3775, pruned_loss=0.1213, over 21734.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3206, pruned_loss=0.09013, over 4272386.48 frames. 
], batch size: 351, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:22:58,158 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:23:11,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=783414.0, ans=0.0 2023-06-20 19:23:16,130 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 2.988e+02 3.421e+02 4.230e+02 6.048e+02, threshold=6.842e+02, percent-clipped=0.0 2023-06-20 19:23:16,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=783414.0, ans=0.0 2023-06-20 19:23:19,474 INFO [train.py:996] (0/4) Epoch 5, batch 8600, loss[loss=0.3125, simple_loss=0.3778, pruned_loss=0.1236, over 21639.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3281, pruned_loss=0.09233, over 4273248.76 frames. ], batch size: 389, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:23:51,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=783534.0, ans=0.1 2023-06-20 19:24:03,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.25 vs. limit=22.5 2023-06-20 19:24:31,690 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.49 vs. limit=10.0 2023-06-20 19:24:51,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783714.0, ans=0.1 2023-06-20 19:24:52,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=783714.0, ans=0.0 2023-06-20 19:25:14,188 INFO [train.py:996] (0/4) Epoch 5, batch 8650, loss[loss=0.1894, simple_loss=0.2893, pruned_loss=0.04476, over 21758.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3342, pruned_loss=0.09276, over 4271982.99 frames. ], batch size: 332, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:25:17,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=783774.0, ans=0.125 2023-06-20 19:25:22,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=783774.0, ans=0.0 2023-06-20 19:25:26,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=783774.0, ans=0.125 2023-06-20 19:26:14,982 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2023-06-20 19:26:17,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=783954.0, ans=0.2 2023-06-20 19:26:24,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=783954.0, ans=0.07 2023-06-20 19:26:48,619 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.897e+02 3.428e+02 4.241e+02 7.600e+02, threshold=6.856e+02, percent-clipped=1.0 2023-06-20 19:26:51,763 INFO [train.py:996] (0/4) Epoch 5, batch 8700, loss[loss=0.2183, simple_loss=0.2765, pruned_loss=0.08006, over 21421.00 frames. 
], tot_loss[loss=0.2538, simple_loss=0.3262, pruned_loss=0.09067, over 4260129.93 frames. ], batch size: 131, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:27:38,788 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:27:52,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=784254.0, ans=0.125 2023-06-20 19:28:07,258 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=12.0 2023-06-20 19:28:14,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=784314.0, ans=0.1 2023-06-20 19:28:34,549 INFO [train.py:996] (0/4) Epoch 5, batch 8750, loss[loss=0.234, simple_loss=0.2934, pruned_loss=0.08731, over 21575.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3226, pruned_loss=0.0913, over 4266532.13 frames. ], batch size: 195, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:29:12,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=784434.0, ans=0.1 2023-06-20 19:29:53,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=784614.0, ans=0.125 2023-06-20 19:30:14,576 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.095e+02 3.629e+02 5.257e+02 8.550e+02, threshold=7.257e+02, percent-clipped=6.0 2023-06-20 19:30:18,148 INFO [train.py:996] (0/4) Epoch 5, batch 8800, loss[loss=0.3044, simple_loss=0.3571, pruned_loss=0.1259, over 20242.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3297, pruned_loss=0.09415, over 4267357.21 frames. ], batch size: 707, lr: 6.38e-03, grad_scale: 32.0 2023-06-20 19:31:22,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.29 vs. limit=6.0 2023-06-20 19:31:27,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-20 19:31:47,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=784914.0, ans=0.125 2023-06-20 19:31:52,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5 2023-06-20 19:32:01,559 INFO [train.py:996] (0/4) Epoch 5, batch 8850, loss[loss=0.2917, simple_loss=0.3522, pruned_loss=0.1156, over 21336.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3398, pruned_loss=0.09688, over 4258659.52 frames. 
], batch size: 471, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:32:27,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=785034.0, ans=0.0 2023-06-20 19:32:35,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=785034.0, ans=0.125 2023-06-20 19:32:45,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=785094.0, ans=0.0 2023-06-20 19:32:59,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=785094.0, ans=0.125 2023-06-20 19:33:21,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=785154.0, ans=0.125 2023-06-20 19:33:32,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=785214.0, ans=0.0 2023-06-20 19:33:45,534 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 3.064e+02 3.643e+02 4.711e+02 6.430e+02, threshold=7.286e+02, percent-clipped=0.0 2023-06-20 19:33:47,247 INFO [train.py:996] (0/4) Epoch 5, batch 8900, loss[loss=0.238, simple_loss=0.315, pruned_loss=0.08046, over 21226.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3356, pruned_loss=0.09611, over 4260624.81 frames. ], batch size: 548, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:34:19,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=785334.0, ans=0.0 2023-06-20 19:35:05,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=785454.0, ans=0.1 2023-06-20 19:35:28,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=785514.0, ans=0.0 2023-06-20 19:35:37,886 INFO [train.py:996] (0/4) Epoch 5, batch 8950, loss[loss=0.259, simple_loss=0.3191, pruned_loss=0.09941, over 21664.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3372, pruned_loss=0.09625, over 4264358.97 frames. ], batch size: 298, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:36:41,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=785754.0, ans=0.2 2023-06-20 19:36:48,705 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.27 vs. limit=12.0 2023-06-20 19:36:57,716 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:37:01,673 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.56 vs. limit=22.5 2023-06-20 19:37:10,087 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=12.0 2023-06-20 19:37:18,316 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.139e+02 3.697e+02 4.422e+02 7.989e+02, threshold=7.395e+02, percent-clipped=1.0 2023-06-20 19:37:19,882 INFO [train.py:996] (0/4) Epoch 5, batch 9000, loss[loss=0.3057, simple_loss=0.3427, pruned_loss=0.1343, over 21240.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3312, pruned_loss=0.0956, over 4260236.78 frames. 
], batch size: 471, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:37:19,883 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 19:37:36,470 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2656, simple_loss=0.3627, pruned_loss=0.0843, over 1796401.00 frames. 2023-06-20 19:37:36,471 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24341MB 2023-06-20 19:37:37,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=785874.0, ans=0.0 2023-06-20 19:37:39,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=785874.0, ans=0.125 2023-06-20 19:37:47,956 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-06-20 19:38:00,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=785934.0, ans=0.125 2023-06-20 19:39:06,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=786114.0, ans=0.0 2023-06-20 19:39:16,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=786114.0, ans=0.125 2023-06-20 19:39:20,746 INFO [train.py:996] (0/4) Epoch 5, batch 9050, loss[loss=0.2849, simple_loss=0.3511, pruned_loss=0.1094, over 21889.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3259, pruned_loss=0.09215, over 4262863.64 frames. ], batch size: 372, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:39:55,790 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.46 vs. limit=10.0 2023-06-20 19:40:25,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=786354.0, ans=0.125 2023-06-20 19:40:58,940 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.876e+02 3.189e+02 3.875e+02 6.604e+02, threshold=6.378e+02, percent-clipped=0.0 2023-06-20 19:41:00,744 INFO [train.py:996] (0/4) Epoch 5, batch 9100, loss[loss=0.2571, simple_loss=0.3521, pruned_loss=0.08103, over 21597.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3327, pruned_loss=0.09444, over 4263580.14 frames. ], batch size: 389, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:41:15,465 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0 2023-06-20 19:41:39,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=786534.0, ans=0.125 2023-06-20 19:41:40,171 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. limit=6.0 2023-06-20 19:42:14,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=786654.0, ans=0.0 2023-06-20 19:42:19,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=786654.0, ans=0.2 2023-06-20 19:42:42,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.83 vs. 
limit=15.0 2023-06-20 19:42:43,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=786714.0, ans=0.125 2023-06-20 19:42:45,924 INFO [train.py:996] (0/4) Epoch 5, batch 9150, loss[loss=0.3241, simple_loss=0.4141, pruned_loss=0.1171, over 21305.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3335, pruned_loss=0.09181, over 4259642.55 frames. ], batch size: 548, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:43:16,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=786834.0, ans=0.0 2023-06-20 19:43:28,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=786834.0, ans=0.0 2023-06-20 19:43:42,767 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=22.5 2023-06-20 19:43:45,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=786894.0, ans=0.1 2023-06-20 19:44:38,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.853e+02 3.445e+02 4.216e+02 8.485e+02, threshold=6.890e+02, percent-clipped=4.0 2023-06-20 19:44:39,965 INFO [train.py:996] (0/4) Epoch 5, batch 9200, loss[loss=0.2617, simple_loss=0.3476, pruned_loss=0.08787, over 21660.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3355, pruned_loss=0.09044, over 4261576.47 frames. ], batch size: 441, lr: 6.37e-03, grad_scale: 32.0 2023-06-20 19:45:29,317 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=22.5 2023-06-20 19:46:23,478 INFO [train.py:996] (0/4) Epoch 5, batch 9250, loss[loss=0.2551, simple_loss=0.3191, pruned_loss=0.09549, over 21776.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3381, pruned_loss=0.09342, over 4266850.73 frames. ], batch size: 124, lr: 6.37e-03, grad_scale: 32.0 2023-06-20 19:47:32,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=787554.0, ans=0.125 2023-06-20 19:47:34,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=787554.0, ans=0.05 2023-06-20 19:48:05,898 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 3.152e+02 3.844e+02 5.008e+02 7.995e+02, threshold=7.688e+02, percent-clipped=5.0 2023-06-20 19:48:12,346 INFO [train.py:996] (0/4) Epoch 5, batch 9300, loss[loss=0.2634, simple_loss=0.3552, pruned_loss=0.08582, over 21767.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3337, pruned_loss=0.09408, over 4265255.79 frames. ], batch size: 351, lr: 6.37e-03, grad_scale: 32.0 2023-06-20 19:49:02,538 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-06-20 19:49:05,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. 
limit=15.0 2023-06-20 19:49:55,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=787974.0, ans=0.1 2023-06-20 19:49:57,409 INFO [train.py:996] (0/4) Epoch 5, batch 9350, loss[loss=0.2805, simple_loss=0.3564, pruned_loss=0.1023, over 21842.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.34, pruned_loss=0.09536, over 4268686.58 frames. ], batch size: 118, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:50:13,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=787974.0, ans=0.125 2023-06-20 19:50:25,132 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:50:33,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=788034.0, ans=0.125 2023-06-20 19:51:41,584 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.821e+02 3.195e+02 3.693e+02 6.028e+02, threshold=6.390e+02, percent-clipped=0.0 2023-06-20 19:51:41,605 INFO [train.py:996] (0/4) Epoch 5, batch 9400, loss[loss=0.2781, simple_loss=0.3246, pruned_loss=0.1158, over 21421.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3408, pruned_loss=0.0961, over 4271339.67 frames. ], batch size: 510, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:52:33,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.33 vs. limit=10.0 2023-06-20 19:52:54,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=788454.0, ans=0.125 2023-06-20 19:53:28,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=788514.0, ans=0.0 2023-06-20 19:53:31,435 INFO [train.py:996] (0/4) Epoch 5, batch 9450, loss[loss=0.2401, simple_loss=0.2964, pruned_loss=0.09192, over 21712.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3313, pruned_loss=0.09474, over 4270186.82 frames. ], batch size: 334, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 19:53:31,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=788574.0, ans=0.125 2023-06-20 19:54:07,091 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.89 vs. limit=15.0 2023-06-20 19:55:15,212 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.307e+02 4.000e+02 4.833e+02 8.255e+02, threshold=8.000e+02, percent-clipped=8.0 2023-06-20 19:55:15,242 INFO [train.py:996] (0/4) Epoch 5, batch 9500, loss[loss=0.2485, simple_loss=0.31, pruned_loss=0.09349, over 21786.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3237, pruned_loss=0.09225, over 4263519.44 frames. ], batch size: 118, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 19:56:57,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=789174.0, ans=0.95 2023-06-20 19:56:58,677 INFO [train.py:996] (0/4) Epoch 5, batch 9550, loss[loss=0.2934, simple_loss=0.3801, pruned_loss=0.1034, over 21622.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3282, pruned_loss=0.09501, over 4264262.29 frames. 
], batch size: 389, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 19:57:38,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=789294.0, ans=0.125 2023-06-20 19:57:59,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=789354.0, ans=0.0 2023-06-20 19:58:13,749 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:58:42,206 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 2.917e+02 3.406e+02 3.917e+02 8.592e+02, threshold=6.812e+02, percent-clipped=1.0 2023-06-20 19:58:42,237 INFO [train.py:996] (0/4) Epoch 5, batch 9600, loss[loss=0.2866, simple_loss=0.4019, pruned_loss=0.08561, over 20818.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3317, pruned_loss=0.09694, over 4270399.47 frames. ], batch size: 607, lr: 6.36e-03, grad_scale: 32.0 2023-06-20 19:58:53,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=789474.0, ans=0.0 2023-06-20 19:59:51,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=789654.0, ans=0.125 2023-06-20 20:00:28,533 INFO [train.py:996] (0/4) Epoch 5, batch 9650, loss[loss=0.2503, simple_loss=0.3223, pruned_loss=0.08915, over 21933.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3303, pruned_loss=0.09596, over 4276490.08 frames. ], batch size: 316, lr: 6.36e-03, grad_scale: 32.0 2023-06-20 20:01:23,313 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:01:43,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=789954.0, ans=0.5 2023-06-20 20:02:04,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=790014.0, ans=0.95 2023-06-20 20:02:08,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=790014.0, ans=0.125 2023-06-20 20:02:13,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.873e+02 3.424e+02 4.153e+02 6.817e+02, threshold=6.847e+02, percent-clipped=1.0 2023-06-20 20:02:13,517 INFO [train.py:996] (0/4) Epoch 5, batch 9700, loss[loss=0.2154, simple_loss=0.2935, pruned_loss=0.06867, over 20047.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3331, pruned_loss=0.09587, over 4277382.74 frames. 
], batch size: 703, lr: 6.36e-03, grad_scale: 32.0 2023-06-20 20:03:02,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=790194.0, ans=0.125 2023-06-20 20:03:22,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=790254.0, ans=0.125 2023-06-20 20:03:49,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=790314.0, ans=0.0 2023-06-20 20:03:49,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=790314.0, ans=0.125 2023-06-20 20:03:56,226 INFO [train.py:996] (0/4) Epoch 5, batch 9750, loss[loss=0.2195, simple_loss=0.2723, pruned_loss=0.08332, over 21187.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3251, pruned_loss=0.09386, over 4265983.53 frames. ], batch size: 159, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 20:03:57,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-20 20:04:04,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=790374.0, ans=0.125 2023-06-20 20:04:56,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=790554.0, ans=0.1 2023-06-20 20:05:02,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=790554.0, ans=0.0 2023-06-20 20:05:34,164 INFO [train.py:996] (0/4) Epoch 5, batch 9800, loss[loss=0.2378, simple_loss=0.2993, pruned_loss=0.08817, over 21660.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3256, pruned_loss=0.09413, over 4254396.67 frames. ], batch size: 263, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 20:05:35,779 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.031e+02 3.637e+02 4.540e+02 9.363e+02, threshold=7.274e+02, percent-clipped=7.0 2023-06-20 20:05:53,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=790734.0, ans=0.0 2023-06-20 20:06:40,764 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-20 20:06:45,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. limit=12.0 2023-06-20 20:06:48,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=790914.0, ans=0.0 2023-06-20 20:07:12,178 INFO [train.py:996] (0/4) Epoch 5, batch 9850, loss[loss=0.2538, simple_loss=0.3093, pruned_loss=0.09909, over 21717.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3222, pruned_loss=0.09408, over 4251188.44 frames. ], batch size: 112, lr: 6.35e-03, grad_scale: 16.0 2023-06-20 20:07:16,489 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5 2023-06-20 20:07:26,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.62 vs. 
limit=22.5 2023-06-20 20:08:51,637 INFO [train.py:996] (0/4) Epoch 5, batch 9900, loss[loss=0.2727, simple_loss=0.3523, pruned_loss=0.09657, over 19795.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3184, pruned_loss=0.0938, over 4238983.74 frames. ], batch size: 702, lr: 6.35e-03, grad_scale: 16.0 2023-06-20 20:08:53,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.770e+02 3.187e+02 3.744e+02 7.656e+02, threshold=6.375e+02, percent-clipped=1.0 2023-06-20 20:08:53,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=791274.0, ans=0.0 2023-06-20 20:08:57,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=791274.0, ans=0.2 2023-06-20 20:09:02,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=791274.0, ans=0.125 2023-06-20 20:09:03,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=791274.0, ans=0.035 2023-06-20 20:09:20,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=791334.0, ans=0.125 2023-06-20 20:09:29,361 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=15.0 2023-06-20 20:09:50,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=791454.0, ans=0.0 2023-06-20 20:09:54,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=791454.0, ans=0.0 2023-06-20 20:09:59,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-20 20:10:29,354 INFO [train.py:996] (0/4) Epoch 5, batch 9950, loss[loss=0.2274, simple_loss=0.2879, pruned_loss=0.08343, over 21878.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3209, pruned_loss=0.0958, over 4245120.45 frames. ], batch size: 317, lr: 6.35e-03, grad_scale: 16.0 2023-06-20 20:10:57,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=791634.0, ans=0.125 2023-06-20 20:10:59,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=791634.0, ans=0.0 2023-06-20 20:11:21,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=791694.0, ans=0.125 2023-06-20 20:12:00,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=15.0 2023-06-20 20:12:13,220 INFO [train.py:996] (0/4) Epoch 5, batch 10000, loss[loss=0.342, simple_loss=0.3868, pruned_loss=0.1487, over 21431.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3182, pruned_loss=0.0955, over 4257995.00 frames. 
], batch size: 509, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:12:13,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=791874.0, ans=0.5 2023-06-20 20:12:14,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.860e+02 3.256e+02 3.834e+02 6.756e+02, threshold=6.512e+02, percent-clipped=1.0 2023-06-20 20:13:01,642 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-132000.pt 2023-06-20 20:13:03,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=791994.0, ans=0.125 2023-06-20 20:13:05,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.75 vs. limit=15.0 2023-06-20 20:13:15,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0 2023-06-20 20:13:22,275 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:13:31,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=792054.0, ans=0.2 2023-06-20 20:13:35,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=792054.0, ans=0.1 2023-06-20 20:13:50,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-20 20:14:04,461 INFO [train.py:996] (0/4) Epoch 5, batch 10050, loss[loss=0.2281, simple_loss=0.2965, pruned_loss=0.07987, over 21710.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3209, pruned_loss=0.09569, over 4262065.82 frames. ], batch size: 282, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:14:18,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=15.0 2023-06-20 20:14:29,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=792234.0, ans=0.125 2023-06-20 20:14:31,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=792234.0, ans=0.125 2023-06-20 20:15:20,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=792354.0, ans=0.125 2023-06-20 20:15:27,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=792414.0, ans=0.0 2023-06-20 20:15:48,202 INFO [train.py:996] (0/4) Epoch 5, batch 10100, loss[loss=0.3045, simple_loss=0.3712, pruned_loss=0.1189, over 21647.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3175, pruned_loss=0.09358, over 4263918.28 frames. 
], batch size: 389, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:15:49,943 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.949e+02 3.459e+02 3.930e+02 6.580e+02, threshold=6.918e+02, percent-clipped=2.0 2023-06-20 20:16:35,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=792594.0, ans=0.0 2023-06-20 20:16:38,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=792594.0, ans=0.125 2023-06-20 20:16:39,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=22.5 2023-06-20 20:16:52,509 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-20 20:17:30,466 INFO [train.py:996] (0/4) Epoch 5, batch 10150, loss[loss=0.2813, simple_loss=0.3384, pruned_loss=0.1121, over 21684.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3229, pruned_loss=0.09535, over 4263991.99 frames. ], batch size: 112, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:17:59,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-20 20:18:58,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=793014.0, ans=0.2 2023-06-20 20:18:58,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=793014.0, ans=0.125 2023-06-20 20:18:58,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=793014.0, ans=0.125 2023-06-20 20:19:19,879 INFO [train.py:996] (0/4) Epoch 5, batch 10200, loss[loss=0.2232, simple_loss=0.3027, pruned_loss=0.07184, over 21681.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3223, pruned_loss=0.09291, over 4265489.68 frames. ], batch size: 298, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:19:21,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.868e+02 3.276e+02 4.054e+02 7.472e+02, threshold=6.552e+02, percent-clipped=1.0 2023-06-20 20:19:33,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=793074.0, ans=0.2 2023-06-20 20:19:33,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=793074.0, ans=0.1 2023-06-20 20:19:38,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=793074.0, ans=0.125 2023-06-20 20:20:10,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=793194.0, ans=0.0 2023-06-20 20:20:10,744 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=22.5 2023-06-20 20:20:58,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=793314.0, ans=0.1 2023-06-20 20:21:02,695 INFO [train.py:996] (0/4) Epoch 5, batch 10250, loss[loss=0.201, simple_loss=0.2929, pruned_loss=0.05456, over 21592.00 frames. 
], tot_loss[loss=0.2445, simple_loss=0.3164, pruned_loss=0.08625, over 4271179.45 frames. ], batch size: 389, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:21:15,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=793374.0, ans=0.1 2023-06-20 20:22:53,301 INFO [train.py:996] (0/4) Epoch 5, batch 10300, loss[loss=0.3274, simple_loss=0.3966, pruned_loss=0.1291, over 21471.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.32, pruned_loss=0.08751, over 4279760.56 frames. ], batch size: 471, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:22:54,769 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 2.621e+02 3.179e+02 4.486e+02 7.082e+02, threshold=6.359e+02, percent-clipped=5.0 2023-06-20 20:24:03,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=793854.0, ans=0.0 2023-06-20 20:24:15,803 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-06-20 20:24:27,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=793914.0, ans=0.0 2023-06-20 20:24:27,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=793914.0, ans=0.95 2023-06-20 20:24:37,286 INFO [train.py:996] (0/4) Epoch 5, batch 10350, loss[loss=0.2094, simple_loss=0.2849, pruned_loss=0.06697, over 21701.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.321, pruned_loss=0.08675, over 4279807.34 frames. ], batch size: 247, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:24:42,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=793974.0, ans=0.0 2023-06-20 20:25:07,079 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5 2023-06-20 20:25:13,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=794034.0, ans=0.125 2023-06-20 20:26:22,044 INFO [train.py:996] (0/4) Epoch 5, batch 10400, loss[loss=0.1796, simple_loss=0.2229, pruned_loss=0.06817, over 21107.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3165, pruned_loss=0.08584, over 4276041.62 frames. ], batch size: 143, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:26:22,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=794274.0, ans=0.2 2023-06-20 20:26:23,731 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.345e+02 3.112e+02 3.947e+02 5.007e+02 1.010e+03, threshold=7.895e+02, percent-clipped=9.0 2023-06-20 20:27:26,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.12 vs. 
limit=15.0 2023-06-20 20:27:27,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=794454.0, ans=0.2 2023-06-20 20:27:40,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=794454.0, ans=0.125 2023-06-20 20:27:55,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=794514.0, ans=0.0 2023-06-20 20:27:57,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=794514.0, ans=0.0 2023-06-20 20:28:13,690 INFO [train.py:996] (0/4) Epoch 5, batch 10450, loss[loss=0.2636, simple_loss=0.3426, pruned_loss=0.09233, over 21700.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3205, pruned_loss=0.08877, over 4270335.03 frames. ], batch size: 298, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:28:29,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=794634.0, ans=0.1 2023-06-20 20:29:06,452 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-20 20:29:30,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=794754.0, ans=0.125 2023-06-20 20:29:32,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=794754.0, ans=0.0 2023-06-20 20:29:57,613 INFO [train.py:996] (0/4) Epoch 5, batch 10500, loss[loss=0.2518, simple_loss=0.3039, pruned_loss=0.09987, over 21764.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3204, pruned_loss=0.08842, over 4267913.88 frames. ], batch size: 112, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:29:59,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.789e+02 3.342e+02 3.915e+02 9.640e+02, threshold=6.684e+02, percent-clipped=1.0 2023-06-20 20:30:01,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=794874.0, ans=0.0 2023-06-20 20:30:42,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=794994.0, ans=0.125 2023-06-20 20:31:07,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=795054.0, ans=0.125 2023-06-20 20:31:42,459 INFO [train.py:996] (0/4) Epoch 5, batch 10550, loss[loss=0.2006, simple_loss=0.26, pruned_loss=0.07058, over 21281.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3151, pruned_loss=0.08847, over 4253288.36 frames. 
], batch size: 551, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:32:20,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=795234.0, ans=0.125 2023-06-20 20:32:29,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=795294.0, ans=0.0 2023-06-20 20:32:40,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=795354.0, ans=0.0 2023-06-20 20:32:40,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=795354.0, ans=0.1 2023-06-20 20:32:58,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=795354.0, ans=0.0 2023-06-20 20:33:00,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-20 20:33:28,223 INFO [train.py:996] (0/4) Epoch 5, batch 10600, loss[loss=0.1977, simple_loss=0.2662, pruned_loss=0.06461, over 21901.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.311, pruned_loss=0.08713, over 4255490.70 frames. ], batch size: 98, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:33:29,828 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.099e+02 3.993e+02 4.741e+02 9.586e+02, threshold=7.985e+02, percent-clipped=4.0 2023-06-20 20:34:46,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=795654.0, ans=0.125 2023-06-20 20:35:03,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=795714.0, ans=0.125 2023-06-20 20:35:23,430 INFO [train.py:996] (0/4) Epoch 5, batch 10650, loss[loss=0.2275, simple_loss=0.3267, pruned_loss=0.06419, over 21173.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.314, pruned_loss=0.08617, over 4255241.20 frames. ], batch size: 548, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:36:04,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=795894.0, ans=0.125 2023-06-20 20:36:54,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=796014.0, ans=0.0 2023-06-20 20:37:05,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=796074.0, ans=0.0 2023-06-20 20:37:06,879 INFO [train.py:996] (0/4) Epoch 5, batch 10700, loss[loss=0.2526, simple_loss=0.3156, pruned_loss=0.09484, over 21712.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3122, pruned_loss=0.08609, over 4261599.64 frames. ], batch size: 247, lr: 6.33e-03, grad_scale: 32.0 2023-06-20 20:37:08,782 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.769e+02 3.121e+02 4.008e+02 5.487e+02, threshold=6.241e+02, percent-clipped=0.0 2023-06-20 20:37:26,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=796134.0, ans=0.09899494936611666 2023-06-20 20:37:27,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.85 vs. 
limit=22.5 2023-06-20 20:37:49,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=796194.0, ans=0.125 2023-06-20 20:38:06,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-20 20:38:11,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=796254.0, ans=0.125 2023-06-20 20:38:53,081 INFO [train.py:996] (0/4) Epoch 5, batch 10750, loss[loss=0.2896, simple_loss=0.3578, pruned_loss=0.1107, over 21787.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3241, pruned_loss=0.09052, over 4263681.53 frames. ], batch size: 124, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:38:58,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=796374.0, ans=0.2 2023-06-20 20:40:09,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=796554.0, ans=0.1 2023-06-20 20:40:44,438 INFO [train.py:996] (0/4) Epoch 5, batch 10800, loss[loss=0.2887, simple_loss=0.3487, pruned_loss=0.1144, over 21445.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3303, pruned_loss=0.09183, over 4263908.23 frames. ], batch size: 194, lr: 6.33e-03, grad_scale: 32.0 2023-06-20 20:40:47,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 3.089e+02 3.673e+02 4.196e+02 7.308e+02, threshold=7.346e+02, percent-clipped=3.0 2023-06-20 20:41:16,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=796734.0, ans=0.125 2023-06-20 20:42:29,750 INFO [train.py:996] (0/4) Epoch 5, batch 10850, loss[loss=0.2539, simple_loss=0.3003, pruned_loss=0.1037, over 21472.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3301, pruned_loss=0.09252, over 4262476.04 frames. ], batch size: 511, lr: 6.33e-03, grad_scale: 32.0 2023-06-20 20:42:33,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=796974.0, ans=0.0 2023-06-20 20:42:50,963 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-20 20:42:58,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=797034.0, ans=0.1 2023-06-20 20:43:03,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=797034.0, ans=0.125 2023-06-20 20:43:06,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=797094.0, ans=0.0 2023-06-20 20:44:11,636 INFO [train.py:996] (0/4) Epoch 5, batch 10900, loss[loss=0.2235, simple_loss=0.2865, pruned_loss=0.08023, over 21404.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3227, pruned_loss=0.09081, over 4258096.81 frames. 
], batch size: 211, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:44:16,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.440e+02 2.886e+02 3.394e+02 4.121e+02 7.095e+02, threshold=6.789e+02, percent-clipped=0.0 2023-06-20 20:45:09,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=797454.0, ans=0.125 2023-06-20 20:45:11,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=797454.0, ans=0.125 2023-06-20 20:45:39,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=797514.0, ans=0.0 2023-06-20 20:45:53,503 INFO [train.py:996] (0/4) Epoch 5, batch 10950, loss[loss=0.2257, simple_loss=0.2855, pruned_loss=0.08297, over 21109.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3177, pruned_loss=0.08783, over 4263119.35 frames. ], batch size: 143, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:46:29,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=797694.0, ans=0.2 2023-06-20 20:46:40,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=797694.0, ans=15.0 2023-06-20 20:46:54,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-20 20:47:27,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=797814.0, ans=0.1 2023-06-20 20:47:35,116 INFO [train.py:996] (0/4) Epoch 5, batch 11000, loss[loss=0.2678, simple_loss=0.3302, pruned_loss=0.1027, over 21355.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3171, pruned_loss=0.08948, over 4260266.44 frames. ], batch size: 159, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:47:38,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=797874.0, ans=0.0 2023-06-20 20:47:39,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.761e+02 3.277e+02 3.770e+02 5.855e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-20 20:48:06,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=797934.0, ans=0.0 2023-06-20 20:48:08,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=797934.0, ans=0.0 2023-06-20 20:48:16,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=797994.0, ans=0.2 2023-06-20 20:48:55,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=798114.0, ans=0.0 2023-06-20 20:49:11,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=798114.0, ans=0.025 2023-06-20 20:49:17,467 INFO [train.py:996] (0/4) Epoch 5, batch 11050, loss[loss=0.2535, simple_loss=0.3748, pruned_loss=0.06612, over 20779.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3155, pruned_loss=0.09066, over 4264466.67 frames. 
], batch size: 607, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:49:22,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=798174.0, ans=0.05 2023-06-20 20:49:33,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=798234.0, ans=0.125 2023-06-20 20:50:01,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=798294.0, ans=0.125 2023-06-20 20:50:59,785 INFO [train.py:996] (0/4) Epoch 5, batch 11100, loss[loss=0.2323, simple_loss=0.3027, pruned_loss=0.08096, over 21621.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3148, pruned_loss=0.09118, over 4266112.59 frames. ], batch size: 298, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:51:04,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.421e+02 2.972e+02 3.396e+02 4.031e+02 6.791e+02, threshold=6.791e+02, percent-clipped=1.0 2023-06-20 20:51:05,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=798474.0, ans=0.1 2023-06-20 20:52:00,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=798654.0, ans=0.1 2023-06-20 20:52:02,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=798654.0, ans=0.125 2023-06-20 20:52:43,922 INFO [train.py:996] (0/4) Epoch 5, batch 11150, loss[loss=0.2262, simple_loss=0.3061, pruned_loss=0.07317, over 21247.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3121, pruned_loss=0.08987, over 4254386.43 frames. ], batch size: 143, lr: 6.32e-03, grad_scale: 16.0 2023-06-20 20:53:05,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=798834.0, ans=0.95 2023-06-20 20:53:17,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=798834.0, ans=0.1 2023-06-20 20:53:42,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=798954.0, ans=0.1 2023-06-20 20:53:44,498 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-20 20:54:23,629 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:54:23,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=799014.0, ans=0.0 2023-06-20 20:54:23,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=799014.0, ans=0.0 2023-06-20 20:54:27,934 INFO [train.py:996] (0/4) Epoch 5, batch 11200, loss[loss=0.2354, simple_loss=0.2997, pruned_loss=0.08553, over 21483.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3101, pruned_loss=0.08946, over 4259306.51 frames. 
], batch size: 389, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 20:54:33,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.645e+02 2.968e+02 3.525e+02 6.155e+02, threshold=5.936e+02, percent-clipped=0.0 2023-06-20 20:54:38,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=799074.0, ans=0.125 2023-06-20 20:54:58,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=799134.0, ans=0.1 2023-06-20 20:55:05,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=799194.0, ans=0.0 2023-06-20 20:55:23,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=799194.0, ans=0.0 2023-06-20 20:56:01,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=799314.0, ans=0.125 2023-06-20 20:56:11,027 INFO [train.py:996] (0/4) Epoch 5, batch 11250, loss[loss=0.2334, simple_loss=0.3168, pruned_loss=0.07503, over 21653.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3109, pruned_loss=0.08999, over 4256301.96 frames. ], batch size: 391, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 20:56:19,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=799374.0, ans=0.04949747468305833 2023-06-20 20:56:29,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=799434.0, ans=0.125 2023-06-20 20:56:50,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=799494.0, ans=0.125 2023-06-20 20:57:52,805 INFO [train.py:996] (0/4) Epoch 5, batch 11300, loss[loss=0.2299, simple_loss=0.3175, pruned_loss=0.0712, over 21765.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3127, pruned_loss=0.09, over 4256545.61 frames. ], batch size: 316, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 20:57:57,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.794e+02 3.076e+02 3.460e+02 4.900e+02, threshold=6.152e+02, percent-clipped=0.0 2023-06-20 20:58:23,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=799734.0, ans=0.125 2023-06-20 20:58:43,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=799794.0, ans=0.1 2023-06-20 20:58:48,968 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-20 20:59:28,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=799914.0, ans=0.0 2023-06-20 20:59:33,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=799914.0, ans=0.0 2023-06-20 20:59:38,221 INFO [train.py:996] (0/4) Epoch 5, batch 11350, loss[loss=0.2141, simple_loss=0.2894, pruned_loss=0.06944, over 21273.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3147, pruned_loss=0.08924, over 4264295.67 frames. 
], batch size: 143, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 21:01:06,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=800214.0, ans=0.1 2023-06-20 21:01:20,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=800274.0, ans=0.0 2023-06-20 21:01:21,821 INFO [train.py:996] (0/4) Epoch 5, batch 11400, loss[loss=0.2998, simple_loss=0.3799, pruned_loss=0.1099, over 21627.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3216, pruned_loss=0.09233, over 4271722.85 frames. ], batch size: 441, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 21:01:26,755 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.159e+02 3.655e+02 4.619e+02 8.867e+02, threshold=7.309e+02, percent-clipped=8.0 2023-06-20 21:01:28,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-20 21:01:42,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=800334.0, ans=0.0 2023-06-20 21:01:52,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=800334.0, ans=0.0 2023-06-20 21:01:59,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=800334.0, ans=0.1 2023-06-20 21:02:00,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=800334.0, ans=0.2 2023-06-20 21:02:06,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=15.0 2023-06-20 21:02:31,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=800454.0, ans=0.125 2023-06-20 21:02:32,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=800454.0, ans=0.125 2023-06-20 21:02:33,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=800454.0, ans=0.1 2023-06-20 21:02:39,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=800454.0, ans=0.0 2023-06-20 21:02:45,218 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:02:47,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=800514.0, ans=0.125 2023-06-20 21:02:53,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=800514.0, ans=0.0 2023-06-20 21:03:10,022 INFO [train.py:996] (0/4) Epoch 5, batch 11450, loss[loss=0.2498, simple_loss=0.3337, pruned_loss=0.08294, over 21570.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3217, pruned_loss=0.09015, over 4273697.74 frames. ], batch size: 389, lr: 6.32e-03, grad_scale: 16.0 2023-06-20 21:03:29,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.15 vs. 
limit=15.0 2023-06-20 21:03:38,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-20 21:04:50,166 INFO [train.py:996] (0/4) Epoch 5, batch 11500, loss[loss=0.2251, simple_loss=0.3158, pruned_loss=0.06722, over 21728.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3263, pruned_loss=0.0926, over 4280710.99 frames. ], batch size: 298, lr: 6.32e-03, grad_scale: 16.0 2023-06-20 21:04:56,804 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.797e+02 3.137e+02 3.643e+02 6.427e+02, threshold=6.273e+02, percent-clipped=0.0 2023-06-20 21:05:06,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=800874.0, ans=0.125 2023-06-20 21:06:14,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=801114.0, ans=0.125 2023-06-20 21:06:39,680 INFO [train.py:996] (0/4) Epoch 5, batch 11550, loss[loss=0.2823, simple_loss=0.3791, pruned_loss=0.09274, over 21773.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3318, pruned_loss=0.09227, over 4281075.53 frames. ], batch size: 332, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:08:24,264 INFO [train.py:996] (0/4) Epoch 5, batch 11600, loss[loss=0.2498, simple_loss=0.3507, pruned_loss=0.07441, over 21652.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3432, pruned_loss=0.0938, over 4275312.78 frames. ], batch size: 263, lr: 6.31e-03, grad_scale: 32.0 2023-06-20 21:08:30,831 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.937e+02 3.426e+02 4.222e+02 6.279e+02, threshold=6.853e+02, percent-clipped=1.0 2023-06-20 21:08:42,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=801474.0, ans=0.125 2023-06-20 21:09:58,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=801714.0, ans=0.125 2023-06-20 21:10:06,225 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-20 21:10:08,668 INFO [train.py:996] (0/4) Epoch 5, batch 11650, loss[loss=0.2645, simple_loss=0.3355, pruned_loss=0.09679, over 21496.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3473, pruned_loss=0.09352, over 4274604.77 frames. ], batch size: 230, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:10:11,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=801774.0, ans=0.125 2023-06-20 21:11:12,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=801954.0, ans=0.125 2023-06-20 21:11:22,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=801954.0, ans=0.125 2023-06-20 21:11:31,673 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.52 vs. limit=22.5 2023-06-20 21:11:51,290 INFO [train.py:996] (0/4) Epoch 5, batch 11700, loss[loss=0.2736, simple_loss=0.3225, pruned_loss=0.1123, over 21563.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3381, pruned_loss=0.09287, over 4270987.02 frames. 
], batch size: 415, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:11:52,463 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-20 21:11:58,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=802074.0, ans=0.125 2023-06-20 21:12:02,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=802074.0, ans=0.0 2023-06-20 21:12:03,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.808e+02 3.203e+02 3.779e+02 6.898e+02, threshold=6.406e+02, percent-clipped=1.0 2023-06-20 21:13:07,566 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:13:28,475 INFO [train.py:996] (0/4) Epoch 5, batch 11750, loss[loss=0.2321, simple_loss=0.2836, pruned_loss=0.09024, over 21613.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.329, pruned_loss=0.09301, over 4266183.29 frames. ], batch size: 264, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:13:51,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=802434.0, ans=0.1 2023-06-20 21:14:54,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-20 21:15:10,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=802614.0, ans=0.025 2023-06-20 21:15:11,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=802674.0, ans=0.2 2023-06-20 21:15:18,433 INFO [train.py:996] (0/4) Epoch 5, batch 11800, loss[loss=0.2611, simple_loss=0.3335, pruned_loss=0.09437, over 21478.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3325, pruned_loss=0.09599, over 4259234.69 frames. ], batch size: 211, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:15:26,784 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.280e+02 2.905e+02 3.538e+02 4.499e+02 8.498e+02, threshold=7.075e+02, percent-clipped=3.0 2023-06-20 21:15:29,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=802674.0, ans=0.2 2023-06-20 21:16:53,714 INFO [train.py:996] (0/4) Epoch 5, batch 11850, loss[loss=0.2833, simple_loss=0.3591, pruned_loss=0.1037, over 21789.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3338, pruned_loss=0.09511, over 4268964.17 frames. ], batch size: 414, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:17:19,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=803034.0, ans=0.125 2023-06-20 21:17:20,243 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.76 vs. 
limit=15.0 2023-06-20 21:17:36,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=803094.0, ans=0.125 2023-06-20 21:18:23,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=803214.0, ans=0.125 2023-06-20 21:18:40,298 INFO [train.py:996] (0/4) Epoch 5, batch 11900, loss[loss=0.2737, simple_loss=0.3848, pruned_loss=0.08134, over 20870.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3341, pruned_loss=0.0918, over 4271256.38 frames. ], batch size: 608, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:18:48,893 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.635e+02 2.949e+02 3.429e+02 6.903e+02, threshold=5.899e+02, percent-clipped=0.0 2023-06-20 21:18:49,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=803274.0, ans=0.0 2023-06-20 21:18:53,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.37 vs. limit=6.0 2023-06-20 21:18:58,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=803334.0, ans=0.125 2023-06-20 21:19:01,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=803334.0, ans=0.125 2023-06-20 21:19:35,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=803394.0, ans=0.2 2023-06-20 21:20:19,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=803514.0, ans=0.2 2023-06-20 21:20:25,290 INFO [train.py:996] (0/4) Epoch 5, batch 11950, loss[loss=0.2504, simple_loss=0.3385, pruned_loss=0.08117, over 21698.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.334, pruned_loss=0.08803, over 4272494.89 frames. ], batch size: 414, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:22:08,852 INFO [train.py:996] (0/4) Epoch 5, batch 12000, loss[loss=0.213, simple_loss=0.2809, pruned_loss=0.07257, over 21541.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.327, pruned_loss=0.08608, over 4270455.25 frames. ], batch size: 263, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:22:08,860 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 21:22:26,107 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2641, simple_loss=0.3594, pruned_loss=0.08443, over 1796401.00 frames. 2023-06-20 21:22:26,107 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-20 21:22:30,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=803874.0, ans=0.0 2023-06-20 21:22:34,440 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 3.012e+02 3.779e+02 4.599e+02 7.953e+02, threshold=7.557e+02, percent-clipped=8.0 2023-06-20 21:23:04,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. limit=10.0 2023-06-20 21:23:15,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.35 vs. 
limit=10.0 2023-06-20 21:23:25,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=803994.0, ans=0.1 2023-06-20 21:23:28,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=803994.0, ans=0.125 2023-06-20 21:23:45,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=804054.0, ans=0.2 2023-06-20 21:23:51,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=804114.0, ans=0.125 2023-06-20 21:23:53,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=804114.0, ans=0.1 2023-06-20 21:24:10,206 INFO [train.py:996] (0/4) Epoch 5, batch 12050, loss[loss=0.282, simple_loss=0.3421, pruned_loss=0.111, over 21880.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3255, pruned_loss=0.08968, over 4262873.53 frames. ], batch size: 351, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:24:37,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=804234.0, ans=0.1 2023-06-20 21:25:23,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=804354.0, ans=0.125 2023-06-20 21:25:31,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=804354.0, ans=0.125 2023-06-20 21:26:00,523 INFO [train.py:996] (0/4) Epoch 5, batch 12100, loss[loss=0.3004, simple_loss=0.3623, pruned_loss=0.1192, over 21389.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3296, pruned_loss=0.09432, over 4267708.05 frames. ], batch size: 548, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:26:14,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.887e+02 3.241e+02 3.794e+02 5.961e+02, threshold=6.483e+02, percent-clipped=0.0 2023-06-20 21:26:56,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=804594.0, ans=0.05 2023-06-20 21:27:11,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-06-20 21:27:45,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=804714.0, ans=0.2 2023-06-20 21:27:50,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-06-20 21:27:51,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=804714.0, ans=0.2 2023-06-20 21:27:54,343 INFO [train.py:996] (0/4) Epoch 5, batch 12150, loss[loss=0.234, simple_loss=0.3234, pruned_loss=0.07229, over 21006.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3324, pruned_loss=0.09272, over 4265514.41 frames. 
], batch size: 607, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:28:39,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=804894.0, ans=0.125 2023-06-20 21:29:20,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-20 21:29:35,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=805014.0, ans=0.125 2023-06-20 21:29:38,013 INFO [train.py:996] (0/4) Epoch 5, batch 12200, loss[loss=0.2085, simple_loss=0.2705, pruned_loss=0.07326, over 21510.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3302, pruned_loss=0.09131, over 4257918.33 frames. ], batch size: 230, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:29:38,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=805074.0, ans=0.0 2023-06-20 21:29:51,304 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.288e+02 2.953e+02 3.466e+02 4.601e+02 9.385e+02, threshold=6.933e+02, percent-clipped=9.0 2023-06-20 21:30:10,745 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:30:19,544 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-20 21:30:52,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=805254.0, ans=0.1 2023-06-20 21:31:22,587 INFO [train.py:996] (0/4) Epoch 5, batch 12250, loss[loss=0.1668, simple_loss=0.2471, pruned_loss=0.04322, over 21533.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3217, pruned_loss=0.08804, over 4258663.70 frames. ], batch size: 212, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:32:17,835 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.487e-03 2023-06-20 21:32:22,504 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:32:31,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=805554.0, ans=0.0 2023-06-20 21:32:36,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=805554.0, ans=0.2 2023-06-20 21:32:56,159 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.49 vs. limit=6.0 2023-06-20 21:33:06,439 INFO [train.py:996] (0/4) Epoch 5, batch 12300, loss[loss=0.3121, simple_loss=0.3994, pruned_loss=0.1125, over 19994.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3137, pruned_loss=0.08224, over 4262339.28 frames. 
], batch size: 702, lr: 6.30e-03, grad_scale: 16.0 2023-06-20 21:33:08,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=805674.0, ans=0.125 2023-06-20 21:33:20,996 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 2.636e+02 3.177e+02 4.024e+02 7.253e+02, threshold=6.354e+02, percent-clipped=1.0 2023-06-20 21:33:47,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=805734.0, ans=0.125 2023-06-20 21:33:50,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=805794.0, ans=0.125 2023-06-20 21:33:57,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=805794.0, ans=0.95 2023-06-20 21:34:49,585 INFO [train.py:996] (0/4) Epoch 5, batch 12350, loss[loss=0.2505, simple_loss=0.3449, pruned_loss=0.07805, over 21776.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3181, pruned_loss=0.08288, over 4272268.25 frames. ], batch size: 332, lr: 6.30e-03, grad_scale: 16.0 2023-06-20 21:35:39,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=806094.0, ans=0.0 2023-06-20 21:36:06,543 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:36:15,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=806154.0, ans=0.0 2023-06-20 21:36:34,910 INFO [train.py:996] (0/4) Epoch 5, batch 12400, loss[loss=0.2874, simple_loss=0.3353, pruned_loss=0.1197, over 21310.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3214, pruned_loss=0.0867, over 4279821.26 frames. ], batch size: 176, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:36:35,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=806274.0, ans=0.125 2023-06-20 21:36:46,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=806274.0, ans=0.2 2023-06-20 21:36:49,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.789e+02 3.195e+02 3.703e+02 5.558e+02, threshold=6.391e+02, percent-clipped=0.0 2023-06-20 21:36:51,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=806274.0, ans=0.2 2023-06-20 21:36:51,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=806274.0, ans=0.2 2023-06-20 21:37:08,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.22 vs. 
limit=10.0 2023-06-20 21:37:11,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=806334.0, ans=0.0 2023-06-20 21:38:03,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=806514.0, ans=0.2 2023-06-20 21:38:15,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=806514.0, ans=0.125 2023-06-20 21:38:25,368 INFO [train.py:996] (0/4) Epoch 5, batch 12450, loss[loss=0.3017, simple_loss=0.3741, pruned_loss=0.1147, over 21411.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3235, pruned_loss=0.08947, over 4280016.06 frames. ], batch size: 131, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:38:25,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=806574.0, ans=0.1 2023-06-20 21:38:43,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=806574.0, ans=0.0 2023-06-20 21:38:48,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=806634.0, ans=0.125 2023-06-20 21:39:55,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=806814.0, ans=0.1 2023-06-20 21:40:17,362 INFO [train.py:996] (0/4) Epoch 5, batch 12500, loss[loss=0.2833, simple_loss=0.373, pruned_loss=0.09683, over 21692.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3365, pruned_loss=0.09463, over 4281497.56 frames. ], batch size: 298, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:40:27,267 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 2.818e+02 3.353e+02 4.175e+02 5.969e+02, threshold=6.707e+02, percent-clipped=0.0 2023-06-20 21:40:44,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=806934.0, ans=0.125 2023-06-20 21:42:04,544 INFO [train.py:996] (0/4) Epoch 5, batch 12550, loss[loss=0.2517, simple_loss=0.3359, pruned_loss=0.08377, over 21788.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3393, pruned_loss=0.09633, over 4279760.75 frames. ], batch size: 282, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:42:28,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=807174.0, ans=0.125 2023-06-20 21:42:51,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=807294.0, ans=0.0 2023-06-20 21:43:53,053 INFO [train.py:996] (0/4) Epoch 5, batch 12600, loss[loss=0.1894, simple_loss=0.2654, pruned_loss=0.05668, over 21195.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3374, pruned_loss=0.09366, over 4281417.34 frames. ], batch size: 176, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:44:08,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.858e+02 3.271e+02 3.869e+02 6.376e+02, threshold=6.541e+02, percent-clipped=0.0 2023-06-20 21:44:12,726 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.68 vs. 
limit=15.0 2023-06-20 21:44:59,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=807654.0, ans=0.2 2023-06-20 21:45:35,595 INFO [train.py:996] (0/4) Epoch 5, batch 12650, loss[loss=0.2898, simple_loss=0.3473, pruned_loss=0.1161, over 21864.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3301, pruned_loss=0.09017, over 4279886.49 frames. ], batch size: 118, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:45:52,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=807774.0, ans=0.125 2023-06-20 21:46:01,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=807834.0, ans=0.125 2023-06-20 21:46:26,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=807894.0, ans=0.0 2023-06-20 21:47:20,050 INFO [train.py:996] (0/4) Epoch 5, batch 12700, loss[loss=0.2872, simple_loss=0.3506, pruned_loss=0.1119, over 21475.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3303, pruned_loss=0.09317, over 4285714.73 frames. ], batch size: 194, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:47:35,845 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.923e+02 3.430e+02 4.123e+02 8.274e+02, threshold=6.860e+02, percent-clipped=2.0 2023-06-20 21:47:41,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=808134.0, ans=0.125 2023-06-20 21:47:46,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=808134.0, ans=0.125 2023-06-20 21:47:59,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=808134.0, ans=0.0 2023-06-20 21:49:02,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=808374.0, ans=0.125 2023-06-20 21:49:03,372 INFO [train.py:996] (0/4) Epoch 5, batch 12750, loss[loss=0.2206, simple_loss=0.3076, pruned_loss=0.06682, over 21794.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3305, pruned_loss=0.09284, over 4286982.66 frames. ], batch size: 282, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:49:27,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=808434.0, ans=0.125 2023-06-20 21:49:43,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=808434.0, ans=0.125 2023-06-20 21:49:49,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=808494.0, ans=0.125 2023-06-20 21:50:15,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=808554.0, ans=0.125 2023-06-20 21:50:52,093 INFO [train.py:996] (0/4) Epoch 5, batch 12800, loss[loss=0.2541, simple_loss=0.3236, pruned_loss=0.09233, over 21820.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3309, pruned_loss=0.09394, over 4277472.48 frames. 
], batch size: 282, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:51:03,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.761e+02 3.177e+02 3.732e+02 6.852e+02, threshold=6.353e+02, percent-clipped=0.0 2023-06-20 21:51:47,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=808794.0, ans=0.07 2023-06-20 21:51:50,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=808794.0, ans=0.0 2023-06-20 21:52:14,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=808914.0, ans=0.125 2023-06-20 21:52:37,067 INFO [train.py:996] (0/4) Epoch 5, batch 12850, loss[loss=0.2578, simple_loss=0.3336, pruned_loss=0.09101, over 21431.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.332, pruned_loss=0.0947, over 4279650.31 frames. ], batch size: 131, lr: 6.28e-03, grad_scale: 32.0 2023-06-20 21:52:56,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=808974.0, ans=0.2 2023-06-20 21:53:09,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=809034.0, ans=0.125 2023-06-20 21:53:38,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=809094.0, ans=0.0 2023-06-20 21:53:50,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-20 21:54:02,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=809154.0, ans=0.125 2023-06-20 21:54:09,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=809214.0, ans=0.2 2023-06-20 21:54:27,305 INFO [train.py:996] (0/4) Epoch 5, batch 12900, loss[loss=0.2364, simple_loss=0.3108, pruned_loss=0.08097, over 21502.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3284, pruned_loss=0.09073, over 4275529.68 frames. ], batch size: 230, lr: 6.28e-03, grad_scale: 32.0 2023-06-20 21:54:33,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=809274.0, ans=0.1 2023-06-20 21:54:45,414 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.796e+02 3.210e+02 3.770e+02 8.746e+02, threshold=6.419e+02, percent-clipped=1.0 2023-06-20 21:54:49,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=809334.0, ans=0.04949747468305833 2023-06-20 21:55:06,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. 
limit=6.0 2023-06-20 21:55:12,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=809394.0, ans=0.125 2023-06-20 21:55:12,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=809394.0, ans=0.0 2023-06-20 21:55:17,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=809394.0, ans=0.125 2023-06-20 21:55:36,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=809454.0, ans=0.2 2023-06-20 21:56:17,478 INFO [train.py:996] (0/4) Epoch 5, batch 12950, loss[loss=0.2367, simple_loss=0.3252, pruned_loss=0.07409, over 21736.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3287, pruned_loss=0.08946, over 4272605.91 frames. ], batch size: 298, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 21:56:49,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=809634.0, ans=0.1 2023-06-20 21:57:52,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=809814.0, ans=0.0 2023-06-20 21:58:00,367 INFO [train.py:996] (0/4) Epoch 5, batch 13000, loss[loss=0.1748, simple_loss=0.2489, pruned_loss=0.05035, over 16088.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3276, pruned_loss=0.08914, over 4272184.04 frames. ], batch size: 60, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 21:58:01,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=809874.0, ans=0.1 2023-06-20 21:58:09,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=809874.0, ans=0.125 2023-06-20 21:58:15,215 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.939e+02 3.382e+02 4.149e+02 6.832e+02, threshold=6.764e+02, percent-clipped=3.0 2023-06-20 21:58:17,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=809934.0, ans=0.125 2023-06-20 21:58:19,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=809934.0, ans=0.125 2023-06-20 21:58:59,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=809994.0, ans=0.0 2023-06-20 21:59:14,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=810054.0, ans=0.035 2023-06-20 21:59:26,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=810114.0, ans=0.125 2023-06-20 21:59:39,904 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=22.5 2023-06-20 21:59:42,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=810114.0, ans=0.04949747468305833 2023-06-20 21:59:45,262 INFO [train.py:996] (0/4) Epoch 5, batch 13050, loss[loss=0.2305, simple_loss=0.3011, pruned_loss=0.07998, over 21659.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3246, pruned_loss=0.08744, over 4276281.81 frames. 
], batch size: 230, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 22:00:14,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=810234.0, ans=0.2 2023-06-20 22:00:16,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=810234.0, ans=0.09899494936611666 2023-06-20 22:00:19,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=810234.0, ans=0.0 2023-06-20 22:00:26,831 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-20 22:00:33,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=810294.0, ans=0.2 2023-06-20 22:01:07,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=810414.0, ans=0.125 2023-06-20 22:01:29,105 INFO [train.py:996] (0/4) Epoch 5, batch 13100, loss[loss=0.2433, simple_loss=0.3233, pruned_loss=0.0817, over 21795.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3248, pruned_loss=0.08763, over 4279132.25 frames. ], batch size: 332, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 22:01:49,419 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.828e+02 3.544e+02 4.567e+02 8.084e+02, threshold=7.089e+02, percent-clipped=1.0 2023-06-20 22:03:11,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=810714.0, ans=0.125 2023-06-20 22:03:19,535 INFO [train.py:996] (0/4) Epoch 5, batch 13150, loss[loss=0.2056, simple_loss=0.2806, pruned_loss=0.06527, over 21429.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3261, pruned_loss=0.09029, over 4277922.87 frames. ], batch size: 211, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 22:03:25,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2023-06-20 22:03:31,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=810774.0, ans=0.0 2023-06-20 22:04:31,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=12.0 2023-06-20 22:04:32,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=810954.0, ans=0.0 2023-06-20 22:05:03,356 INFO [train.py:996] (0/4) Epoch 5, batch 13200, loss[loss=0.2445, simple_loss=0.3352, pruned_loss=0.07689, over 20109.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3248, pruned_loss=0.09029, over 4274051.80 frames. 
], batch size: 702, lr: 6.28e-03, grad_scale: 16.0 2023-06-20 22:05:18,459 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 2.901e+02 3.344e+02 4.017e+02 7.205e+02, threshold=6.688e+02, percent-clipped=1.0 2023-06-20 22:05:18,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=811134.0, ans=0.1 2023-06-20 22:05:43,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=811134.0, ans=0.0 2023-06-20 22:06:30,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=811314.0, ans=0.0 2023-06-20 22:06:32,004 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:06:48,163 INFO [train.py:996] (0/4) Epoch 5, batch 13250, loss[loss=0.2399, simple_loss=0.3155, pruned_loss=0.08215, over 21392.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3262, pruned_loss=0.09098, over 4274103.20 frames. ], batch size: 194, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:06:51,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-20 22:07:06,625 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=22.5 2023-06-20 22:08:02,471 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=15.0 2023-06-20 22:08:15,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=811614.0, ans=0.0 2023-06-20 22:08:25,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=811614.0, ans=0.05 2023-06-20 22:08:39,120 INFO [train.py:996] (0/4) Epoch 5, batch 13300, loss[loss=0.2999, simple_loss=0.3582, pruned_loss=0.1208, over 21751.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3306, pruned_loss=0.09166, over 4275786.89 frames. ], batch size: 441, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:08:59,102 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 2.849e+02 3.375e+02 4.056e+02 7.431e+02, threshold=6.749e+02, percent-clipped=1.0 2023-06-20 22:09:11,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=811734.0, ans=0.125 2023-06-20 22:09:53,636 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:10:28,593 INFO [train.py:996] (0/4) Epoch 5, batch 13350, loss[loss=0.2874, simple_loss=0.3491, pruned_loss=0.1128, over 21240.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3358, pruned_loss=0.09443, over 4276197.05 frames. ], batch size: 159, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:11:38,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.16 vs. 
limit=22.5 2023-06-20 22:11:39,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=812154.0, ans=0.0 2023-06-20 22:11:48,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=22.5 2023-06-20 22:12:08,363 INFO [train.py:996] (0/4) Epoch 5, batch 13400, loss[loss=0.2386, simple_loss=0.3075, pruned_loss=0.08485, over 21918.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3385, pruned_loss=0.09793, over 4284567.75 frames. ], batch size: 124, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:12:22,373 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.312e+02 2.994e+02 3.475e+02 4.105e+02 5.675e+02, threshold=6.951e+02, percent-clipped=0.0 2023-06-20 22:12:50,726 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-20 22:12:55,512 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=15.0 2023-06-20 22:13:07,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=812454.0, ans=0.0 2023-06-20 22:13:39,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=812514.0, ans=0.0 2023-06-20 22:13:46,638 INFO [train.py:996] (0/4) Epoch 5, batch 13450, loss[loss=0.2507, simple_loss=0.3063, pruned_loss=0.09756, over 21744.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3395, pruned_loss=0.1004, over 4289653.34 frames. ], batch size: 124, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:14:19,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-20 22:14:47,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=812694.0, ans=0.0 2023-06-20 22:14:53,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=812754.0, ans=0.0 2023-06-20 22:15:33,819 INFO [train.py:996] (0/4) Epoch 5, batch 13500, loss[loss=0.1823, simple_loss=0.2391, pruned_loss=0.06274, over 21327.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3318, pruned_loss=0.09684, over 4274234.92 frames. ], batch size: 176, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:15:53,682 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.173e+02 3.616e+02 4.505e+02 8.152e+02, threshold=7.232e+02, percent-clipped=1.0 2023-06-20 22:15:56,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.05 vs. limit=6.0 2023-06-20 22:16:26,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=812994.0, ans=0.125 2023-06-20 22:17:15,896 INFO [train.py:996] (0/4) Epoch 5, batch 13550, loss[loss=0.3008, simple_loss=0.3988, pruned_loss=0.1014, over 21642.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.336, pruned_loss=0.09562, over 4271457.83 frames. 
], batch size: 389, lr: 6.27e-03, grad_scale: 8.0 2023-06-20 22:17:40,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-20 22:17:59,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=813294.0, ans=0.0 2023-06-20 22:18:28,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=813354.0, ans=0.05 2023-06-20 22:18:55,156 INFO [train.py:996] (0/4) Epoch 5, batch 13600, loss[loss=0.2522, simple_loss=0.3131, pruned_loss=0.09565, over 21281.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3372, pruned_loss=0.0965, over 4276795.87 frames. ], batch size: 143, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:19:16,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.895e+02 3.506e+02 4.425e+02 7.285e+02, threshold=7.012e+02, percent-clipped=2.0 2023-06-20 22:20:39,922 INFO [train.py:996] (0/4) Epoch 5, batch 13650, loss[loss=0.2304, simple_loss=0.295, pruned_loss=0.08292, over 21756.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3318, pruned_loss=0.09299, over 4279886.13 frames. ], batch size: 316, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:20:48,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=813774.0, ans=0.125 2023-06-20 22:21:12,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=813834.0, ans=0.2 2023-06-20 22:21:55,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=813954.0, ans=0.125 2023-06-20 22:22:10,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=814014.0, ans=0.2 2023-06-20 22:22:19,152 INFO [train.py:996] (0/4) Epoch 5, batch 13700, loss[loss=0.2763, simple_loss=0.3506, pruned_loss=0.101, over 20133.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3285, pruned_loss=0.09358, over 4273060.83 frames. ], batch size: 703, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:22:24,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=814074.0, ans=0.125 2023-06-20 22:22:41,394 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.806e+02 3.342e+02 4.306e+02 8.545e+02, threshold=6.684e+02, percent-clipped=2.0 2023-06-20 22:22:56,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=814194.0, ans=0.1 2023-06-20 22:24:01,305 INFO [train.py:996] (0/4) Epoch 5, batch 13750, loss[loss=0.207, simple_loss=0.2676, pruned_loss=0.07322, over 21340.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3255, pruned_loss=0.09342, over 4273059.52 frames. 
], batch size: 131, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:24:33,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=814434.0, ans=0.1 2023-06-20 22:24:55,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=814494.0, ans=0.0 2023-06-20 22:25:49,708 INFO [train.py:996] (0/4) Epoch 5, batch 13800, loss[loss=0.285, simple_loss=0.3889, pruned_loss=0.0906, over 21675.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3292, pruned_loss=0.09229, over 4278576.50 frames. ], batch size: 389, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:26:03,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=814674.0, ans=0.2 2023-06-20 22:26:06,018 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 2.993e+02 3.321e+02 4.024e+02 5.976e+02, threshold=6.643e+02, percent-clipped=0.0 2023-06-20 22:26:34,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=814794.0, ans=0.125 2023-06-20 22:26:34,555 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:27:03,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=814914.0, ans=0.0 2023-06-20 22:27:26,026 INFO [train.py:996] (0/4) Epoch 5, batch 13850, loss[loss=0.2975, simple_loss=0.3604, pruned_loss=0.1173, over 21617.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3341, pruned_loss=0.09273, over 4275625.18 frames. ], batch size: 263, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:28:25,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=815154.0, ans=0.1 2023-06-20 22:28:26,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=815154.0, ans=0.125 2023-06-20 22:28:28,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=815154.0, ans=0.09899494936611666 2023-06-20 22:28:52,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=815214.0, ans=0.125 2023-06-20 22:29:01,723 INFO [train.py:996] (0/4) Epoch 5, batch 13900, loss[loss=0.2684, simple_loss=0.3337, pruned_loss=0.1015, over 21689.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3391, pruned_loss=0.09634, over 4274779.51 frames. 
], batch size: 389, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:29:13,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=815274.0, ans=0.0 2023-06-20 22:29:26,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=815334.0, ans=0.0 2023-06-20 22:29:27,675 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 2.909e+02 3.378e+02 3.977e+02 7.082e+02, threshold=6.756e+02, percent-clipped=1.0 2023-06-20 22:29:47,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=815394.0, ans=0.125 2023-06-20 22:29:54,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=815394.0, ans=0.0 2023-06-20 22:30:27,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=815514.0, ans=0.0 2023-06-20 22:30:37,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=815514.0, ans=0.0 2023-06-20 22:30:41,777 INFO [train.py:996] (0/4) Epoch 5, batch 13950, loss[loss=0.2604, simple_loss=0.3312, pruned_loss=0.09474, over 21885.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3389, pruned_loss=0.09854, over 4286293.14 frames. ], batch size: 316, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:31:03,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=815574.0, ans=0.125 2023-06-20 22:31:33,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=815694.0, ans=0.125 2023-06-20 22:31:41,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=815694.0, ans=0.125 2023-06-20 22:32:24,851 INFO [train.py:996] (0/4) Epoch 5, batch 14000, loss[loss=0.218, simple_loss=0.312, pruned_loss=0.06197, over 21819.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3329, pruned_loss=0.09475, over 4286084.20 frames. ], batch size: 282, lr: 6.26e-03, grad_scale: 32.0 2023-06-20 22:32:35,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=815874.0, ans=0.1 2023-06-20 22:32:40,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=815874.0, ans=0.125 2023-06-20 22:32:45,991 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.619e+02 3.139e+02 3.881e+02 6.690e+02, threshold=6.278e+02, percent-clipped=0.0 2023-06-20 22:33:01,916 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-136000.pt 2023-06-20 22:33:56,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=816114.0, ans=0.125 2023-06-20 22:33:56,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=816114.0, ans=0.1 2023-06-20 22:34:05,240 INFO [train.py:996] (0/4) Epoch 5, batch 14050, loss[loss=0.1916, simple_loss=0.273, pruned_loss=0.05503, over 21682.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3286, pruned_loss=0.09059, over 4289135.30 frames. 
], batch size: 247, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:34:31,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=816234.0, ans=0.125 2023-06-20 22:34:36,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=816234.0, ans=0.125 2023-06-20 22:35:21,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=816354.0, ans=0.0 2023-06-20 22:35:24,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=816414.0, ans=0.125 2023-06-20 22:35:41,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=816414.0, ans=0.07 2023-06-20 22:35:44,107 INFO [train.py:996] (0/4) Epoch 5, batch 14100, loss[loss=0.3155, simple_loss=0.3678, pruned_loss=0.1316, over 21531.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3215, pruned_loss=0.08974, over 4286314.30 frames. ], batch size: 389, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:36:07,021 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.771e+02 3.154e+02 4.028e+02 6.108e+02, threshold=6.308e+02, percent-clipped=0.0 2023-06-20 22:36:07,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=816534.0, ans=0.125 2023-06-20 22:36:23,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-06-20 22:36:31,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-20 22:36:43,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=816654.0, ans=0.2 2023-06-20 22:37:11,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=816714.0, ans=0.125 2023-06-20 22:37:15,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=816714.0, ans=0.125 2023-06-20 22:37:18,636 INFO [train.py:996] (0/4) Epoch 5, batch 14150, loss[loss=0.2396, simple_loss=0.3169, pruned_loss=0.08113, over 21120.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.325, pruned_loss=0.09095, over 4279046.73 frames. ], batch size: 143, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:38:11,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=816894.0, ans=0.125 2023-06-20 22:38:34,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=816954.0, ans=0.125 2023-06-20 22:38:53,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=817014.0, ans=0.1 2023-06-20 22:38:57,651 INFO [train.py:996] (0/4) Epoch 5, batch 14200, loss[loss=0.2179, simple_loss=0.2905, pruned_loss=0.07261, over 21649.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3243, pruned_loss=0.08954, over 4268064.58 frames. 
], batch size: 230, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:39:09,494 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0 2023-06-20 22:39:14,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=817074.0, ans=0.125 2023-06-20 22:39:19,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.484e+02 2.963e+02 3.702e+02 8.044e+02, threshold=5.927e+02, percent-clipped=3.0 2023-06-20 22:40:15,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=817314.0, ans=0.0 2023-06-20 22:40:36,170 INFO [train.py:996] (0/4) Epoch 5, batch 14250, loss[loss=0.257, simple_loss=0.3198, pruned_loss=0.09715, over 21430.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3192, pruned_loss=0.09015, over 4264291.44 frames. ], batch size: 473, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:40:43,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-20 22:40:52,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=817374.0, ans=0.1 2023-06-20 22:41:22,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=817494.0, ans=0.0 2023-06-20 22:41:25,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=817494.0, ans=0.07 2023-06-20 22:41:50,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=817554.0, ans=15.0 2023-06-20 22:42:17,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=817674.0, ans=0.0 2023-06-20 22:42:22,722 INFO [train.py:996] (0/4) Epoch 5, batch 14300, loss[loss=0.2296, simple_loss=0.3227, pruned_loss=0.06822, over 21779.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3208, pruned_loss=0.089, over 4271376.57 frames. ], batch size: 282, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:42:46,155 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 3.147e+02 3.840e+02 5.009e+02 9.347e+02, threshold=7.680e+02, percent-clipped=16.0 2023-06-20 22:43:31,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=817854.0, ans=0.125 2023-06-20 22:43:44,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=817914.0, ans=0.2 2023-06-20 22:44:02,697 INFO [train.py:996] (0/4) Epoch 5, batch 14350, loss[loss=0.2093, simple_loss=0.2705, pruned_loss=0.07404, over 21363.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3284, pruned_loss=0.0905, over 4271386.04 frames. ], batch size: 131, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:44:07,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=817974.0, ans=0.125 2023-06-20 22:44:27,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. 
limit=6.0 2023-06-20 22:44:46,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=818094.0, ans=0.1 2023-06-20 22:44:46,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=818094.0, ans=0.125 2023-06-20 22:44:49,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=818094.0, ans=0.07 2023-06-20 22:44:57,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=818094.0, ans=0.0 2023-06-20 22:45:03,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=818154.0, ans=0.0 2023-06-20 22:45:14,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=818214.0, ans=0.125 2023-06-20 22:45:40,634 INFO [train.py:996] (0/4) Epoch 5, batch 14400, loss[loss=0.255, simple_loss=0.3074, pruned_loss=0.1013, over 21498.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3248, pruned_loss=0.09074, over 4276868.50 frames. ], batch size: 212, lr: 6.25e-03, grad_scale: 32.0 2023-06-20 22:45:58,171 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.774e+02 3.108e+02 3.689e+02 4.790e+02, threshold=6.217e+02, percent-clipped=0.0 2023-06-20 22:47:20,560 INFO [train.py:996] (0/4) Epoch 5, batch 14450, loss[loss=0.2502, simple_loss=0.303, pruned_loss=0.09864, over 21577.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3202, pruned_loss=0.09171, over 4277762.92 frames. ], batch size: 212, lr: 6.25e-03, grad_scale: 32.0 2023-06-20 22:47:28,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=818574.0, ans=0.2 2023-06-20 22:47:34,196 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=22.5 2023-06-20 22:47:45,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=818634.0, ans=0.125 2023-06-20 22:48:03,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=818694.0, ans=0.0 2023-06-20 22:48:58,553 INFO [train.py:996] (0/4) Epoch 5, batch 14500, loss[loss=0.2194, simple_loss=0.3056, pruned_loss=0.06662, over 21608.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3153, pruned_loss=0.09058, over 4274875.20 frames. 
], batch size: 263, lr: 6.25e-03, grad_scale: 32.0 2023-06-20 22:48:59,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=818874.0, ans=0.0 2023-06-20 22:49:02,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=818874.0, ans=0.0 2023-06-20 22:49:16,439 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.793e+02 3.259e+02 3.991e+02 5.427e+02, threshold=6.518e+02, percent-clipped=0.0 2023-06-20 22:49:23,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=818934.0, ans=0.125 2023-06-20 22:49:51,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=818994.0, ans=0.0 2023-06-20 22:49:57,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=819054.0, ans=0.07 2023-06-20 22:50:07,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=819054.0, ans=0.125 2023-06-20 22:50:07,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=819054.0, ans=0.0 2023-06-20 22:50:29,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=819114.0, ans=0.125 2023-06-20 22:50:40,054 INFO [train.py:996] (0/4) Epoch 5, batch 14550, loss[loss=0.3237, simple_loss=0.3855, pruned_loss=0.131, over 21497.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3206, pruned_loss=0.0932, over 4274316.55 frames. ], batch size: 131, lr: 6.24e-03, grad_scale: 32.0 2023-06-20 22:50:41,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=15.0 2023-06-20 22:51:39,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=819354.0, ans=12.0 2023-06-20 22:51:53,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=819354.0, ans=0.1 2023-06-20 22:52:01,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=819354.0, ans=0.125 2023-06-20 22:52:20,178 INFO [train.py:996] (0/4) Epoch 5, batch 14600, loss[loss=0.3285, simple_loss=0.3861, pruned_loss=0.1354, over 21554.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3296, pruned_loss=0.09821, over 4274583.31 frames. ], batch size: 414, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:52:21,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.74 vs. 
limit=15.0 2023-06-20 22:52:40,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=819534.0, ans=0.125 2023-06-20 22:52:44,011 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.146e+02 3.577e+02 4.655e+02 8.854e+02, threshold=7.154e+02, percent-clipped=8.0 2023-06-20 22:52:46,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=819534.0, ans=0.125 2023-06-20 22:53:13,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=819594.0, ans=0.0 2023-06-20 22:53:24,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=819654.0, ans=0.2 2023-06-20 22:53:24,870 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-20 22:53:48,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=819714.0, ans=0.125 2023-06-20 22:54:00,102 INFO [train.py:996] (0/4) Epoch 5, batch 14650, loss[loss=0.2222, simple_loss=0.2959, pruned_loss=0.0743, over 21361.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3303, pruned_loss=0.09655, over 4267946.44 frames. ], batch size: 159, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:54:27,471 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-06-20 22:54:40,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=819834.0, ans=0.125 2023-06-20 22:54:50,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-20 22:55:27,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=820014.0, ans=0.2 2023-06-20 22:55:32,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=820014.0, ans=0.1 2023-06-20 22:55:41,429 INFO [train.py:996] (0/4) Epoch 5, batch 14700, loss[loss=0.2201, simple_loss=0.3086, pruned_loss=0.06576, over 21758.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3232, pruned_loss=0.08975, over 4265461.07 frames. ], batch size: 247, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:55:43,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=820074.0, ans=0.1 2023-06-20 22:56:11,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 2.525e+02 2.974e+02 3.979e+02 6.680e+02, threshold=5.948e+02, percent-clipped=0.0 2023-06-20 22:56:33,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=820194.0, ans=0.0 2023-06-20 22:57:01,623 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.75 vs. limit=6.0 2023-06-20 22:57:23,649 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. 
limit=6.0 2023-06-20 22:57:29,230 INFO [train.py:996] (0/4) Epoch 5, batch 14750, loss[loss=0.2745, simple_loss=0.3308, pruned_loss=0.1091, over 20242.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3279, pruned_loss=0.09211, over 4267013.18 frames. ], batch size: 707, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:57:39,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=820374.0, ans=0.125 2023-06-20 22:57:39,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=820374.0, ans=0.0 2023-06-20 22:58:21,664 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:59:11,141 INFO [train.py:996] (0/4) Epoch 5, batch 14800, loss[loss=0.2599, simple_loss=0.3335, pruned_loss=0.0931, over 20787.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3385, pruned_loss=0.09784, over 4260278.85 frames. ], batch size: 611, lr: 6.24e-03, grad_scale: 32.0 2023-06-20 22:59:28,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=820734.0, ans=0.125 2023-06-20 22:59:30,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.410e+02 3.152e+02 3.633e+02 4.425e+02 1.058e+03, threshold=7.266e+02, percent-clipped=3.0 2023-06-20 22:59:33,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=820734.0, ans=0.0 2023-06-20 22:59:42,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=820734.0, ans=0.125 2023-06-20 22:59:58,925 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2023-06-20 23:00:06,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=820794.0, ans=0.2 2023-06-20 23:00:15,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=820854.0, ans=0.125 2023-06-20 23:00:30,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=820854.0, ans=0.95 2023-06-20 23:00:55,545 INFO [train.py:996] (0/4) Epoch 5, batch 14850, loss[loss=0.2384, simple_loss=0.3053, pruned_loss=0.08575, over 21570.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3325, pruned_loss=0.09679, over 4254569.06 frames. ], batch size: 230, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 23:01:14,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=820974.0, ans=0.125 2023-06-20 23:01:16,646 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.12 vs. 
limit=15.0 2023-06-20 23:01:25,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=821034.0, ans=0.1 2023-06-20 23:01:38,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=821094.0, ans=0.125 2023-06-20 23:02:18,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-20 23:02:22,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=821214.0, ans=0.0 2023-06-20 23:02:22,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=821214.0, ans=0.125 2023-06-20 23:02:36,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-20 23:02:37,279 INFO [train.py:996] (0/4) Epoch 5, batch 14900, loss[loss=0.2518, simple_loss=0.3223, pruned_loss=0.09065, over 21578.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3345, pruned_loss=0.09792, over 4259667.45 frames. ], batch size: 263, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 23:03:05,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=821334.0, ans=0.1 2023-06-20 23:03:08,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 3.108e+02 3.722e+02 4.348e+02 7.688e+02, threshold=7.444e+02, percent-clipped=1.0 2023-06-20 23:03:31,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=821394.0, ans=0.125 2023-06-20 23:04:29,725 INFO [train.py:996] (0/4) Epoch 5, batch 14950, loss[loss=0.2972, simple_loss=0.3636, pruned_loss=0.1155, over 21255.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3363, pruned_loss=0.09761, over 4257572.63 frames. ], batch size: 176, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 23:04:38,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=821574.0, ans=0.0 2023-06-20 23:05:03,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=821634.0, ans=0.0 2023-06-20 23:05:17,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=821694.0, ans=0.0 2023-06-20 23:06:10,046 INFO [train.py:996] (0/4) Epoch 5, batch 15000, loss[loss=0.264, simple_loss=0.3453, pruned_loss=0.09135, over 19706.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3388, pruned_loss=0.09955, over 4265797.71 frames. ], batch size: 703, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:06:10,048 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-20 23:06:26,232 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2595, simple_loss=0.3578, pruned_loss=0.08055, over 1796401.00 frames. 
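A note on reading the loss fields in these train.py progress lines: the logged totals are numerically consistent with loss being roughly 0.5 * simple_loss + pruned_loss (for example, the Epoch 5 validation entry just above: 0.5 * 0.3578 + 0.08055 = 0.2595). The short Python sketch below is a hypothetical log-reading helper, not part of train.py or icefall; the regex and function name are assumptions based only on the line format shown in this log, and the 0.5 weight is inferred from the logged numbers rather than taken from the training code.

import re

# Hypothetical helper: matches the running-average report in a progress line,
# e.g. "tot_loss[loss=0.2689, simple_loss=0.3388, pruned_loss=0.09955, over ... frames. ]"
# Format is inferred from the lines in this log, not from icefall source.
TOT_LOSS_RE = re.compile(
    r"tot_loss\[loss=(?P<loss>[0-9.]+), simple_loss=(?P<simple>[0-9.]+), "
    r"pruned_loss=(?P<pruned>[0-9.]+)"
)

def check_tot_loss(line, simple_scale=0.5, tol=5e-3):
    """Return True if the line carries no tot_loss report, or if the logged
    total matches simple_scale * simple_loss + pruned_loss within tol.
    The 0.5 default is an assumption read off the logged numbers."""
    m = TOT_LOSS_RE.search(line)
    if m is None:
        return True
    loss = float(m.group("loss"))
    simple = float(m.group("simple"))
    pruned = float(m.group("pruned"))
    return abs(loss - (simple_scale * simple + pruned)) < tol

# Example taken from a line in this section (Epoch 5, batch 15000):
example = ("tot_loss[loss=0.2689, simple_loss=0.3388, "
           "pruned_loss=0.09955, over 4265797.71 frames. ]")
assert check_tot_loss(example)  # 0.5 * 0.3388 + 0.09955 = 0.26895

A similar regex over the optim.py entries ("grad-norm quartiles ... threshold=..., percent-clipped=...") would recover the gradient-clipping statistics if one wanted to plot them alongside the loss curves extracted this way.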
2023-06-20 23:06:26,232 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-20 23:07:00,534 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.991e+02 3.617e+02 4.837e+02 7.610e+02, threshold=7.234e+02, percent-clipped=2.0 2023-06-20 23:07:15,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.06 vs. limit=22.5 2023-06-20 23:07:53,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=822114.0, ans=0.125 2023-06-20 23:07:59,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=822114.0, ans=0.0 2023-06-20 23:07:59,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=822114.0, ans=0.125 2023-06-20 23:08:12,427 INFO [train.py:996] (0/4) Epoch 5, batch 15050, loss[loss=0.2504, simple_loss=0.3466, pruned_loss=0.07715, over 21742.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3404, pruned_loss=0.1003, over 4255116.03 frames. ], batch size: 332, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:08:15,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=822174.0, ans=0.2 2023-06-20 23:08:38,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=822234.0, ans=0.04949747468305833 2023-06-20 23:09:30,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=15.0 2023-06-20 23:09:31,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=822354.0, ans=0.1 2023-06-20 23:09:59,497 INFO [train.py:996] (0/4) Epoch 5, batch 15100, loss[loss=0.2602, simple_loss=0.3325, pruned_loss=0.0939, over 21803.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3434, pruned_loss=0.1001, over 4261433.37 frames. ], batch size: 282, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:10:00,584 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=12.0 2023-06-20 23:10:22,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=822534.0, ans=0.125 2023-06-20 23:10:25,331 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.469e+02 3.218e+02 4.050e+02 5.256e+02 8.500e+02, threshold=8.100e+02, percent-clipped=5.0 2023-06-20 23:10:38,911 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:10:56,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=822594.0, ans=0.125 2023-06-20 23:11:05,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-20 23:11:19,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=822714.0, ans=0.2 2023-06-20 23:11:44,946 INFO [train.py:996] (0/4) Epoch 5, batch 15150, loss[loss=0.3049, simple_loss=0.3834, pruned_loss=0.1131, over 20731.00 frames. 
], tot_loss[loss=0.269, simple_loss=0.339, pruned_loss=0.09947, over 4260053.34 frames. ], batch size: 607, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:11:52,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-20 23:12:46,874 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.28 vs. limit=12.0 2023-06-20 23:12:48,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=22.5 2023-06-20 23:13:20,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=823014.0, ans=0.125 2023-06-20 23:13:24,126 INFO [train.py:996] (0/4) Epoch 5, batch 15200, loss[loss=0.212, simple_loss=0.2776, pruned_loss=0.07319, over 22016.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3278, pruned_loss=0.09405, over 4254892.52 frames. ], batch size: 103, lr: 6.23e-03, grad_scale: 32.0 2023-06-20 23:13:42,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=823134.0, ans=0.2 2023-06-20 23:13:45,056 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.736e+02 3.206e+02 4.003e+02 7.087e+02, threshold=6.412e+02, percent-clipped=0.0 2023-06-20 23:13:49,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=12.0 2023-06-20 23:13:54,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=823134.0, ans=0.125 2023-06-20 23:14:19,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=823194.0, ans=0.125 2023-06-20 23:14:22,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=823254.0, ans=0.125 2023-06-20 23:15:01,161 INFO [train.py:996] (0/4) Epoch 5, batch 15250, loss[loss=0.243, simple_loss=0.2998, pruned_loss=0.09307, over 21820.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3219, pruned_loss=0.09265, over 4260633.95 frames. ], batch size: 317, lr: 6.23e-03, grad_scale: 32.0 2023-06-20 23:15:35,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=823434.0, ans=0.125 2023-06-20 23:16:00,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=823554.0, ans=0.125 2023-06-20 23:16:32,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=823614.0, ans=0.125 2023-06-20 23:16:40,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5 2023-06-20 23:16:42,987 INFO [train.py:996] (0/4) Epoch 5, batch 15300, loss[loss=0.304, simple_loss=0.3587, pruned_loss=0.1247, over 21418.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3262, pruned_loss=0.09668, over 4245830.72 frames. 
], batch size: 471, lr: 6.23e-03, grad_scale: 32.0 2023-06-20 23:17:04,275 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.998e+02 3.594e+02 4.256e+02 7.669e+02, threshold=7.187e+02, percent-clipped=3.0 2023-06-20 23:17:48,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=823854.0, ans=0.125 2023-06-20 23:17:49,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=823854.0, ans=0.0 2023-06-20 23:17:51,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=823854.0, ans=0.95 2023-06-20 23:18:12,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=823914.0, ans=0.0 2023-06-20 23:18:18,423 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-20 23:18:23,683 INFO [train.py:996] (0/4) Epoch 5, batch 15350, loss[loss=0.3166, simple_loss=0.373, pruned_loss=0.1302, over 21901.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3333, pruned_loss=0.09976, over 4255202.48 frames. ], batch size: 371, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:18:27,107 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:18:32,995 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.28 vs. limit=10.0 2023-06-20 23:18:44,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-20 23:19:24,496 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-20 23:19:25,866 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=15.0 2023-06-20 23:19:26,745 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:20:03,477 INFO [train.py:996] (0/4) Epoch 5, batch 15400, loss[loss=0.2828, simple_loss=0.3431, pruned_loss=0.1112, over 21738.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3341, pruned_loss=0.09725, over 4245171.39 frames. ], batch size: 389, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:20:10,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=824274.0, ans=0.0 2023-06-20 23:20:25,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.899e+02 3.241e+02 4.047e+02 6.361e+02, threshold=6.483e+02, percent-clipped=0.0 2023-06-20 23:21:26,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=824514.0, ans=0.0 2023-06-20 23:21:39,517 INFO [train.py:996] (0/4) Epoch 5, batch 15450, loss[loss=0.2159, simple_loss=0.3043, pruned_loss=0.06373, over 21616.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3321, pruned_loss=0.09685, over 4248236.64 frames. 
], batch size: 263, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:21:48,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=824574.0, ans=10.0 2023-06-20 23:22:33,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=824694.0, ans=0.125 2023-06-20 23:22:37,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-20 23:23:02,297 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:23:13,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-20 23:23:20,712 INFO [train.py:996] (0/4) Epoch 5, batch 15500, loss[loss=0.2809, simple_loss=0.3749, pruned_loss=0.09348, over 18298.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3338, pruned_loss=0.09672, over 4254698.62 frames. ], batch size: 60, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:23:26,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=824874.0, ans=0.125 2023-06-20 23:23:54,387 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.818e+02 3.290e+02 3.883e+02 6.635e+02, threshold=6.579e+02, percent-clipped=1.0 2023-06-20 23:24:02,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=824994.0, ans=0.125 2023-06-20 23:25:02,107 INFO [train.py:996] (0/4) Epoch 5, batch 15550, loss[loss=0.2508, simple_loss=0.3212, pruned_loss=0.09016, over 21560.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3303, pruned_loss=0.09347, over 4262162.88 frames. ], batch size: 441, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:25:26,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=825234.0, ans=0.2 2023-06-20 23:26:23,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=825354.0, ans=0.025 2023-06-20 23:26:42,180 INFO [train.py:996] (0/4) Epoch 5, batch 15600, loss[loss=0.2156, simple_loss=0.2819, pruned_loss=0.07462, over 21758.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3234, pruned_loss=0.09211, over 4262648.79 frames. 
], batch size: 351, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:27:09,689 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.848e+02 3.319e+02 3.887e+02 5.745e+02, threshold=6.638e+02, percent-clipped=0.0 2023-06-20 23:27:11,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=825534.0, ans=0.125 2023-06-20 23:27:14,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=825534.0, ans=0.125 2023-06-20 23:27:25,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=825594.0, ans=0.0 2023-06-20 23:27:34,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=825594.0, ans=0.07 2023-06-20 23:27:34,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-06-20 23:27:38,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=22.5 2023-06-20 23:27:55,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=825654.0, ans=0.125 2023-06-20 23:28:17,191 INFO [train.py:996] (0/4) Epoch 5, batch 15650, loss[loss=0.2527, simple_loss=0.3123, pruned_loss=0.09652, over 21813.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3224, pruned_loss=0.09109, over 4258603.87 frames. ], batch size: 102, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:28:41,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=825834.0, ans=0.2 2023-06-20 23:28:50,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=825834.0, ans=0.125 2023-06-20 23:29:10,283 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:29:31,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=825954.0, ans=0.125 2023-06-20 23:29:31,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=825954.0, ans=0.1 2023-06-20 23:29:37,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=825954.0, ans=0.125 2023-06-20 23:29:46,566 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-20 23:30:01,354 INFO [train.py:996] (0/4) Epoch 5, batch 15700, loss[loss=0.2271, simple_loss=0.303, pruned_loss=0.07559, over 21534.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3196, pruned_loss=0.09023, over 4261569.53 frames. 
], batch size: 230, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:30:02,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=826074.0, ans=0.125 2023-06-20 23:30:20,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=826074.0, ans=0.2 2023-06-20 23:30:29,893 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.764e+02 3.253e+02 4.322e+02 6.346e+02, threshold=6.507e+02, percent-clipped=0.0 2023-06-20 23:30:35,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=826134.0, ans=0.125 2023-06-20 23:31:16,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=826254.0, ans=0.0 2023-06-20 23:31:38,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=826314.0, ans=0.0 2023-06-20 23:31:41,404 INFO [train.py:996] (0/4) Epoch 5, batch 15750, loss[loss=0.254, simple_loss=0.3142, pruned_loss=0.09687, over 21180.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3154, pruned_loss=0.08975, over 4266770.11 frames. ], batch size: 143, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:31:48,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=826374.0, ans=0.025 2023-06-20 23:32:43,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=826554.0, ans=0.0 2023-06-20 23:33:06,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=826614.0, ans=0.0 2023-06-20 23:33:20,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=826674.0, ans=0.125 2023-06-20 23:33:21,189 INFO [train.py:996] (0/4) Epoch 5, batch 15800, loss[loss=0.208, simple_loss=0.2657, pruned_loss=0.07513, over 21266.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3109, pruned_loss=0.08897, over 4268342.40 frames. ], batch size: 176, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:33:50,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.927e+02 3.607e+02 4.746e+02 7.598e+02, threshold=7.214e+02, percent-clipped=2.0 2023-06-20 23:34:52,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=826914.0, ans=0.1 2023-06-20 23:35:01,916 INFO [train.py:996] (0/4) Epoch 5, batch 15850, loss[loss=0.2354, simple_loss=0.2894, pruned_loss=0.09074, over 21628.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3146, pruned_loss=0.09142, over 4257894.28 frames. ], batch size: 282, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:35:02,399 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:35:16,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.10 vs. limit=15.0 2023-06-20 23:35:49,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=827094.0, ans=0.125 2023-06-20 23:36:41,694 INFO [train.py:996] (0/4) Epoch 5, batch 15900, loss[loss=0.2258, simple_loss=0.2796, pruned_loss=0.08602, over 21375.00 frames. 
], tot_loss[loss=0.2503, simple_loss=0.3157, pruned_loss=0.09244, over 4258994.52 frames. ], batch size: 211, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:37:11,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.862e+02 3.189e+02 4.240e+02 8.969e+02, threshold=6.379e+02, percent-clipped=1.0 2023-06-20 23:37:23,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=827394.0, ans=0.125 2023-06-20 23:38:12,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=827514.0, ans=0.0 2023-06-20 23:38:19,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-20 23:38:22,892 INFO [train.py:996] (0/4) Epoch 5, batch 15950, loss[loss=0.1947, simple_loss=0.2716, pruned_loss=0.05887, over 21227.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3136, pruned_loss=0.089, over 4263450.31 frames. ], batch size: 159, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:39:22,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=827754.0, ans=0.1 2023-06-20 23:39:37,602 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-20 23:39:51,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=827814.0, ans=0.125 2023-06-20 23:39:51,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=827814.0, ans=0.125 2023-06-20 23:39:57,227 INFO [train.py:996] (0/4) Epoch 5, batch 16000, loss[loss=0.2439, simple_loss=0.3206, pruned_loss=0.08365, over 21250.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3133, pruned_loss=0.08581, over 4265879.20 frames. ], batch size: 143, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:40:16,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=827874.0, ans=0.2 2023-06-20 23:40:22,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=827934.0, ans=0.05 2023-06-20 23:40:30,945 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.522e+02 3.012e+02 3.700e+02 7.317e+02, threshold=6.025e+02, percent-clipped=2.0 2023-06-20 23:40:38,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=827994.0, ans=0.125 2023-06-20 23:40:46,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=827994.0, ans=0.125 2023-06-20 23:41:16,182 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. limit=10.0 2023-06-20 23:41:26,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=828114.0, ans=0.125 2023-06-20 23:41:43,641 INFO [train.py:996] (0/4) Epoch 5, batch 16050, loss[loss=0.3383, simple_loss=0.4225, pruned_loss=0.127, over 21510.00 frames. 
], tot_loss[loss=0.2435, simple_loss=0.3176, pruned_loss=0.08473, over 4275534.69 frames. ], batch size: 471, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:42:28,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=828294.0, ans=0.1 2023-06-20 23:42:32,496 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-20 23:43:01,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=828414.0, ans=0.0 2023-06-20 23:43:23,974 INFO [train.py:996] (0/4) Epoch 5, batch 16100, loss[loss=0.2707, simple_loss=0.3249, pruned_loss=0.1082, over 21881.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3203, pruned_loss=0.08607, over 4278140.91 frames. ], batch size: 316, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:43:42,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=828474.0, ans=0.2 2023-06-20 23:43:52,554 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.759e+02 3.248e+02 4.030e+02 6.532e+02, threshold=6.496e+02, percent-clipped=1.0 2023-06-20 23:44:05,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=828594.0, ans=0.0 2023-06-20 23:44:18,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=828594.0, ans=0.0 2023-06-20 23:44:26,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=828654.0, ans=0.125 2023-06-20 23:44:40,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=828714.0, ans=0.1 2023-06-20 23:44:57,683 INFO [train.py:996] (0/4) Epoch 5, batch 16150, loss[loss=0.3035, simple_loss=0.3667, pruned_loss=0.1202, over 21568.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.32, pruned_loss=0.0889, over 4287337.63 frames. ], batch size: 471, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:45:20,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=828774.0, ans=0.1 2023-06-20 23:45:27,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=828834.0, ans=0.125 2023-06-20 23:46:31,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=829014.0, ans=0.1 2023-06-20 23:46:31,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=829014.0, ans=0.1 2023-06-20 23:46:40,203 INFO [train.py:996] (0/4) Epoch 5, batch 16200, loss[loss=0.2322, simple_loss=0.3, pruned_loss=0.08214, over 21587.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3239, pruned_loss=0.09082, over 4290943.76 frames. 
], batch size: 212, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:47:09,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 2.854e+02 3.310e+02 3.979e+02 8.024e+02, threshold=6.619e+02, percent-clipped=1.0 2023-06-20 23:47:13,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=829134.0, ans=0.125 2023-06-20 23:47:23,318 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-20 23:47:27,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=829194.0, ans=10.0 2023-06-20 23:48:01,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=829314.0, ans=0.125 2023-06-20 23:48:02,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=829314.0, ans=0.0 2023-06-20 23:48:19,433 INFO [train.py:996] (0/4) Epoch 5, batch 16250, loss[loss=0.2767, simple_loss=0.3359, pruned_loss=0.1088, over 21455.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.325, pruned_loss=0.09131, over 4285861.72 frames. ], batch size: 509, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:48:46,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.67 vs. limit=15.0 2023-06-20 23:48:48,449 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-20 23:50:03,651 INFO [train.py:996] (0/4) Epoch 5, batch 16300, loss[loss=0.2474, simple_loss=0.3262, pruned_loss=0.08431, over 21547.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3204, pruned_loss=0.0869, over 4281745.56 frames. ], batch size: 230, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:50:05,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=829674.0, ans=0.125 2023-06-20 23:50:12,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=829674.0, ans=0.125 2023-06-20 23:50:29,421 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.495e+02 2.799e+02 3.333e+02 5.849e+02, threshold=5.597e+02, percent-clipped=0.0 2023-06-20 23:50:37,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=829734.0, ans=0.125 2023-06-20 23:51:01,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=829854.0, ans=0.125 2023-06-20 23:51:14,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=829854.0, ans=0.015 2023-06-20 23:51:20,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=829854.0, ans=0.125 2023-06-20 23:51:33,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=829914.0, ans=0.0 2023-06-20 23:51:44,335 INFO [train.py:996] (0/4) Epoch 5, batch 16350, loss[loss=0.2651, simple_loss=0.3765, pruned_loss=0.07685, over 20760.00 frames. 
], tot_loss[loss=0.2503, simple_loss=0.3223, pruned_loss=0.08911, over 4284290.49 frames. ], batch size: 608, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:53:18,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.40 vs. limit=15.0 2023-06-20 23:53:23,364 INFO [train.py:996] (0/4) Epoch 5, batch 16400, loss[loss=0.2679, simple_loss=0.3244, pruned_loss=0.1057, over 21552.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3275, pruned_loss=0.09088, over 4288897.22 frames. ], batch size: 548, lr: 6.20e-03, grad_scale: 32.0 2023-06-20 23:53:35,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-20 23:53:52,420 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-20 23:53:52,855 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 2.889e+02 3.302e+02 3.961e+02 7.962e+02, threshold=6.603e+02, percent-clipped=4.0 2023-06-20 23:54:15,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-06-20 23:54:31,613 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2023-06-20 23:54:53,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=830514.0, ans=0.125 2023-06-20 23:55:02,583 INFO [train.py:996] (0/4) Epoch 5, batch 16450, loss[loss=0.2693, simple_loss=0.3343, pruned_loss=0.1021, over 21847.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3273, pruned_loss=0.09274, over 4292962.81 frames. ], batch size: 414, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:55:32,023 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.37 vs. limit=10.0 2023-06-20 23:55:47,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=830694.0, ans=0.09899494936611666 2023-06-20 23:56:41,427 INFO [train.py:996] (0/4) Epoch 5, batch 16500, loss[loss=0.1822, simple_loss=0.2335, pruned_loss=0.06547, over 21200.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3233, pruned_loss=0.09239, over 4294003.22 frames. ], batch size: 143, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:56:45,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=830874.0, ans=0.125 2023-06-20 23:57:19,062 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.018e+02 3.661e+02 4.243e+02 1.006e+03, threshold=7.323e+02, percent-clipped=9.0 2023-06-20 23:58:08,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=831114.0, ans=0.2 2023-06-20 23:58:23,181 INFO [train.py:996] (0/4) Epoch 5, batch 16550, loss[loss=0.3061, simple_loss=0.4219, pruned_loss=0.09522, over 19846.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3182, pruned_loss=0.08875, over 4284800.88 frames. 
], batch size: 702, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:58:51,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=831234.0, ans=0.5 2023-06-20 23:59:55,663 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.29 vs. limit=22.5 2023-06-21 00:00:14,158 INFO [train.py:996] (0/4) Epoch 5, batch 16600, loss[loss=0.2862, simple_loss=0.3841, pruned_loss=0.09421, over 21847.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3282, pruned_loss=0.09315, over 4283275.17 frames. ], batch size: 282, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:00:34,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=831534.0, ans=0.125 2023-06-21 00:00:41,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=831534.0, ans=0.125 2023-06-21 00:00:42,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.262e+02 3.858e+02 4.542e+02 8.769e+02, threshold=7.716e+02, percent-clipped=2.0 2023-06-21 00:01:13,990 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-21 00:01:57,527 INFO [train.py:996] (0/4) Epoch 5, batch 16650, loss[loss=0.3439, simple_loss=0.4117, pruned_loss=0.1381, over 21838.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3376, pruned_loss=0.09654, over 4276707.52 frames. ], batch size: 118, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:02:29,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=831834.0, ans=0.125 2023-06-21 00:02:33,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=831834.0, ans=0.125 2023-06-21 00:03:03,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=831954.0, ans=0.0 2023-06-21 00:03:15,906 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.14 vs. limit=10.0 2023-06-21 00:03:16,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=831954.0, ans=0.02 2023-06-21 00:03:18,705 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5 2023-06-21 00:03:40,753 INFO [train.py:996] (0/4) Epoch 5, batch 16700, loss[loss=0.2468, simple_loss=0.3433, pruned_loss=0.07511, over 20734.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3387, pruned_loss=0.09685, over 4274492.84 frames. ], batch size: 608, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:04:18,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.321e+02 2.932e+02 3.507e+02 4.315e+02 8.242e+02, threshold=7.013e+02, percent-clipped=1.0 2023-06-21 00:04:25,063 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5 2023-06-21 00:05:30,390 INFO [train.py:996] (0/4) Epoch 5, batch 16750, loss[loss=0.325, simple_loss=0.4034, pruned_loss=0.1234, over 21468.00 frames. 
], tot_loss[loss=0.2681, simple_loss=0.3414, pruned_loss=0.09744, over 4273414.47 frames. ], batch size: 471, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:06:03,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=832434.0, ans=0.125 2023-06-21 00:06:50,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=832554.0, ans=0.1 2023-06-21 00:07:16,843 INFO [train.py:996] (0/4) Epoch 5, batch 16800, loss[loss=0.3123, simple_loss=0.3787, pruned_loss=0.123, over 21605.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3463, pruned_loss=0.09841, over 4271611.26 frames. ], batch size: 471, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:07:17,196 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:07:33,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=832674.0, ans=0.125 2023-06-21 00:07:43,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=832734.0, ans=0.125 2023-06-21 00:07:48,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.643e+02 3.479e+02 3.935e+02 4.857e+02 8.503e+02, threshold=7.870e+02, percent-clipped=2.0 2023-06-21 00:07:55,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=832794.0, ans=0.0 2023-06-21 00:08:16,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=832854.0, ans=0.0 2023-06-21 00:08:22,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=832854.0, ans=0.0 2023-06-21 00:08:55,653 INFO [train.py:996] (0/4) Epoch 5, batch 16850, loss[loss=0.2507, simple_loss=0.3183, pruned_loss=0.09152, over 21918.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.342, pruned_loss=0.09761, over 4277480.02 frames. ], batch size: 107, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:09:25,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-21 00:10:35,991 INFO [train.py:996] (0/4) Epoch 5, batch 16900, loss[loss=0.2016, simple_loss=0.2671, pruned_loss=0.06807, over 21331.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3368, pruned_loss=0.0966, over 4278177.24 frames. ], batch size: 211, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:10:50,134 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-21 00:10:52,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=833274.0, ans=0.2 2023-06-21 00:11:07,330 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.951e+02 3.440e+02 4.010e+02 6.855e+02, threshold=6.879e+02, percent-clipped=0.0 2023-06-21 00:11:08,735 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-21 00:11:57,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. 
limit=22.5 2023-06-21 00:12:09,938 INFO [train.py:996] (0/4) Epoch 5, batch 16950, loss[loss=0.2347, simple_loss=0.2988, pruned_loss=0.08532, over 21314.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3302, pruned_loss=0.09542, over 4281548.15 frames. ], batch size: 176, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:12:43,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=833634.0, ans=0.2 2023-06-21 00:13:59,449 INFO [train.py:996] (0/4) Epoch 5, batch 17000, loss[loss=0.2694, simple_loss=0.3299, pruned_loss=0.1044, over 21923.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3272, pruned_loss=0.09519, over 4289614.05 frames. ], batch size: 414, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:14:06,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.96 vs. limit=10.0 2023-06-21 00:14:09,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=833874.0, ans=0.125 2023-06-21 00:14:27,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 2.869e+02 3.423e+02 4.013e+02 9.065e+02, threshold=6.846e+02, percent-clipped=1.0 2023-06-21 00:14:37,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=833994.0, ans=0.035 2023-06-21 00:14:37,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=833994.0, ans=0.0 2023-06-21 00:15:14,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=834114.0, ans=0.125 2023-06-21 00:15:36,734 INFO [train.py:996] (0/4) Epoch 5, batch 17050, loss[loss=0.2786, simple_loss=0.3623, pruned_loss=0.09739, over 21853.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3331, pruned_loss=0.09672, over 4289064.58 frames. ], batch size: 316, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 00:16:33,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=834354.0, ans=0.125 2023-06-21 00:16:37,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=834354.0, ans=0.2 2023-06-21 00:16:44,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0 2023-06-21 00:17:14,721 INFO [train.py:996] (0/4) Epoch 5, batch 17100, loss[loss=0.2222, simple_loss=0.2876, pruned_loss=0.07839, over 21432.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3325, pruned_loss=0.09783, over 4293109.07 frames. ], batch size: 194, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 00:17:26,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=834474.0, ans=0.025 2023-06-21 00:17:43,353 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.304e+02 3.091e+02 3.634e+02 4.796e+02 1.009e+03, threshold=7.268e+02, percent-clipped=8.0 2023-06-21 00:17:51,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=834594.0, ans=0.2 2023-06-21 00:17:52,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.52 vs. 
limit=15.0 2023-06-21 00:17:59,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=834594.0, ans=0.5 2023-06-21 00:18:03,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=834594.0, ans=0.1 2023-06-21 00:18:53,485 INFO [train.py:996] (0/4) Epoch 5, batch 17150, loss[loss=0.2484, simple_loss=0.3152, pruned_loss=0.09078, over 21873.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3282, pruned_loss=0.09726, over 4296707.08 frames. ], batch size: 118, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 00:18:53,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=834774.0, ans=0.125 2023-06-21 00:19:18,210 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:19:58,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=834954.0, ans=0.125 2023-06-21 00:20:25,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-21 00:20:33,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.55 vs. limit=10.0 2023-06-21 00:20:33,460 INFO [train.py:996] (0/4) Epoch 5, batch 17200, loss[loss=0.255, simple_loss=0.326, pruned_loss=0.09205, over 21881.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3273, pruned_loss=0.09704, over 4292566.04 frames. ], batch size: 371, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:20:42,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=835074.0, ans=0.0 2023-06-21 00:20:56,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=835134.0, ans=0.125 2023-06-21 00:21:11,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=835134.0, ans=0.1 2023-06-21 00:21:12,570 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 2.764e+02 3.023e+02 3.387e+02 5.035e+02, threshold=6.046e+02, percent-clipped=0.0 2023-06-21 00:21:38,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=835254.0, ans=0.125 2023-06-21 00:21:42,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=835254.0, ans=0.0 2023-06-21 00:21:44,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-21 00:22:14,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=835314.0, ans=0.125 2023-06-21 00:22:18,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=835374.0, ans=0.125 2023-06-21 00:22:19,277 INFO [train.py:996] (0/4) Epoch 5, batch 17250, loss[loss=0.2681, simple_loss=0.3449, pruned_loss=0.09566, over 21827.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3316, pruned_loss=0.09942, over 4293440.30 frames. 
], batch size: 247, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:22:47,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=835434.0, ans=0.2 2023-06-21 00:23:03,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=835494.0, ans=0.125 2023-06-21 00:23:46,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=835614.0, ans=0.125 2023-06-21 00:23:46,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=835614.0, ans=0.125 2023-06-21 00:23:57,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=835614.0, ans=0.125 2023-06-21 00:24:02,223 INFO [train.py:996] (0/4) Epoch 5, batch 17300, loss[loss=0.3134, simple_loss=0.379, pruned_loss=0.1239, over 21923.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3396, pruned_loss=0.1028, over 4299254.58 frames. ], batch size: 372, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:24:41,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.738e+02 3.630e+02 4.657e+02 6.212e+02 1.066e+03, threshold=9.314e+02, percent-clipped=26.0 2023-06-21 00:25:19,931 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-06-21 00:25:21,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=835854.0, ans=15.0 2023-06-21 00:25:48,518 INFO [train.py:996] (0/4) Epoch 5, batch 17350, loss[loss=0.2115, simple_loss=0.3041, pruned_loss=0.05947, over 21802.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3395, pruned_loss=0.1019, over 4288846.92 frames. ], batch size: 316, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:26:30,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=836094.0, ans=0.2 2023-06-21 00:26:40,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=836094.0, ans=0.125 2023-06-21 00:26:58,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=836154.0, ans=0.125 2023-06-21 00:27:13,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=836214.0, ans=0.04949747468305833 2023-06-21 00:27:29,139 INFO [train.py:996] (0/4) Epoch 5, batch 17400, loss[loss=0.2031, simple_loss=0.2547, pruned_loss=0.07572, over 21361.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3339, pruned_loss=0.09687, over 4279106.48 frames. 
], batch size: 159, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:27:39,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=836274.0, ans=0.125 2023-06-21 00:27:49,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=836334.0, ans=0.0 2023-06-21 00:28:10,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 2.783e+02 3.227e+02 3.615e+02 5.491e+02, threshold=6.454e+02, percent-clipped=0.0 2023-06-21 00:28:12,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=836394.0, ans=0.125 2023-06-21 00:28:34,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=836454.0, ans=0.0 2023-06-21 00:28:50,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=836454.0, ans=0.125 2023-06-21 00:29:16,102 INFO [train.py:996] (0/4) Epoch 5, batch 17450, loss[loss=0.1999, simple_loss=0.2827, pruned_loss=0.05855, over 21752.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3324, pruned_loss=0.09452, over 4280673.68 frames. ], batch size: 282, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 00:29:26,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=836574.0, ans=0.125 2023-06-21 00:29:42,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=836634.0, ans=0.2 2023-06-21 00:31:00,464 INFO [train.py:996] (0/4) Epoch 5, batch 17500, loss[loss=0.2793, simple_loss=0.3299, pruned_loss=0.1144, over 21790.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3283, pruned_loss=0.09211, over 4282770.47 frames. ], batch size: 441, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 00:31:10,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=836874.0, ans=0.0 2023-06-21 00:31:13,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=836874.0, ans=0.125 2023-06-21 00:31:19,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=836934.0, ans=0.1 2023-06-21 00:31:34,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.759e+02 3.126e+02 4.015e+02 6.726e+02, threshold=6.252e+02, percent-clipped=1.0 2023-06-21 00:31:52,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=837054.0, ans=0.125 2023-06-21 00:32:08,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=837054.0, ans=0.0 2023-06-21 00:32:24,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=837114.0, ans=0.125 2023-06-21 00:32:31,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=837174.0, ans=0.0 2023-06-21 00:32:32,779 INFO [train.py:996] (0/4) Epoch 5, batch 17550, loss[loss=0.2147, simple_loss=0.307, pruned_loss=0.06119, over 21838.00 frames. 
], tot_loss[loss=0.2544, simple_loss=0.3277, pruned_loss=0.09056, over 4285848.61 frames. ], batch size: 118, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 00:33:25,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=837294.0, ans=0.2 2023-06-21 00:33:45,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=837354.0, ans=0.2 2023-06-21 00:33:49,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=837354.0, ans=0.125 2023-06-21 00:34:02,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=837414.0, ans=0.2 2023-06-21 00:34:18,779 INFO [train.py:996] (0/4) Epoch 5, batch 17600, loss[loss=0.2561, simple_loss=0.3305, pruned_loss=0.09082, over 21405.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3292, pruned_loss=0.09049, over 4272613.49 frames. ], batch size: 549, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:34:22,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=837474.0, ans=0.0 2023-06-21 00:34:27,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=837474.0, ans=0.2 2023-06-21 00:34:46,815 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-21 00:34:53,984 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.862e+02 3.527e+02 4.406e+02 6.176e+02, threshold=7.053e+02, percent-clipped=0.0 2023-06-21 00:35:16,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=837654.0, ans=0.125 2023-06-21 00:35:18,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=837654.0, ans=0.2 2023-06-21 00:35:52,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=837774.0, ans=0.2 2023-06-21 00:35:59,204 INFO [train.py:996] (0/4) Epoch 5, batch 17650, loss[loss=0.2323, simple_loss=0.313, pruned_loss=0.07586, over 21693.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3261, pruned_loss=0.0906, over 4274346.97 frames. 
], batch size: 415, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:35:59,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=837774.0, ans=0.125 2023-06-21 00:36:11,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=837774.0, ans=0.0 2023-06-21 00:36:26,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=837834.0, ans=0.125 2023-06-21 00:36:31,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=837834.0, ans=0.0 2023-06-21 00:36:31,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=837834.0, ans=0.0 2023-06-21 00:36:36,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=837894.0, ans=0.1 2023-06-21 00:36:36,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=837894.0, ans=0.2 2023-06-21 00:37:24,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=838014.0, ans=0.0 2023-06-21 00:37:42,013 INFO [train.py:996] (0/4) Epoch 5, batch 17700, loss[loss=0.2258, simple_loss=0.3037, pruned_loss=0.07395, over 21276.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3222, pruned_loss=0.088, over 4274066.39 frames. ], batch size: 176, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:37:45,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=838074.0, ans=0.2 2023-06-21 00:38:17,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 2.950e+02 3.482e+02 4.668e+02 9.100e+02, threshold=6.963e+02, percent-clipped=4.0 2023-06-21 00:38:41,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=838194.0, ans=0.0 2023-06-21 00:39:13,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=838314.0, ans=0.0 2023-06-21 00:39:17,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=12.0 2023-06-21 00:39:21,500 INFO [train.py:996] (0/4) Epoch 5, batch 17750, loss[loss=0.3125, simple_loss=0.377, pruned_loss=0.124, over 21677.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3313, pruned_loss=0.09205, over 4278857.07 frames. 
], batch size: 351, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:39:36,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=838374.0, ans=0.125 2023-06-21 00:39:36,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=838374.0, ans=0.2 2023-06-21 00:40:16,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=838494.0, ans=0.0 2023-06-21 00:40:21,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=838494.0, ans=0.07 2023-06-21 00:40:41,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=838554.0, ans=0.125 2023-06-21 00:41:07,674 INFO [train.py:996] (0/4) Epoch 5, batch 17800, loss[loss=0.244, simple_loss=0.3271, pruned_loss=0.08045, over 20111.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3298, pruned_loss=0.09066, over 4269308.56 frames. ], batch size: 702, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:41:22,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=15.0 2023-06-21 00:41:49,221 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.927e+02 3.424e+02 3.955e+02 9.585e+02, threshold=6.848e+02, percent-clipped=3.0 2023-06-21 00:42:17,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=838854.0, ans=0.125 2023-06-21 00:42:23,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=838854.0, ans=0.05 2023-06-21 00:42:38,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-21 00:42:49,092 INFO [train.py:996] (0/4) Epoch 5, batch 17850, loss[loss=0.2809, simple_loss=0.3437, pruned_loss=0.109, over 21731.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.33, pruned_loss=0.09162, over 4270422.46 frames. ], batch size: 298, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:42:55,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0 2023-06-21 00:44:24,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=839214.0, ans=0.95 2023-06-21 00:44:25,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=839214.0, ans=0.05 2023-06-21 00:44:29,939 INFO [train.py:996] (0/4) Epoch 5, batch 17900, loss[loss=0.2723, simple_loss=0.3594, pruned_loss=0.09261, over 21709.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3345, pruned_loss=0.09334, over 4272679.75 frames. ], batch size: 298, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:45:19,614 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.205e+02 2.900e+02 3.378e+02 3.906e+02 6.654e+02, threshold=6.756e+02, percent-clipped=0.0 2023-06-21 00:46:22,399 INFO [train.py:996] (0/4) Epoch 5, batch 17950, loss[loss=0.2506, simple_loss=0.3588, pruned_loss=0.07124, over 21192.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3366, pruned_loss=0.09146, over 4270234.74 frames. 
], batch size: 548, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:47:22,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=839754.0, ans=0.125 2023-06-21 00:48:01,523 INFO [train.py:996] (0/4) Epoch 5, batch 18000, loss[loss=0.2467, simple_loss=0.3042, pruned_loss=0.09462, over 21748.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3291, pruned_loss=0.08962, over 4269424.37 frames. ], batch size: 317, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:48:01,524 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 00:48:17,786 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2664, simple_loss=0.3658, pruned_loss=0.08353, over 1796401.00 frames. 2023-06-21 00:48:17,787 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 00:48:39,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=839934.0, ans=0.125 2023-06-21 00:48:59,515 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-140000.pt 2023-06-21 00:49:01,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=839994.0, ans=0.05 2023-06-21 00:49:02,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.602e+02 3.109e+02 3.503e+02 6.028e+02, threshold=6.218e+02, percent-clipped=0.0 2023-06-21 00:49:07,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=839994.0, ans=0.125 2023-06-21 00:49:41,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-21 00:49:58,603 INFO [train.py:996] (0/4) Epoch 5, batch 18050, loss[loss=0.2577, simple_loss=0.3145, pruned_loss=0.1005, over 21655.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3235, pruned_loss=0.08917, over 4271635.50 frames. ], batch size: 247, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:50:26,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=12.0 2023-06-21 00:50:40,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=840294.0, ans=0.125 2023-06-21 00:50:46,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=840294.0, ans=0.025 2023-06-21 00:51:14,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=840354.0, ans=0.125 2023-06-21 00:51:22,639 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-21 00:51:39,042 INFO [train.py:996] (0/4) Epoch 5, batch 18100, loss[loss=0.2377, simple_loss=0.3385, pruned_loss=0.06846, over 21555.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3279, pruned_loss=0.09102, over 4263514.24 frames. 
], batch size: 230, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:51:47,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=840474.0, ans=0.1 2023-06-21 00:51:49,461 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:52:27,304 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.902e+02 3.495e+02 4.106e+02 8.308e+02, threshold=6.990e+02, percent-clipped=1.0 2023-06-21 00:52:48,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=840654.0, ans=0.125 2023-06-21 00:53:11,224 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.25 vs. limit=10.0 2023-06-21 00:53:22,813 INFO [train.py:996] (0/4) Epoch 5, batch 18150, loss[loss=0.2456, simple_loss=0.3128, pruned_loss=0.08923, over 21590.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.329, pruned_loss=0.09131, over 4267118.59 frames. ], batch size: 414, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:53:43,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-21 00:53:54,213 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:54:27,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=840954.0, ans=0.1 2023-06-21 00:54:38,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=841014.0, ans=0.0 2023-06-21 00:54:54,873 INFO [train.py:996] (0/4) Epoch 5, batch 18200, loss[loss=0.2197, simple_loss=0.2883, pruned_loss=0.07551, over 21784.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3251, pruned_loss=0.09118, over 4254663.35 frames. ], batch size: 102, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:55:14,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=841074.0, ans=0.0 2023-06-21 00:55:22,452 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:55:23,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=841134.0, ans=0.0 2023-06-21 00:55:37,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.776e+02 3.291e+02 4.569e+02 1.152e+03, threshold=6.583e+02, percent-clipped=3.0 2023-06-21 00:55:39,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=841194.0, ans=0.1 2023-06-21 00:56:32,321 INFO [train.py:996] (0/4) Epoch 5, batch 18250, loss[loss=0.2087, simple_loss=0.2754, pruned_loss=0.07099, over 21664.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3173, pruned_loss=0.08752, over 4250683.38 frames. 
], batch size: 298, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:57:04,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=841434.0, ans=0.125 2023-06-21 00:57:05,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=841434.0, ans=0.125 2023-06-21 00:57:13,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=841494.0, ans=0.2 2023-06-21 00:57:16,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=841494.0, ans=0.0 2023-06-21 00:57:53,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-06-21 00:58:05,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=841614.0, ans=0.2 2023-06-21 00:58:11,266 INFO [train.py:996] (0/4) Epoch 5, batch 18300, loss[loss=0.2867, simple_loss=0.3678, pruned_loss=0.1028, over 21711.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3192, pruned_loss=0.08905, over 4254909.93 frames. ], batch size: 441, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:58:53,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=841794.0, ans=0.125 2023-06-21 00:58:54,880 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.755e+02 2.809e+02 3.144e+02 3.817e+02 6.593e+02, threshold=6.288e+02, percent-clipped=1.0 2023-06-21 00:59:00,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=841794.0, ans=0.1 2023-06-21 00:59:41,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=15.0 2023-06-21 00:59:49,954 INFO [train.py:996] (0/4) Epoch 5, batch 18350, loss[loss=0.2432, simple_loss=0.3245, pruned_loss=0.08093, over 21252.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3232, pruned_loss=0.08884, over 4253521.96 frames. ], batch size: 548, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 01:00:50,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=842154.0, ans=0.125 2023-06-21 01:01:11,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=842154.0, ans=0.1 2023-06-21 01:01:19,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=842214.0, ans=0.1 2023-06-21 01:01:30,187 INFO [train.py:996] (0/4) Epoch 5, batch 18400, loss[loss=0.2674, simple_loss=0.316, pruned_loss=0.1095, over 21188.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3181, pruned_loss=0.08733, over 4254437.46 frames. ], batch size: 143, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:01:32,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=842274.0, ans=0.1 2023-06-21 01:02:02,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.02 vs. 
limit=15.0 2023-06-21 01:02:14,611 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.055e+02 2.972e+02 3.476e+02 4.424e+02 9.442e+02, threshold=6.951e+02, percent-clipped=6.0 2023-06-21 01:02:27,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=842394.0, ans=0.015 2023-06-21 01:02:51,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=842454.0, ans=0.0 2023-06-21 01:03:10,012 INFO [train.py:996] (0/4) Epoch 5, batch 18450, loss[loss=0.2191, simple_loss=0.3081, pruned_loss=0.06499, over 21596.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3164, pruned_loss=0.08446, over 4259868.90 frames. ], batch size: 442, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:03:34,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=842634.0, ans=0.125 2023-06-21 01:03:44,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=22.5 2023-06-21 01:03:48,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=842634.0, ans=0.0 2023-06-21 01:03:59,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=842694.0, ans=0.1 2023-06-21 01:04:35,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=842814.0, ans=0.2 2023-06-21 01:04:47,261 INFO [train.py:996] (0/4) Epoch 5, batch 18500, loss[loss=0.2088, simple_loss=0.2746, pruned_loss=0.07155, over 21513.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3109, pruned_loss=0.08323, over 4254694.05 frames. ], batch size: 230, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:05:28,259 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=22.5 2023-06-21 01:05:30,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.509e+02 2.864e+02 3.266e+02 4.867e+02, threshold=5.728e+02, percent-clipped=0.0 2023-06-21 01:05:31,891 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.63 vs. limit=15.0 2023-06-21 01:05:42,548 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=12.0 2023-06-21 01:05:56,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=843054.0, ans=0.125 2023-06-21 01:06:26,377 INFO [train.py:996] (0/4) Epoch 5, batch 18550, loss[loss=0.2327, simple_loss=0.2913, pruned_loss=0.08702, over 21929.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3079, pruned_loss=0.08162, over 4254614.03 frames. ], batch size: 113, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:07:49,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=843414.0, ans=0.04949747468305833 2023-06-21 01:08:06,352 INFO [train.py:996] (0/4) Epoch 5, batch 18600, loss[loss=0.2124, simple_loss=0.2867, pruned_loss=0.06907, over 15464.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3057, pruned_loss=0.08144, over 4253762.51 frames. 
], batch size: 60, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:08:32,109 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:08:33,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=843534.0, ans=0.2 2023-06-21 01:08:44,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.99 vs. limit=10.0 2023-06-21 01:08:49,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 2.765e+02 3.271e+02 3.896e+02 6.265e+02, threshold=6.542e+02, percent-clipped=2.0 2023-06-21 01:09:37,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=843714.0, ans=0.0 2023-06-21 01:09:40,835 INFO [train.py:996] (0/4) Epoch 5, batch 18650, loss[loss=0.2174, simple_loss=0.308, pruned_loss=0.06339, over 20785.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.306, pruned_loss=0.08262, over 4251643.86 frames. ], batch size: 608, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:10:26,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=843894.0, ans=0.125 2023-06-21 01:10:47,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=843954.0, ans=0.2 2023-06-21 01:10:57,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=844014.0, ans=0.0 2023-06-21 01:11:08,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=844014.0, ans=0.0 2023-06-21 01:11:13,530 INFO [train.py:996] (0/4) Epoch 5, batch 18700, loss[loss=0.2555, simple_loss=0.3127, pruned_loss=0.09915, over 14516.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3037, pruned_loss=0.08372, over 4257346.69 frames. ], batch size: 60, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:11:25,095 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:11:25,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=844074.0, ans=0.0 2023-06-21 01:11:56,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.719e+02 3.161e+02 4.088e+02 6.146e+02, threshold=6.321e+02, percent-clipped=0.0 2023-06-21 01:12:50,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=844314.0, ans=0.125 2023-06-21 01:12:52,650 INFO [train.py:996] (0/4) Epoch 5, batch 18750, loss[loss=0.2242, simple_loss=0.2872, pruned_loss=0.08063, over 21290.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3074, pruned_loss=0.08688, over 4267133.51 frames. ], batch size: 176, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:13:27,734 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.61 vs. 
limit=22.5 2023-06-21 01:13:28,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=844434.0, ans=0.2 2023-06-21 01:13:46,073 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:14:01,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=844554.0, ans=0.0 2023-06-21 01:14:11,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=844554.0, ans=0.0 2023-06-21 01:14:12,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=844554.0, ans=0.035 2023-06-21 01:14:30,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-06-21 01:14:32,831 INFO [train.py:996] (0/4) Epoch 5, batch 18800, loss[loss=0.2971, simple_loss=0.3768, pruned_loss=0.1087, over 20797.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3144, pruned_loss=0.08931, over 4259434.98 frames. ], batch size: 607, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:14:39,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=844674.0, ans=0.05 2023-06-21 01:14:40,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.35 vs. limit=10.0 2023-06-21 01:14:42,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=844674.0, ans=0.0 2023-06-21 01:15:00,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=844734.0, ans=0.125 2023-06-21 01:15:11,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.115e+02 3.803e+02 4.953e+02 7.292e+02, threshold=7.607e+02, percent-clipped=7.0 2023-06-21 01:15:29,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=844854.0, ans=0.04949747468305833 2023-06-21 01:15:55,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-21 01:16:07,902 INFO [train.py:996] (0/4) Epoch 5, batch 18850, loss[loss=0.197, simple_loss=0.267, pruned_loss=0.06352, over 21504.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3099, pruned_loss=0.08475, over 4255650.30 frames. ], batch size: 230, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:16:12,081 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.46 vs. limit=6.0 2023-06-21 01:17:46,580 INFO [train.py:996] (0/4) Epoch 5, batch 18900, loss[loss=0.2516, simple_loss=0.3081, pruned_loss=0.09755, over 21221.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.306, pruned_loss=0.08336, over 4256623.42 frames. 
], batch size: 159, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:18:24,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=845334.0, ans=0.1 2023-06-21 01:18:24,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=845334.0, ans=0.05 2023-06-21 01:18:31,690 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.559e+02 2.920e+02 3.727e+02 6.054e+02, threshold=5.840e+02, percent-clipped=0.0 2023-06-21 01:18:45,767 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-21 01:19:19,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=845514.0, ans=0.125 2023-06-21 01:19:27,658 INFO [train.py:996] (0/4) Epoch 5, batch 18950, loss[loss=0.2754, simple_loss=0.377, pruned_loss=0.08687, over 21736.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3063, pruned_loss=0.08509, over 4269740.06 frames. ], batch size: 414, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:20:42,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-21 01:21:02,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=845814.0, ans=0.125 2023-06-21 01:21:04,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=845814.0, ans=0.1 2023-06-21 01:21:08,463 INFO [train.py:996] (0/4) Epoch 5, batch 19000, loss[loss=0.2878, simple_loss=0.3737, pruned_loss=0.101, over 21706.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.315, pruned_loss=0.0864, over 4271377.92 frames. ], batch size: 416, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:21:33,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=845934.0, ans=0.2 2023-06-21 01:21:53,552 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.876e+02 3.456e+02 4.188e+02 7.110e+02, threshold=6.912e+02, percent-clipped=2.0 2023-06-21 01:22:36,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=846114.0, ans=0.125 2023-06-21 01:22:39,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=846114.0, ans=0.0 2023-06-21 01:22:47,104 INFO [train.py:996] (0/4) Epoch 5, batch 19050, loss[loss=0.2677, simple_loss=0.3263, pruned_loss=0.1045, over 21941.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3204, pruned_loss=0.09056, over 4280575.15 frames. 
], batch size: 316, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:22:47,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=846174.0, ans=0.2 2023-06-21 01:23:26,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=846234.0, ans=0.2 2023-06-21 01:23:58,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=846354.0, ans=0.025 2023-06-21 01:24:07,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=846354.0, ans=0.0 2023-06-21 01:24:30,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=846474.0, ans=0.0 2023-06-21 01:24:31,391 INFO [train.py:996] (0/4) Epoch 5, batch 19100, loss[loss=0.2304, simple_loss=0.2932, pruned_loss=0.08375, over 21812.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3194, pruned_loss=0.09217, over 4275217.56 frames. ], batch size: 371, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:24:58,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=846534.0, ans=0.125 2023-06-21 01:25:14,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=846594.0, ans=0.125 2023-06-21 01:25:22,026 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 2.859e+02 3.382e+02 4.111e+02 6.618e+02, threshold=6.763e+02, percent-clipped=0.0 2023-06-21 01:26:17,786 INFO [train.py:996] (0/4) Epoch 5, batch 19150, loss[loss=0.2586, simple_loss=0.3271, pruned_loss=0.09505, over 21372.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3201, pruned_loss=0.09274, over 4272053.01 frames. ], batch size: 194, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:26:46,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=846834.0, ans=0.125 2023-06-21 01:27:22,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=846954.0, ans=0.0 2023-06-21 01:27:25,197 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:28:00,638 INFO [train.py:996] (0/4) Epoch 5, batch 19200, loss[loss=0.2647, simple_loss=0.3616, pruned_loss=0.08392, over 21790.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3304, pruned_loss=0.09319, over 4272634.28 frames. 
], batch size: 332, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 01:28:25,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=847134.0, ans=0.2 2023-06-21 01:28:33,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=847134.0, ans=0.5 2023-06-21 01:28:47,828 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.827e+02 3.204e+02 4.140e+02 7.071e+02, threshold=6.408e+02, percent-clipped=1.0 2023-06-21 01:28:48,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=847194.0, ans=0.125 2023-06-21 01:28:48,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=847194.0, ans=0.125 2023-06-21 01:29:11,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=847254.0, ans=0.0 2023-06-21 01:29:30,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=847314.0, ans=0.0 2023-06-21 01:29:38,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=847314.0, ans=0.125 2023-06-21 01:29:41,556 INFO [train.py:996] (0/4) Epoch 5, batch 19250, loss[loss=0.2044, simple_loss=0.2746, pruned_loss=0.06714, over 21427.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3278, pruned_loss=0.08682, over 4278655.26 frames. ], batch size: 131, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:30:09,681 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:30:14,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=847434.0, ans=0.05 2023-06-21 01:30:30,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=847494.0, ans=0.1 2023-06-21 01:31:20,364 INFO [train.py:996] (0/4) Epoch 5, batch 19300, loss[loss=0.2051, simple_loss=0.3041, pruned_loss=0.053, over 20827.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3264, pruned_loss=0.08654, over 4281263.85 frames. ], batch size: 608, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:31:48,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=847734.0, ans=0.0 2023-06-21 01:31:56,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.63 vs. limit=15.0 2023-06-21 01:32:07,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.704e+02 3.211e+02 3.924e+02 6.818e+02, threshold=6.422e+02, percent-clipped=2.0 2023-06-21 01:32:10,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.38 vs. 
limit=22.5 2023-06-21 01:32:23,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=847854.0, ans=0.07 2023-06-21 01:32:39,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=847854.0, ans=0.125 2023-06-21 01:33:01,227 INFO [train.py:996] (0/4) Epoch 5, batch 19350, loss[loss=0.2102, simple_loss=0.2912, pruned_loss=0.06462, over 21747.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3216, pruned_loss=0.08283, over 4285272.59 frames. ], batch size: 282, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:33:58,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=848154.0, ans=0.125 2023-06-21 01:34:39,465 INFO [train.py:996] (0/4) Epoch 5, batch 19400, loss[loss=0.2367, simple_loss=0.3034, pruned_loss=0.08499, over 21639.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3181, pruned_loss=0.08148, over 4289170.35 frames. ], batch size: 263, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:35:08,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=848334.0, ans=0.125 2023-06-21 01:35:17,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=848334.0, ans=15.0 2023-06-21 01:35:25,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.765e+02 3.064e+02 3.598e+02 5.687e+02, threshold=6.129e+02, percent-clipped=0.0 2023-06-21 01:35:56,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=848454.0, ans=0.0 2023-06-21 01:36:22,722 INFO [train.py:996] (0/4) Epoch 5, batch 19450, loss[loss=0.2422, simple_loss=0.3002, pruned_loss=0.09206, over 21859.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3162, pruned_loss=0.08458, over 4296347.91 frames. ], batch size: 373, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:36:29,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=848574.0, ans=0.125 2023-06-21 01:36:55,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=848634.0, ans=0.125 2023-06-21 01:36:57,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=848634.0, ans=0.0 2023-06-21 01:36:57,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=848634.0, ans=0.125 2023-06-21 01:37:18,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=848754.0, ans=0.1 2023-06-21 01:37:41,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=848814.0, ans=0.125 2023-06-21 01:37:58,172 INFO [train.py:996] (0/4) Epoch 5, batch 19500, loss[loss=0.2147, simple_loss=0.2719, pruned_loss=0.07869, over 21234.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3126, pruned_loss=0.08586, over 4298140.95 frames. 
], batch size: 159, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:37:58,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=848874.0, ans=0.0 2023-06-21 01:38:44,702 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 2.810e+02 3.330e+02 3.940e+02 7.380e+02, threshold=6.661e+02, percent-clipped=6.0 2023-06-21 01:39:09,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2023-06-21 01:39:16,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=849054.0, ans=0.125 2023-06-21 01:39:36,273 INFO [train.py:996] (0/4) Epoch 5, batch 19550, loss[loss=0.3067, simple_loss=0.3559, pruned_loss=0.1287, over 20035.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.308, pruned_loss=0.08467, over 4287070.17 frames. ], batch size: 702, lr: 6.13e-03, grad_scale: 16.0 2023-06-21 01:40:06,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=849234.0, ans=0.0 2023-06-21 01:40:13,158 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:40:59,589 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0 2023-06-21 01:41:00,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=849414.0, ans=0.0 2023-06-21 01:41:06,251 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.35 vs. limit=10.0 2023-06-21 01:41:19,074 INFO [train.py:996] (0/4) Epoch 5, batch 19600, loss[loss=0.2263, simple_loss=0.2911, pruned_loss=0.08072, over 21171.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3106, pruned_loss=0.08565, over 4293503.53 frames. ], batch size: 608, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:41:21,484 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:41:37,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=849474.0, ans=0.1 2023-06-21 01:41:59,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=849594.0, ans=0.125 2023-06-21 01:42:00,756 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 3.061e+02 3.495e+02 4.046e+02 6.477e+02, threshold=6.990e+02, percent-clipped=0.0 2023-06-21 01:42:55,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.47 vs. limit=10.0 2023-06-21 01:42:57,963 INFO [train.py:996] (0/4) Epoch 5, batch 19650, loss[loss=0.3123, simple_loss=0.3592, pruned_loss=0.1327, over 21552.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3178, pruned_loss=0.09099, over 4294800.26 frames. ], batch size: 471, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:43:15,386 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.92 vs. 
limit=10.0 2023-06-21 01:44:46,874 INFO [train.py:996] (0/4) Epoch 5, batch 19700, loss[loss=0.2939, simple_loss=0.3813, pruned_loss=0.1033, over 21479.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3213, pruned_loss=0.09186, over 4285035.38 frames. ], batch size: 471, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:45:34,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 2.950e+02 3.404e+02 4.157e+02 1.102e+03, threshold=6.808e+02, percent-clipped=4.0 2023-06-21 01:45:35,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.01 vs. limit=22.5 2023-06-21 01:46:10,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=850314.0, ans=0.1 2023-06-21 01:46:23,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=850314.0, ans=0.0 2023-06-21 01:46:27,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-06-21 01:46:27,934 INFO [train.py:996] (0/4) Epoch 5, batch 19750, loss[loss=0.2912, simple_loss=0.3767, pruned_loss=0.1028, over 21699.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3298, pruned_loss=0.09241, over 4280460.67 frames. ], batch size: 389, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:47:00,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=850434.0, ans=0.1 2023-06-21 01:47:30,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=850554.0, ans=0.2 2023-06-21 01:47:37,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. limit=6.0 2023-06-21 01:48:01,029 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:48:06,773 INFO [train.py:996] (0/4) Epoch 5, batch 19800, loss[loss=0.2511, simple_loss=0.3109, pruned_loss=0.09559, over 21796.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3301, pruned_loss=0.09266, over 4288080.95 frames. ], batch size: 112, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:48:17,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.37 vs. 
limit=15.0 2023-06-21 01:48:49,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=850794.0, ans=0.125 2023-06-21 01:48:54,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 3.157e+02 4.058e+02 5.975e+02 1.111e+03, threshold=8.116e+02, percent-clipped=16.0 2023-06-21 01:49:20,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=850854.0, ans=0.0 2023-06-21 01:49:34,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=850914.0, ans=0.125 2023-06-21 01:49:39,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=850914.0, ans=0.0 2023-06-21 01:49:44,285 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-21 01:49:52,599 INFO [train.py:996] (0/4) Epoch 5, batch 19850, loss[loss=0.2474, simple_loss=0.3227, pruned_loss=0.086, over 21461.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3226, pruned_loss=0.08777, over 4288327.39 frames. ], batch size: 194, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:49:56,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=850974.0, ans=0.2 2023-06-21 01:50:30,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-21 01:50:32,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=851094.0, ans=0.1 2023-06-21 01:50:52,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=851094.0, ans=0.5 2023-06-21 01:51:29,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=851214.0, ans=0.2 2023-06-21 01:51:32,079 INFO [train.py:996] (0/4) Epoch 5, batch 19900, loss[loss=0.2076, simple_loss=0.2845, pruned_loss=0.0654, over 21601.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3224, pruned_loss=0.08493, over 4282536.59 frames. ], batch size: 247, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:52:00,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=851334.0, ans=0.125 2023-06-21 01:52:18,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=851394.0, ans=0.125 2023-06-21 01:52:19,479 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.633e+02 2.856e+02 3.289e+02 5.435e+02, threshold=5.712e+02, percent-clipped=0.0 2023-06-21 01:52:26,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-21 01:52:29,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. 
limit=12.0 2023-06-21 01:53:07,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=851574.0, ans=0.04949747468305833 2023-06-21 01:53:08,177 INFO [train.py:996] (0/4) Epoch 5, batch 19950, loss[loss=0.2605, simple_loss=0.3401, pruned_loss=0.09042, over 20698.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3168, pruned_loss=0.08489, over 4281120.56 frames. ], batch size: 607, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:53:23,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=851574.0, ans=0.0 2023-06-21 01:53:26,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=851574.0, ans=0.125 2023-06-21 01:53:53,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=851694.0, ans=0.0 2023-06-21 01:54:08,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=851694.0, ans=0.0 2023-06-21 01:54:46,777 INFO [train.py:996] (0/4) Epoch 5, batch 20000, loss[loss=0.2569, simple_loss=0.3274, pruned_loss=0.09315, over 21511.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3177, pruned_loss=0.08551, over 4284998.07 frames. ], batch size: 548, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 01:55:13,631 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.38 vs. limit=15.0 2023-06-21 01:55:38,164 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.174e+02 3.672e+02 4.869e+02 7.405e+02, threshold=7.343e+02, percent-clipped=12.0 2023-06-21 01:55:38,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=851994.0, ans=0.0 2023-06-21 01:55:42,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=851994.0, ans=0.0 2023-06-21 01:56:25,474 INFO [train.py:996] (0/4) Epoch 5, batch 20050, loss[loss=0.2798, simple_loss=0.3429, pruned_loss=0.1084, over 21857.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3186, pruned_loss=0.0876, over 4279173.73 frames. ], batch size: 351, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 01:58:11,254 INFO [train.py:996] (0/4) Epoch 5, batch 20100, loss[loss=0.2856, simple_loss=0.3727, pruned_loss=0.09925, over 21783.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3209, pruned_loss=0.09006, over 4287400.54 frames. 
], batch size: 282, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 01:58:11,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=852474.0, ans=0.125 2023-06-21 01:58:16,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=852474.0, ans=0.125 2023-06-21 01:58:58,326 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 2.825e+02 3.167e+02 3.914e+02 6.858e+02, threshold=6.334e+02, percent-clipped=0.0 2023-06-21 01:59:21,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=852654.0, ans=0.2 2023-06-21 01:59:39,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=852714.0, ans=0.125 2023-06-21 01:59:51,890 INFO [train.py:996] (0/4) Epoch 5, batch 20150, loss[loss=0.2775, simple_loss=0.3446, pruned_loss=0.1052, over 21373.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3324, pruned_loss=0.09401, over 4288067.86 frames. ], batch size: 549, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 02:00:09,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=852774.0, ans=0.125 2023-06-21 02:00:54,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=852894.0, ans=0.0 2023-06-21 02:01:44,380 INFO [train.py:996] (0/4) Epoch 5, batch 20200, loss[loss=0.2538, simple_loss=0.3341, pruned_loss=0.08669, over 21655.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3393, pruned_loss=0.0983, over 4286991.56 frames. ], batch size: 263, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:02:28,619 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.103e+02 3.828e+02 4.837e+02 8.948e+02, threshold=7.656e+02, percent-clipped=6.0 2023-06-21 02:02:49,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-21 02:02:53,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=853254.0, ans=0.0 2023-06-21 02:02:55,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=15.0 2023-06-21 02:03:21,078 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-21 02:03:24,909 INFO [train.py:996] (0/4) Epoch 5, batch 20250, loss[loss=0.227, simple_loss=0.3031, pruned_loss=0.07547, over 21660.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3398, pruned_loss=0.09675, over 4291820.81 frames. ], batch size: 230, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:03:33,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=853374.0, ans=0.125 2023-06-21 02:03:45,015 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. 
limit=12.0 2023-06-21 02:04:10,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=853494.0, ans=0.125 2023-06-21 02:04:23,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=853554.0, ans=0.2 2023-06-21 02:04:26,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=853554.0, ans=0.0 2023-06-21 02:05:04,421 INFO [train.py:996] (0/4) Epoch 5, batch 20300, loss[loss=0.2546, simple_loss=0.3273, pruned_loss=0.09094, over 21357.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3368, pruned_loss=0.09312, over 4290585.54 frames. ], batch size: 211, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:05:14,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=853674.0, ans=0.0 2023-06-21 02:05:14,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=853674.0, ans=0.2 2023-06-21 02:05:20,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=853734.0, ans=0.2 2023-06-21 02:05:20,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=853734.0, ans=0.0 2023-06-21 02:05:23,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=853734.0, ans=0.0 2023-06-21 02:05:51,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.734e+02 3.068e+02 3.713e+02 6.256e+02, threshold=6.135e+02, percent-clipped=0.0 2023-06-21 02:06:28,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-06-21 02:06:41,905 INFO [train.py:996] (0/4) Epoch 5, batch 20350, loss[loss=0.2801, simple_loss=0.344, pruned_loss=0.1081, over 21616.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3367, pruned_loss=0.09332, over 4278433.67 frames. ], batch size: 263, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:07:43,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=854154.0, ans=0.125 2023-06-21 02:08:08,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=854214.0, ans=0.125 2023-06-21 02:08:17,293 INFO [train.py:996] (0/4) Epoch 5, batch 20400, loss[loss=0.2178, simple_loss=0.2846, pruned_loss=0.07544, over 17250.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3395, pruned_loss=0.09634, over 4261832.99 frames. ], batch size: 62, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 02:08:18,414 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-21 02:08:26,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=854274.0, ans=0.1 2023-06-21 02:08:39,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. 
limit=15.0 2023-06-21 02:08:42,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=854334.0, ans=0.125 2023-06-21 02:08:58,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=854394.0, ans=0.1 2023-06-21 02:09:00,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=854394.0, ans=0.2 2023-06-21 02:09:05,785 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 3.145e+02 3.695e+02 4.616e+02 6.973e+02, threshold=7.390e+02, percent-clipped=6.0 2023-06-21 02:09:17,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=854454.0, ans=0.0 2023-06-21 02:09:56,718 INFO [train.py:996] (0/4) Epoch 5, batch 20450, loss[loss=0.2757, simple_loss=0.3346, pruned_loss=0.1084, over 21827.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3411, pruned_loss=0.1001, over 4261468.13 frames. ], batch size: 414, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:10:08,625 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-21 02:10:27,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-21 02:10:31,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=854694.0, ans=0.0 2023-06-21 02:10:53,286 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.74 vs. limit=10.0 2023-06-21 02:11:31,784 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:11:34,434 INFO [train.py:996] (0/4) Epoch 5, batch 20500, loss[loss=0.2228, simple_loss=0.2836, pruned_loss=0.08104, over 21336.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3362, pruned_loss=0.09972, over 4259921.77 frames. ], batch size: 144, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:12:09,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=854994.0, ans=0.1 2023-06-21 02:12:17,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 2.857e+02 3.263e+02 3.907e+02 6.416e+02, threshold=6.525e+02, percent-clipped=0.0 2023-06-21 02:12:17,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=854994.0, ans=0.0 2023-06-21 02:12:30,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=855054.0, ans=0.125 2023-06-21 02:12:42,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=855054.0, ans=0.0 2023-06-21 02:13:09,829 INFO [train.py:996] (0/4) Epoch 5, batch 20550, loss[loss=0.2072, simple_loss=0.2821, pruned_loss=0.06616, over 17001.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3286, pruned_loss=0.09803, over 4259639.41 frames. 
], batch size: 62, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:13:15,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=855174.0, ans=0.2 2023-06-21 02:13:23,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=855174.0, ans=0.125 2023-06-21 02:13:57,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=855294.0, ans=0.0 2023-06-21 02:14:22,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=855354.0, ans=0.2 2023-06-21 02:14:46,308 INFO [train.py:996] (0/4) Epoch 5, batch 20600, loss[loss=0.2538, simple_loss=0.3256, pruned_loss=0.09097, over 21856.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3308, pruned_loss=0.0949, over 4254588.24 frames. ], batch size: 371, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:15:05,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=855534.0, ans=0.1 2023-06-21 02:15:08,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=855534.0, ans=0.125 2023-06-21 02:15:35,934 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 2.804e+02 3.280e+02 3.757e+02 7.089e+02, threshold=6.559e+02, percent-clipped=1.0 2023-06-21 02:15:44,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=855654.0, ans=0.2 2023-06-21 02:16:26,184 INFO [train.py:996] (0/4) Epoch 5, batch 20650, loss[loss=0.2636, simple_loss=0.3155, pruned_loss=0.1059, over 21815.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3275, pruned_loss=0.09563, over 4249331.28 frames. ], batch size: 351, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:16:37,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=855774.0, ans=0.125 2023-06-21 02:17:16,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=855894.0, ans=0.95 2023-06-21 02:17:59,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=856014.0, ans=0.1 2023-06-21 02:18:04,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=856014.0, ans=0.125 2023-06-21 02:18:06,943 INFO [train.py:996] (0/4) Epoch 5, batch 20700, loss[loss=0.3649, simple_loss=0.4277, pruned_loss=0.151, over 21483.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3203, pruned_loss=0.09175, over 4261754.17 frames. 
], batch size: 471, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:18:32,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=856134.0, ans=0.125 2023-06-21 02:18:45,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=856134.0, ans=0.0 2023-06-21 02:18:57,427 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.666e+02 3.123e+02 3.714e+02 6.425e+02, threshold=6.247e+02, percent-clipped=0.0 2023-06-21 02:19:52,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.97 vs. limit=8.0 2023-06-21 02:19:53,138 INFO [train.py:996] (0/4) Epoch 5, batch 20750, loss[loss=0.2715, simple_loss=0.3722, pruned_loss=0.08543, over 21743.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.323, pruned_loss=0.09088, over 4265105.54 frames. ], batch size: 351, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:20:01,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=856374.0, ans=0.1 2023-06-21 02:20:10,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=856374.0, ans=0.0 2023-06-21 02:21:29,578 INFO [train.py:996] (0/4) Epoch 5, batch 20800, loss[loss=0.238, simple_loss=0.3015, pruned_loss=0.08729, over 21829.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.327, pruned_loss=0.09225, over 4268284.48 frames. ], batch size: 352, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:22:14,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=856794.0, ans=0.1 2023-06-21 02:22:25,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.108e+02 3.901e+02 5.599e+02 9.709e+02, threshold=7.803e+02, percent-clipped=19.0 2023-06-21 02:22:48,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=856914.0, ans=0.125 2023-06-21 02:23:14,844 INFO [train.py:996] (0/4) Epoch 5, batch 20850, loss[loss=0.2318, simple_loss=0.2933, pruned_loss=0.08521, over 21851.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3189, pruned_loss=0.08979, over 4265480.30 frames. ], batch size: 98, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:23:52,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=857094.0, ans=0.125 2023-06-21 02:24:10,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=857094.0, ans=0.125 2023-06-21 02:24:29,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=857214.0, ans=0.0 2023-06-21 02:24:32,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=857214.0, ans=0.0 2023-06-21 02:24:55,606 INFO [train.py:996] (0/4) Epoch 5, batch 20900, loss[loss=0.2405, simple_loss=0.3104, pruned_loss=0.08528, over 21351.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.319, pruned_loss=0.09065, over 4270932.17 frames. 
], batch size: 131, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:25:44,709 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.849e+02 3.550e+02 4.829e+02 8.716e+02, threshold=7.101e+02, percent-clipped=1.0 2023-06-21 02:25:49,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=857394.0, ans=0.1 2023-06-21 02:26:24,236 INFO [train.py:996] (0/4) Epoch 5, batch 20950, loss[loss=0.2227, simple_loss=0.2907, pruned_loss=0.07731, over 21456.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3162, pruned_loss=0.08787, over 4270867.92 frames. ], batch size: 471, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:26:54,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=857634.0, ans=0.1 2023-06-21 02:26:58,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=857634.0, ans=0.1 2023-06-21 02:27:47,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=857814.0, ans=0.125 2023-06-21 02:28:02,775 INFO [train.py:996] (0/4) Epoch 5, batch 21000, loss[loss=0.2445, simple_loss=0.302, pruned_loss=0.09349, over 21394.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3134, pruned_loss=0.08723, over 4266673.73 frames. ], batch size: 159, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:28:02,777 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 02:28:23,291 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2707, simple_loss=0.3706, pruned_loss=0.0854, over 1796401.00 frames. 2023-06-21 02:28:23,292 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 02:29:02,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=857994.0, ans=0.125 2023-06-21 02:29:03,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=857994.0, ans=0.2 2023-06-21 02:29:08,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.446e+02 2.941e+02 3.372e+02 5.847e+02, threshold=5.881e+02, percent-clipped=0.0 2023-06-21 02:29:15,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=858054.0, ans=0.125 2023-06-21 02:29:28,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2023-06-21 02:29:30,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=858114.0, ans=0.1 2023-06-21 02:29:32,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=858114.0, ans=0.125 2023-06-21 02:29:52,545 INFO [train.py:996] (0/4) Epoch 5, batch 21050, loss[loss=0.2506, simple_loss=0.3211, pruned_loss=0.09006, over 21594.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3114, pruned_loss=0.08749, over 4274597.02 frames. 
], batch size: 414, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:30:26,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=858234.0, ans=0.2 2023-06-21 02:30:51,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=858354.0, ans=0.125 2023-06-21 02:31:11,377 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-21 02:31:31,905 INFO [train.py:996] (0/4) Epoch 5, batch 21100, loss[loss=0.2491, simple_loss=0.3118, pruned_loss=0.09323, over 21913.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3086, pruned_loss=0.08747, over 4268546.94 frames. ], batch size: 107, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:31:37,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=858474.0, ans=0.0 2023-06-21 02:32:23,052 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.646e+02 3.104e+02 3.741e+02 7.727e+02, threshold=6.208e+02, percent-clipped=4.0 2023-06-21 02:33:10,729 INFO [train.py:996] (0/4) Epoch 5, batch 21150, loss[loss=0.2234, simple_loss=0.2817, pruned_loss=0.08262, over 21092.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3042, pruned_loss=0.08738, over 4269282.23 frames. ], batch size: 176, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:33:11,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=12.0 2023-06-21 02:33:15,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=858774.0, ans=0.09899494936611666 2023-06-21 02:33:37,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=858834.0, ans=0.1 2023-06-21 02:33:55,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=858894.0, ans=0.0 2023-06-21 02:34:00,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=858894.0, ans=0.125 2023-06-21 02:34:27,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=859014.0, ans=0.125 2023-06-21 02:34:49,193 INFO [train.py:996] (0/4) Epoch 5, batch 21200, loss[loss=0.2112, simple_loss=0.2602, pruned_loss=0.08112, over 20271.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.2998, pruned_loss=0.08597, over 4265328.35 frames. ], batch size: 703, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:35:12,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=859134.0, ans=0.125 2023-06-21 02:35:41,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=859194.0, ans=0.0 2023-06-21 02:35:42,266 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.556e+02 2.983e+02 3.477e+02 7.677e+02, threshold=5.965e+02, percent-clipped=1.0 2023-06-21 02:36:30,239 INFO [train.py:996] (0/4) Epoch 5, batch 21250, loss[loss=0.2205, simple_loss=0.2855, pruned_loss=0.07772, over 21173.00 frames. 
], tot_loss[loss=0.2344, simple_loss=0.2974, pruned_loss=0.08572, over 4265089.80 frames. ], batch size: 143, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:37:10,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=859494.0, ans=0.0 2023-06-21 02:37:20,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=859494.0, ans=0.1 2023-06-21 02:37:23,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=859494.0, ans=0.125 2023-06-21 02:37:26,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=859554.0, ans=0.05 2023-06-21 02:37:41,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0 2023-06-21 02:37:44,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. limit=10.0 2023-06-21 02:37:45,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=859614.0, ans=0.0 2023-06-21 02:38:09,625 INFO [train.py:996] (0/4) Epoch 5, batch 21300, loss[loss=0.2878, simple_loss=0.3459, pruned_loss=0.1148, over 21878.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3044, pruned_loss=0.08875, over 4268282.99 frames. ], batch size: 414, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:38:17,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.84 vs. limit=15.0 2023-06-21 02:38:23,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=859674.0, ans=0.1 2023-06-21 02:39:01,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=859794.0, ans=0.05 2023-06-21 02:39:01,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=859794.0, ans=0.0 2023-06-21 02:39:02,885 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 2.969e+02 3.329e+02 4.486e+02 8.975e+02, threshold=6.657e+02, percent-clipped=6.0 2023-06-21 02:39:21,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=859854.0, ans=0.0 2023-06-21 02:39:47,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=859914.0, ans=0.2 2023-06-21 02:39:50,043 INFO [train.py:996] (0/4) Epoch 5, batch 21350, loss[loss=0.2307, simple_loss=0.3279, pruned_loss=0.06673, over 21268.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3078, pruned_loss=0.08866, over 4263114.68 frames. 
], batch size: 548, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:40:03,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=859974.0, ans=0.0 2023-06-21 02:40:04,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=860034.0, ans=0.125 2023-06-21 02:40:35,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=860094.0, ans=0.125 2023-06-21 02:40:40,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=860094.0, ans=0.0 2023-06-21 02:41:29,874 INFO [train.py:996] (0/4) Epoch 5, batch 21400, loss[loss=0.2922, simple_loss=0.3622, pruned_loss=0.1112, over 21772.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3111, pruned_loss=0.08753, over 4265998.71 frames. ], batch size: 441, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:41:52,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=860334.0, ans=0.125 2023-06-21 02:42:22,706 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.768e+02 3.163e+02 3.686e+02 6.049e+02, threshold=6.326e+02, percent-clipped=0.0 2023-06-21 02:42:26,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=860454.0, ans=0.125 2023-06-21 02:42:31,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=860454.0, ans=0.125 2023-06-21 02:42:37,708 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-21 02:43:05,746 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2023-06-21 02:43:09,486 INFO [train.py:996] (0/4) Epoch 5, batch 21450, loss[loss=0.2718, simple_loss=0.3345, pruned_loss=0.1045, over 21885.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3156, pruned_loss=0.08979, over 4274697.50 frames. ], batch size: 107, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:43:54,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=860694.0, ans=0.0 2023-06-21 02:43:54,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=860694.0, ans=0.2 2023-06-21 02:44:00,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=860694.0, ans=0.0 2023-06-21 02:44:05,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=860754.0, ans=0.0 2023-06-21 02:44:05,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=860754.0, ans=0.125 2023-06-21 02:44:34,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=860814.0, ans=0.0 2023-06-21 02:44:48,136 INFO [train.py:996] (0/4) Epoch 5, batch 21500, loss[loss=0.2876, simple_loss=0.3775, pruned_loss=0.0989, over 19861.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3138, pruned_loss=0.09064, over 4264017.54 frames. 
], batch size: 703, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:45:31,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=860994.0, ans=0.0 2023-06-21 02:45:40,395 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 3.006e+02 3.483e+02 4.242e+02 6.315e+02, threshold=6.966e+02, percent-clipped=0.0 2023-06-21 02:45:50,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=861054.0, ans=0.0 2023-06-21 02:45:59,173 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:45:59,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-21 02:46:08,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=861054.0, ans=0.2 2023-06-21 02:46:10,830 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:46:26,543 INFO [train.py:996] (0/4) Epoch 5, batch 21550, loss[loss=0.2473, simple_loss=0.3065, pruned_loss=0.09404, over 21596.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3085, pruned_loss=0.08924, over 4256080.21 frames. ], batch size: 415, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:47:45,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=861354.0, ans=12.0 2023-06-21 02:47:46,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=861354.0, ans=0.0 2023-06-21 02:47:48,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=861354.0, ans=0.2 2023-06-21 02:47:49,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=861414.0, ans=0.125 2023-06-21 02:48:03,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-21 02:48:07,453 INFO [train.py:996] (0/4) Epoch 5, batch 21600, loss[loss=0.2426, simple_loss=0.3274, pruned_loss=0.07895, over 21649.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3057, pruned_loss=0.08759, over 4251037.75 frames. 
], batch size: 298, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 02:48:32,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=861534.0, ans=0.125 2023-06-21 02:48:35,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=861534.0, ans=0.0 2023-06-21 02:49:05,663 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.906e+02 3.367e+02 4.103e+02 7.141e+02, threshold=6.734e+02, percent-clipped=1.0 2023-06-21 02:49:06,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=861594.0, ans=0.0 2023-06-21 02:49:26,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=861654.0, ans=0.1 2023-06-21 02:49:43,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=861714.0, ans=0.125 2023-06-21 02:49:46,667 INFO [train.py:996] (0/4) Epoch 5, batch 21650, loss[loss=0.1968, simple_loss=0.2887, pruned_loss=0.0525, over 21638.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3086, pruned_loss=0.08502, over 4254810.65 frames. ], batch size: 263, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 02:50:40,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=861894.0, ans=0.1 2023-06-21 02:50:47,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=861954.0, ans=0.125 2023-06-21 02:51:26,201 INFO [train.py:996] (0/4) Epoch 5, batch 21700, loss[loss=0.2653, simple_loss=0.3068, pruned_loss=0.1119, over 21269.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3073, pruned_loss=0.08273, over 4259649.69 frames. ], batch size: 507, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 02:51:59,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=862134.0, ans=0.2 2023-06-21 02:52:22,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.607e+02 2.964e+02 3.424e+02 5.516e+02, threshold=5.928e+02, percent-clipped=0.0 2023-06-21 02:52:24,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=862194.0, ans=0.125 2023-06-21 02:52:32,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=862254.0, ans=0.125 2023-06-21 02:52:55,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=862314.0, ans=0.125 2023-06-21 02:52:55,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=862314.0, ans=0.2 2023-06-21 02:53:10,476 INFO [train.py:996] (0/4) Epoch 5, batch 21750, loss[loss=0.2163, simple_loss=0.2712, pruned_loss=0.08071, over 21591.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3029, pruned_loss=0.08264, over 4259827.25 frames. 
], batch size: 247, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:53:14,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=862374.0, ans=0.95 2023-06-21 02:53:34,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=862434.0, ans=0.1 2023-06-21 02:53:36,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=862434.0, ans=0.1 2023-06-21 02:53:46,327 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.86 vs. limit=10.0 2023-06-21 02:54:01,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=862494.0, ans=0.125 2023-06-21 02:54:49,593 INFO [train.py:996] (0/4) Epoch 5, batch 21800, loss[loss=0.2585, simple_loss=0.3315, pruned_loss=0.09274, over 21891.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3029, pruned_loss=0.08423, over 4258586.02 frames. ], batch size: 373, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:55:20,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=862734.0, ans=0.1 2023-06-21 02:55:38,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=862794.0, ans=0.1 2023-06-21 02:55:43,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 2.757e+02 3.112e+02 3.604e+02 5.308e+02, threshold=6.224e+02, percent-clipped=0.0 2023-06-21 02:56:00,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=862854.0, ans=0.05 2023-06-21 02:56:29,371 INFO [train.py:996] (0/4) Epoch 5, batch 21850, loss[loss=0.239, simple_loss=0.3349, pruned_loss=0.07155, over 21628.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3094, pruned_loss=0.08483, over 4266635.86 frames. ], batch size: 263, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 02:56:57,396 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-21 02:57:15,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=863094.0, ans=0.125 2023-06-21 02:57:26,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=863094.0, ans=0.0 2023-06-21 02:57:42,810 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:58:08,185 INFO [train.py:996] (0/4) Epoch 5, batch 21900, loss[loss=0.2509, simple_loss=0.3198, pruned_loss=0.09098, over 21244.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3086, pruned_loss=0.08591, over 4278942.03 frames. ], batch size: 159, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 02:58:49,862 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. 
limit=10.0 2023-06-21 02:59:06,544 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 2.823e+02 3.229e+02 3.710e+02 5.018e+02, threshold=6.457e+02, percent-clipped=0.0 2023-06-21 02:59:24,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=12.0 2023-06-21 02:59:46,637 INFO [train.py:996] (0/4) Epoch 5, batch 21950, loss[loss=0.2616, simple_loss=0.3052, pruned_loss=0.109, over 20164.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3046, pruned_loss=0.08599, over 4281652.89 frames. ], batch size: 703, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:00:01,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=863574.0, ans=0.0 2023-06-21 03:01:13,784 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2023-06-21 03:01:27,639 INFO [train.py:996] (0/4) Epoch 5, batch 22000, loss[loss=0.2858, simple_loss=0.3382, pruned_loss=0.1167, over 21322.00 frames. ], tot_loss[loss=0.233, simple_loss=0.2991, pruned_loss=0.08345, over 4285490.09 frames. ], batch size: 471, lr: 6.08e-03, grad_scale: 32.0 2023-06-21 03:01:44,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=863874.0, ans=0.125 2023-06-21 03:01:55,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=863934.0, ans=0.2 2023-06-21 03:02:01,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=863934.0, ans=0.0 2023-06-21 03:02:09,750 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-144000.pt 2023-06-21 03:02:29,964 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.512e+02 2.928e+02 3.420e+02 5.826e+02, threshold=5.856e+02, percent-clipped=0.0 2023-06-21 03:03:08,623 INFO [train.py:996] (0/4) Epoch 5, batch 22050, loss[loss=0.1882, simple_loss=0.2627, pruned_loss=0.0569, over 16157.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3034, pruned_loss=0.08448, over 4268554.61 frames. 
], batch size: 60, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:03:18,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=864174.0, ans=0.125 2023-06-21 03:03:34,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=864234.0, ans=0.0 2023-06-21 03:03:37,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=864234.0, ans=0.1 2023-06-21 03:04:03,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=864294.0, ans=0.0 2023-06-21 03:04:03,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=864294.0, ans=0.125 2023-06-21 03:04:11,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=864354.0, ans=0.125 2023-06-21 03:04:15,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=864354.0, ans=0.125 2023-06-21 03:04:22,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=864354.0, ans=0.125 2023-06-21 03:04:28,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=864414.0, ans=0.0 2023-06-21 03:04:52,781 INFO [train.py:996] (0/4) Epoch 5, batch 22100, loss[loss=0.2717, simple_loss=0.3318, pruned_loss=0.1059, over 21401.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3149, pruned_loss=0.08998, over 4277157.64 frames. ], batch size: 176, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:05:23,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-06-21 03:05:29,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=864594.0, ans=0.1 2023-06-21 03:05:36,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=864594.0, ans=0.0 2023-06-21 03:05:48,626 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 3.305e+02 3.693e+02 4.258e+02 6.395e+02, threshold=7.386e+02, percent-clipped=3.0 2023-06-21 03:05:55,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=864654.0, ans=0.125 2023-06-21 03:05:56,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-21 03:06:14,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=864714.0, ans=0.0 2023-06-21 03:06:21,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=864714.0, ans=0.2 2023-06-21 03:06:31,559 INFO [train.py:996] (0/4) Epoch 5, batch 22150, loss[loss=0.2567, simple_loss=0.3196, pruned_loss=0.09691, over 21845.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3183, pruned_loss=0.09178, over 4284766.35 frames. 
], batch size: 282, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:06:48,344 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=15.0 2023-06-21 03:06:54,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=864834.0, ans=0.1 2023-06-21 03:07:25,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=864894.0, ans=0.5 2023-06-21 03:07:33,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=864954.0, ans=0.05 2023-06-21 03:07:59,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=865014.0, ans=0.2 2023-06-21 03:08:06,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=865014.0, ans=0.2 2023-06-21 03:08:10,584 INFO [train.py:996] (0/4) Epoch 5, batch 22200, loss[loss=0.2618, simple_loss=0.3406, pruned_loss=0.09152, over 21258.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3202, pruned_loss=0.09278, over 4288081.63 frames. ], batch size: 159, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:08:24,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=865074.0, ans=10.0 2023-06-21 03:08:28,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=865074.0, ans=0.0 2023-06-21 03:09:09,745 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 3.019e+02 3.347e+02 3.956e+02 6.093e+02, threshold=6.693e+02, percent-clipped=0.0 2023-06-21 03:09:14,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=865254.0, ans=0.07 2023-06-21 03:09:44,998 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.63 vs. limit=15.0 2023-06-21 03:09:57,884 INFO [train.py:996] (0/4) Epoch 5, batch 22250, loss[loss=0.244, simple_loss=0.3519, pruned_loss=0.06806, over 20866.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3262, pruned_loss=0.09431, over 4289862.83 frames. ], batch size: 608, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:10:31,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=865434.0, ans=0.1 2023-06-21 03:10:32,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=865434.0, ans=0.2 2023-06-21 03:10:46,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=865494.0, ans=0.2 2023-06-21 03:10:56,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=865554.0, ans=0.0 2023-06-21 03:11:24,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=865614.0, ans=0.1 2023-06-21 03:11:26,602 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. 
limit=15.0 2023-06-21 03:11:32,241 INFO [train.py:996] (0/4) Epoch 5, batch 22300, loss[loss=0.256, simple_loss=0.3188, pruned_loss=0.09654, over 21860.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3292, pruned_loss=0.09721, over 4290310.96 frames. ], batch size: 371, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:11:59,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=865734.0, ans=0.2 2023-06-21 03:12:07,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=865794.0, ans=0.2 2023-06-21 03:12:18,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=865794.0, ans=0.125 2023-06-21 03:12:27,221 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.377e+02 3.190e+02 3.753e+02 5.122e+02 1.002e+03, threshold=7.506e+02, percent-clipped=11.0 2023-06-21 03:12:33,162 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-06-21 03:13:14,458 INFO [train.py:996] (0/4) Epoch 5, batch 22350, loss[loss=0.2311, simple_loss=0.2979, pruned_loss=0.08216, over 21654.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3266, pruned_loss=0.0972, over 4297517.54 frames. ], batch size: 263, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:13:39,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=866034.0, ans=0.2 2023-06-21 03:14:35,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=866214.0, ans=0.125 2023-06-21 03:14:53,842 INFO [train.py:996] (0/4) Epoch 5, batch 22400, loss[loss=0.222, simple_loss=0.3324, pruned_loss=0.05581, over 20846.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3222, pruned_loss=0.09338, over 4286943.75 frames. ], batch size: 607, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:15:45,058 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.706e+02 3.089e+02 3.768e+02 7.797e+02, threshold=6.178e+02, percent-clipped=1.0 2023-06-21 03:15:47,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=866454.0, ans=0.0 2023-06-21 03:15:58,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=866454.0, ans=0.125 2023-06-21 03:16:04,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=866454.0, ans=10.0 2023-06-21 03:16:32,983 INFO [train.py:996] (0/4) Epoch 5, batch 22450, loss[loss=0.2269, simple_loss=0.2784, pruned_loss=0.08765, over 21694.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3149, pruned_loss=0.09087, over 4276268.13 frames. ], batch size: 283, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:16:57,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=12.0 2023-06-21 03:17:38,886 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=22.5 2023-06-21 03:18:14,434 INFO [train.py:996] (0/4) Epoch 5, batch 22500, loss[loss=0.2688, simple_loss=0.3453, pruned_loss=0.09616, over 21262.00 frames. 
], tot_loss[loss=0.2468, simple_loss=0.3112, pruned_loss=0.09118, over 4273383.89 frames. ], batch size: 176, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:18:22,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=866874.0, ans=0.125 2023-06-21 03:18:25,138 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-21 03:18:39,238 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-21 03:19:05,447 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.870e+02 3.254e+02 4.012e+02 8.224e+02, threshold=6.508e+02, percent-clipped=4.0 2023-06-21 03:19:35,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.18 vs. limit=15.0 2023-06-21 03:19:53,760 INFO [train.py:996] (0/4) Epoch 5, batch 22550, loss[loss=0.2514, simple_loss=0.343, pruned_loss=0.0799, over 21615.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3164, pruned_loss=0.09194, over 4277397.22 frames. ], batch size: 263, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:20:08,732 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:20:10,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=867174.0, ans=0.1 2023-06-21 03:20:20,612 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:20:22,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=867234.0, ans=10.0 2023-06-21 03:20:32,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=867294.0, ans=0.1 2023-06-21 03:21:17,344 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-21 03:21:22,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=867414.0, ans=0.125 2023-06-21 03:21:24,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=867414.0, ans=0.2 2023-06-21 03:21:40,222 INFO [train.py:996] (0/4) Epoch 5, batch 22600, loss[loss=0.1839, simple_loss=0.2218, pruned_loss=0.07302, over 16642.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.32, pruned_loss=0.09258, over 4279132.36 frames. ], batch size: 63, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:21:53,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=867474.0, ans=0.1 2023-06-21 03:22:04,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=867534.0, ans=0.125 2023-06-21 03:22:30,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.96 vs. 
limit=15.0 2023-06-21 03:22:40,291 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 3.062e+02 3.535e+02 4.633e+02 8.415e+02, threshold=7.070e+02, percent-clipped=6.0 2023-06-21 03:22:48,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=867654.0, ans=0.125 2023-06-21 03:23:01,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=867714.0, ans=0.125 2023-06-21 03:23:01,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=867714.0, ans=0.0 2023-06-21 03:23:18,532 INFO [train.py:996] (0/4) Epoch 5, batch 22650, loss[loss=0.2239, simple_loss=0.278, pruned_loss=0.08494, over 21578.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3184, pruned_loss=0.09181, over 4275100.36 frames. ], batch size: 231, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:23:25,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=867774.0, ans=0.125 2023-06-21 03:23:55,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=867894.0, ans=0.0 2023-06-21 03:24:11,606 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.41 vs. limit=5.0 2023-06-21 03:24:57,516 INFO [train.py:996] (0/4) Epoch 5, batch 22700, loss[loss=0.2282, simple_loss=0.2803, pruned_loss=0.08803, over 21603.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3111, pruned_loss=0.09084, over 4266897.10 frames. ], batch size: 231, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:25:07,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=868074.0, ans=0.95 2023-06-21 03:25:59,506 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.710e+02 3.113e+02 3.866e+02 5.786e+02, threshold=6.226e+02, percent-clipped=0.0 2023-06-21 03:26:01,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=868254.0, ans=0.0 2023-06-21 03:26:30,992 INFO [train.py:996] (0/4) Epoch 5, batch 22750, loss[loss=0.2814, simple_loss=0.3423, pruned_loss=0.1102, over 21454.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3113, pruned_loss=0.09246, over 4272607.46 frames. ], batch size: 194, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:26:57,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=868434.0, ans=0.0 2023-06-21 03:27:17,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=868494.0, ans=0.2 2023-06-21 03:27:18,460 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.56 vs. limit=15.0 2023-06-21 03:28:15,801 INFO [train.py:996] (0/4) Epoch 5, batch 22800, loss[loss=0.2399, simple_loss=0.3128, pruned_loss=0.08347, over 21884.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3157, pruned_loss=0.09469, over 4277687.95 frames. 
], batch size: 107, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:28:57,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=868794.0, ans=0.125 2023-06-21 03:28:57,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=868794.0, ans=0.125 2023-06-21 03:29:09,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=868794.0, ans=0.125 2023-06-21 03:29:17,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.397e+02 2.823e+02 3.345e+02 3.974e+02 6.068e+02, threshold=6.691e+02, percent-clipped=0.0 2023-06-21 03:29:23,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-21 03:29:35,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=868914.0, ans=0.125 2023-06-21 03:29:49,119 INFO [train.py:996] (0/4) Epoch 5, batch 22850, loss[loss=0.2534, simple_loss=0.2993, pruned_loss=0.1037, over 21189.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3108, pruned_loss=0.09309, over 4278781.51 frames. ], batch size: 159, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:29:49,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=868974.0, ans=0.125 2023-06-21 03:30:28,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=869034.0, ans=0.125 2023-06-21 03:30:53,432 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-21 03:31:11,871 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-21 03:31:33,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=869214.0, ans=0.1 2023-06-21 03:31:35,938 INFO [train.py:996] (0/4) Epoch 5, batch 22900, loss[loss=0.329, simple_loss=0.4204, pruned_loss=0.1188, over 21513.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3145, pruned_loss=0.09295, over 4268242.90 frames. ], batch size: 471, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:31:36,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=869274.0, ans=0.0 2023-06-21 03:32:05,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=869334.0, ans=0.2 2023-06-21 03:32:39,207 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.436e+02 3.293e+02 3.917e+02 5.124e+02 7.831e+02, threshold=7.834e+02, percent-clipped=10.0 2023-06-21 03:33:15,497 INFO [train.py:996] (0/4) Epoch 5, batch 22950, loss[loss=0.2607, simple_loss=0.3636, pruned_loss=0.07893, over 21484.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3281, pruned_loss=0.09133, over 4267388.70 frames. 
], batch size: 195, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:34:12,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=869694.0, ans=0.0 2023-06-21 03:34:18,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=869754.0, ans=0.04949747468305833 2023-06-21 03:34:19,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=869754.0, ans=0.2 2023-06-21 03:34:52,762 INFO [train.py:996] (0/4) Epoch 5, batch 23000, loss[loss=0.2728, simple_loss=0.3304, pruned_loss=0.1076, over 21231.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3273, pruned_loss=0.08946, over 4278327.70 frames. ], batch size: 143, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:35:47,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=869994.0, ans=0.2 2023-06-21 03:35:56,036 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.793e+02 3.379e+02 3.965e+02 7.564e+02, threshold=6.759e+02, percent-clipped=0.0 2023-06-21 03:36:43,318 INFO [train.py:996] (0/4) Epoch 5, batch 23050, loss[loss=0.3097, simple_loss=0.3614, pruned_loss=0.129, over 21820.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3296, pruned_loss=0.09236, over 4272066.50 frames. ], batch size: 441, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:36:59,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=870174.0, ans=0.2 2023-06-21 03:37:10,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=870234.0, ans=0.1 2023-06-21 03:37:18,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=870294.0, ans=0.125 2023-06-21 03:37:56,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.96 vs. limit=15.0 2023-06-21 03:38:12,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=870414.0, ans=0.0 2023-06-21 03:38:22,828 INFO [train.py:996] (0/4) Epoch 5, batch 23100, loss[loss=0.1942, simple_loss=0.2487, pruned_loss=0.06978, over 20789.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3242, pruned_loss=0.09288, over 4273300.66 frames. 
], batch size: 609, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:38:35,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=870474.0, ans=0.125 2023-06-21 03:38:37,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=870474.0, ans=0.125 2023-06-21 03:38:48,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=870534.0, ans=0.2 2023-06-21 03:38:51,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=870534.0, ans=0.0 2023-06-21 03:38:53,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=870534.0, ans=0.2 2023-06-21 03:38:58,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=870594.0, ans=0.125 2023-06-21 03:39:08,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=870594.0, ans=0.1 2023-06-21 03:39:10,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=870594.0, ans=0.0 2023-06-21 03:39:20,134 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 2.864e+02 3.390e+02 4.261e+02 7.523e+02, threshold=6.780e+02, percent-clipped=3.0 2023-06-21 03:39:26,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-21 03:39:28,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=870654.0, ans=0.125 2023-06-21 03:39:39,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=870714.0, ans=0.0 2023-06-21 03:40:00,817 INFO [train.py:996] (0/4) Epoch 5, batch 23150, loss[loss=0.2294, simple_loss=0.2891, pruned_loss=0.08488, over 21808.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3186, pruned_loss=0.09201, over 4279271.78 frames. ], batch size: 298, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:40:02,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=870774.0, ans=0.125 2023-06-21 03:40:15,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=870774.0, ans=15.0 2023-06-21 03:40:15,623 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.81 vs. limit=10.0 2023-06-21 03:40:18,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=870774.0, ans=0.125 2023-06-21 03:40:29,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.43 vs. 
limit=6.0 2023-06-21 03:40:38,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=870894.0, ans=0.125 2023-06-21 03:40:46,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=870894.0, ans=0.125 2023-06-21 03:41:28,459 INFO [train.py:996] (0/4) Epoch 5, batch 23200, loss[loss=0.2305, simple_loss=0.2956, pruned_loss=0.08275, over 21585.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3182, pruned_loss=0.09324, over 4289770.01 frames. ], batch size: 212, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:42:29,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.864e+02 3.235e+02 3.730e+02 5.431e+02, threshold=6.469e+02, percent-clipped=0.0 2023-06-21 03:42:33,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=871254.0, ans=0.2 2023-06-21 03:43:11,499 INFO [train.py:996] (0/4) Epoch 5, batch 23250, loss[loss=0.225, simple_loss=0.2958, pruned_loss=0.07707, over 21678.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3181, pruned_loss=0.09456, over 4297008.42 frames. ], batch size: 230, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:43:33,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-21 03:43:46,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=871434.0, ans=0.125 2023-06-21 03:43:54,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=871494.0, ans=0.1 2023-06-21 03:44:57,869 INFO [train.py:996] (0/4) Epoch 5, batch 23300, loss[loss=0.2581, simple_loss=0.3494, pruned_loss=0.08346, over 21743.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3264, pruned_loss=0.09628, over 4297104.96 frames. ], batch size: 351, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:45:00,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=871674.0, ans=0.1 2023-06-21 03:45:52,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 2.979e+02 3.447e+02 3.938e+02 6.103e+02, threshold=6.894e+02, percent-clipped=0.0 2023-06-21 03:46:33,357 INFO [train.py:996] (0/4) Epoch 5, batch 23350, loss[loss=0.1615, simple_loss=0.2436, pruned_loss=0.03974, over 21426.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3307, pruned_loss=0.09515, over 4278459.03 frames. ], batch size: 211, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:46:37,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=871974.0, ans=0.125 2023-06-21 03:48:11,202 INFO [train.py:996] (0/4) Epoch 5, batch 23400, loss[loss=0.2598, simple_loss=0.3246, pruned_loss=0.09748, over 21475.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3241, pruned_loss=0.0912, over 4282111.91 frames. ], batch size: 548, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:48:38,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=872334.0, ans=0.125 2023-06-21 03:48:47,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. 
limit=15.0 2023-06-21 03:49:17,350 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 2.664e+02 3.181e+02 4.182e+02 6.937e+02, threshold=6.362e+02, percent-clipped=1.0 2023-06-21 03:49:33,548 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:49:52,825 INFO [train.py:996] (0/4) Epoch 5, batch 23450, loss[loss=0.3145, simple_loss=0.3614, pruned_loss=0.1339, over 21483.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3247, pruned_loss=0.09374, over 4282528.82 frames. ], batch size: 509, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:50:11,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=872634.0, ans=0.2 2023-06-21 03:50:45,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=872694.0, ans=0.125 2023-06-21 03:51:04,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=872754.0, ans=0.1 2023-06-21 03:51:23,568 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.24 vs. limit=6.0 2023-06-21 03:51:30,854 INFO [train.py:996] (0/4) Epoch 5, batch 23500, loss[loss=0.2551, simple_loss=0.3182, pruned_loss=0.09599, over 21846.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3244, pruned_loss=0.09537, over 4282213.64 frames. ], batch size: 124, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:51:39,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=872874.0, ans=0.125 2023-06-21 03:51:54,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=872934.0, ans=0.1 2023-06-21 03:52:38,549 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.401e+02 3.046e+02 3.693e+02 4.776e+02 9.117e+02, threshold=7.385e+02, percent-clipped=5.0 2023-06-21 03:53:08,224 INFO [train.py:996] (0/4) Epoch 5, batch 23550, loss[loss=0.2415, simple_loss=0.3018, pruned_loss=0.09058, over 21744.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3205, pruned_loss=0.09531, over 4275982.27 frames. ], batch size: 112, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:53:12,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=873174.0, ans=0.2 2023-06-21 03:54:26,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=873354.0, ans=0.0 2023-06-21 03:54:30,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=12.0 2023-06-21 03:54:46,881 INFO [train.py:996] (0/4) Epoch 5, batch 23600, loss[loss=0.302, simple_loss=0.3666, pruned_loss=0.1187, over 21347.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3193, pruned_loss=0.09473, over 4271561.64 frames. 
], batch size: 159, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 03:55:13,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=873534.0, ans=0.0 2023-06-21 03:55:29,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=873594.0, ans=0.1 2023-06-21 03:55:53,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=873654.0, ans=0.125 2023-06-21 03:55:55,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.629e+02 3.088e+02 3.713e+02 7.100e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-21 03:56:03,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=873654.0, ans=0.0 2023-06-21 03:56:19,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=873714.0, ans=0.125 2023-06-21 03:56:32,185 INFO [train.py:996] (0/4) Epoch 5, batch 23650, loss[loss=0.2332, simple_loss=0.3003, pruned_loss=0.08301, over 20093.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3194, pruned_loss=0.09265, over 4269313.03 frames. ], batch size: 704, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 03:56:44,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=873774.0, ans=0.125 2023-06-21 03:57:59,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=874014.0, ans=0.1 2023-06-21 03:58:04,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=874014.0, ans=0.125 2023-06-21 03:58:09,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=874014.0, ans=0.0 2023-06-21 03:58:13,484 INFO [train.py:996] (0/4) Epoch 5, batch 23700, loss[loss=0.2706, simple_loss=0.3427, pruned_loss=0.09925, over 21736.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3229, pruned_loss=0.09257, over 4279261.25 frames. ], batch size: 441, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 03:58:56,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=874134.0, ans=0.1 2023-06-21 03:59:16,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=874254.0, ans=0.125 2023-06-21 03:59:17,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.032e+02 3.536e+02 4.190e+02 7.050e+02, threshold=7.071e+02, percent-clipped=3.0 2023-06-21 03:59:53,576 INFO [train.py:996] (0/4) Epoch 5, batch 23750, loss[loss=0.2511, simple_loss=0.3163, pruned_loss=0.09294, over 21158.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3265, pruned_loss=0.09399, over 4286716.97 frames. 
], batch size: 143, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 04:00:04,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=874374.0, ans=0.0 2023-06-21 04:00:43,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=874494.0, ans=10.0 2023-06-21 04:01:38,917 INFO [train.py:996] (0/4) Epoch 5, batch 23800, loss[loss=0.246, simple_loss=0.3083, pruned_loss=0.09178, over 21775.00 frames. ], tot_loss[loss=0.252, simple_loss=0.323, pruned_loss=0.0905, over 4279901.13 frames. ], batch size: 124, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:01:41,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=874674.0, ans=0.1 2023-06-21 04:01:56,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=874674.0, ans=0.2 2023-06-21 04:02:43,709 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.704e+02 3.208e+02 4.045e+02 9.409e+02, threshold=6.416e+02, percent-clipped=3.0 2023-06-21 04:03:19,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=874914.0, ans=0.125 2023-06-21 04:03:29,780 INFO [train.py:996] (0/4) Epoch 5, batch 23850, loss[loss=0.3032, simple_loss=0.4155, pruned_loss=0.0954, over 19760.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3333, pruned_loss=0.09246, over 4278313.90 frames. ], batch size: 702, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:04:51,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=875214.0, ans=0.07 2023-06-21 04:04:53,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=875214.0, ans=0.125 2023-06-21 04:04:58,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=875214.0, ans=0.0 2023-06-21 04:05:04,030 INFO [train.py:996] (0/4) Epoch 5, batch 23900, loss[loss=0.2824, simple_loss=0.352, pruned_loss=0.1064, over 21730.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3408, pruned_loss=0.09507, over 4282483.85 frames. ], batch size: 351, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:05:55,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=875454.0, ans=0.1 2023-06-21 04:05:55,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=875454.0, ans=0.125 2023-06-21 04:06:02,869 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 3.113e+02 3.559e+02 4.462e+02 8.067e+02, threshold=7.118e+02, percent-clipped=8.0 2023-06-21 04:06:03,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=875454.0, ans=0.0 2023-06-21 04:06:41,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=875574.0, ans=0.2 2023-06-21 04:06:42,008 INFO [train.py:996] (0/4) Epoch 5, batch 23950, loss[loss=0.268, simple_loss=0.3341, pruned_loss=0.101, over 21712.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3332, pruned_loss=0.09412, over 4272227.57 frames. 
], batch size: 351, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:07:36,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=875694.0, ans=0.0 2023-06-21 04:07:47,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=875754.0, ans=0.2 2023-06-21 04:07:48,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-21 04:08:01,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=875754.0, ans=0.125 2023-06-21 04:08:21,285 INFO [train.py:996] (0/4) Epoch 5, batch 24000, loss[loss=0.2573, simple_loss=0.3292, pruned_loss=0.09272, over 21526.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3355, pruned_loss=0.09819, over 4280681.59 frames. ], batch size: 230, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:08:21,286 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 04:08:38,095 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2683, simple_loss=0.3693, pruned_loss=0.08367, over 1796401.00 frames. 2023-06-21 04:08:38,096 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 04:08:53,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=875874.0, ans=0.125 2023-06-21 04:09:43,024 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.369e+02 3.178e+02 3.721e+02 4.593e+02 6.442e+02, threshold=7.441e+02, percent-clipped=0.0 2023-06-21 04:10:12,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=876174.0, ans=0.07 2023-06-21 04:10:18,512 INFO [train.py:996] (0/4) Epoch 5, batch 24050, loss[loss=0.2279, simple_loss=0.3237, pruned_loss=0.06608, over 21281.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3379, pruned_loss=0.09917, over 4274031.51 frames. ], batch size: 548, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:10:53,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=876234.0, ans=0.125 2023-06-21 04:11:00,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-21 04:11:22,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=876354.0, ans=0.0 2023-06-21 04:11:23,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=876354.0, ans=0.125 2023-06-21 04:11:38,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=876414.0, ans=0.0 2023-06-21 04:11:57,809 INFO [train.py:996] (0/4) Epoch 5, batch 24100, loss[loss=0.264, simple_loss=0.3328, pruned_loss=0.0976, over 21407.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.338, pruned_loss=0.09732, over 4273255.57 frames. 
], batch size: 194, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:12:49,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=876594.0, ans=0.025 2023-06-21 04:12:57,149 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.320e+02 2.915e+02 3.291e+02 4.001e+02 6.877e+02, threshold=6.582e+02, percent-clipped=0.0 2023-06-21 04:13:28,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=876714.0, ans=0.1 2023-06-21 04:13:31,058 INFO [train.py:996] (0/4) Epoch 5, batch 24150, loss[loss=0.3045, simple_loss=0.348, pruned_loss=0.1305, over 21709.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3376, pruned_loss=0.09944, over 4282738.83 frames. ], batch size: 507, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:13:43,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=876774.0, ans=0.125 2023-06-21 04:14:38,750 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:14:48,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=877014.0, ans=0.125 2023-06-21 04:15:11,018 INFO [train.py:996] (0/4) Epoch 5, batch 24200, loss[loss=0.292, simple_loss=0.3725, pruned_loss=0.1058, over 21615.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3393, pruned_loss=0.1004, over 4285269.37 frames. ], batch size: 389, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:15:15,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=877074.0, ans=0.125 2023-06-21 04:15:26,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=877074.0, ans=0.09899494936611666 2023-06-21 04:16:17,770 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.992e+02 3.434e+02 4.148e+02 5.774e+02, threshold=6.868e+02, percent-clipped=0.0 2023-06-21 04:16:58,535 INFO [train.py:996] (0/4) Epoch 5, batch 24250, loss[loss=0.2009, simple_loss=0.3026, pruned_loss=0.04958, over 21698.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3355, pruned_loss=0.09337, over 4278534.90 frames. 
], batch size: 247, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:17:07,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=877374.0, ans=0.125 2023-06-21 04:17:26,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=877434.0, ans=0.1 2023-06-21 04:17:30,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=877434.0, ans=0.1 2023-06-21 04:18:09,910 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:18:23,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=877614.0, ans=0.1 2023-06-21 04:18:31,858 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:18:37,961 INFO [train.py:996] (0/4) Epoch 5, batch 24300, loss[loss=0.1902, simple_loss=0.2826, pruned_loss=0.04893, over 21572.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3269, pruned_loss=0.08698, over 4270224.92 frames. ], batch size: 441, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:18:59,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=877674.0, ans=0.1 2023-06-21 04:19:02,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=877734.0, ans=0.0 2023-06-21 04:19:03,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=877734.0, ans=0.125 2023-06-21 04:19:15,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=877734.0, ans=0.025 2023-06-21 04:19:38,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=877854.0, ans=0.025 2023-06-21 04:19:42,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.429e+02 3.041e+02 4.140e+02 6.830e+02, threshold=6.081e+02, percent-clipped=0.0 2023-06-21 04:20:16,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=877914.0, ans=0.125 2023-06-21 04:20:20,906 INFO [train.py:996] (0/4) Epoch 5, batch 24350, loss[loss=0.2707, simple_loss=0.3355, pruned_loss=0.1029, over 21862.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3243, pruned_loss=0.08784, over 4277964.92 frames. ], batch size: 371, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:20:32,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=877974.0, ans=0.2 2023-06-21 04:20:50,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=878034.0, ans=0.2 2023-06-21 04:20:58,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=878094.0, ans=0.0 2023-06-21 04:21:13,278 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.82 vs. 
limit=15.0 2023-06-21 04:21:16,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=878094.0, ans=0.07 2023-06-21 04:21:19,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-21 04:21:48,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=878214.0, ans=0.0 2023-06-21 04:22:04,680 INFO [train.py:996] (0/4) Epoch 5, batch 24400, loss[loss=0.2842, simple_loss=0.3966, pruned_loss=0.08596, over 19666.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3303, pruned_loss=0.09263, over 4273171.72 frames. ], batch size: 702, lr: 6.03e-03, grad_scale: 32.0 2023-06-21 04:22:05,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.36 vs. limit=6.0 2023-06-21 04:22:38,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=878334.0, ans=0.025 2023-06-21 04:22:46,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=878394.0, ans=0.015 2023-06-21 04:23:06,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.330e+02 3.732e+02 4.584e+02 7.697e+02, threshold=7.464e+02, percent-clipped=2.0 2023-06-21 04:23:08,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=878454.0, ans=0.0 2023-06-21 04:23:18,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=878454.0, ans=0.0 2023-06-21 04:23:33,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=878514.0, ans=0.0 2023-06-21 04:23:35,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=878514.0, ans=0.125 2023-06-21 04:23:36,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=878514.0, ans=0.0 2023-06-21 04:23:38,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=878514.0, ans=0.125 2023-06-21 04:23:43,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=878574.0, ans=0.125 2023-06-21 04:23:44,675 INFO [train.py:996] (0/4) Epoch 5, batch 24450, loss[loss=0.225, simple_loss=0.2992, pruned_loss=0.0754, over 21398.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3315, pruned_loss=0.09268, over 4269944.45 frames. ], batch size: 194, lr: 6.03e-03, grad_scale: 32.0 2023-06-21 04:23:57,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=878574.0, ans=0.125 2023-06-21 04:25:07,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=878814.0, ans=0.09899494936611666 2023-06-21 04:25:23,358 INFO [train.py:996] (0/4) Epoch 5, batch 24500, loss[loss=0.2553, simple_loss=0.3266, pruned_loss=0.092, over 21609.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3306, pruned_loss=0.09212, over 4271400.69 frames. 
], batch size: 471, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:25:25,737 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.089e-03 2023-06-21 04:26:20,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=879054.0, ans=0.0 2023-06-21 04:26:31,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.806e+02 3.370e+02 4.048e+02 6.223e+02, threshold=6.740e+02, percent-clipped=0.0 2023-06-21 04:26:50,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=879114.0, ans=0.125 2023-06-21 04:27:02,342 INFO [train.py:996] (0/4) Epoch 5, batch 24550, loss[loss=0.3089, simple_loss=0.3727, pruned_loss=0.1226, over 21601.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3318, pruned_loss=0.09402, over 4271259.67 frames. ], batch size: 389, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:27:24,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=879234.0, ans=0.0 2023-06-21 04:27:59,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=879294.0, ans=0.125 2023-06-21 04:28:42,277 INFO [train.py:996] (0/4) Epoch 5, batch 24600, loss[loss=0.3002, simple_loss=0.352, pruned_loss=0.1242, over 21446.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3282, pruned_loss=0.09516, over 4272481.22 frames. ], batch size: 473, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:28:59,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=879474.0, ans=0.0 2023-06-21 04:29:53,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.443e+02 3.125e+02 3.627e+02 4.480e+02 7.581e+02, threshold=7.254e+02, percent-clipped=2.0 2023-06-21 04:30:10,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=879714.0, ans=0.05 2023-06-21 04:30:21,441 INFO [train.py:996] (0/4) Epoch 5, batch 24650, loss[loss=0.2117, simple_loss=0.2761, pruned_loss=0.07361, over 21851.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3203, pruned_loss=0.0937, over 4267990.89 frames. ], batch size: 373, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:30:35,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=879774.0, ans=0.125 2023-06-21 04:30:40,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=879774.0, ans=0.0 2023-06-21 04:31:01,204 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.575e-03 2023-06-21 04:31:33,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=879954.0, ans=0.0 2023-06-21 04:31:57,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-21 04:32:07,133 INFO [train.py:996] (0/4) Epoch 5, batch 24700, loss[loss=0.2568, simple_loss=0.3089, pruned_loss=0.1023, over 21338.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3193, pruned_loss=0.09249, over 4258789.50 frames. 
], batch size: 473, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:33:13,146 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.830e+02 3.084e+02 3.762e+02 5.962e+02, threshold=6.167e+02, percent-clipped=0.0 2023-06-21 04:33:24,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=880314.0, ans=0.1 2023-06-21 04:33:38,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=880374.0, ans=0.1 2023-06-21 04:33:39,767 INFO [train.py:996] (0/4) Epoch 5, batch 24750, loss[loss=0.2085, simple_loss=0.2735, pruned_loss=0.07179, over 21854.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3129, pruned_loss=0.08966, over 4267772.78 frames. ], batch size: 107, lr: 6.02e-03, grad_scale: 8.0 2023-06-21 04:34:15,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=880434.0, ans=0.2 2023-06-21 04:34:39,476 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:35:18,255 INFO [train.py:996] (0/4) Epoch 5, batch 24800, loss[loss=0.2193, simple_loss=0.2801, pruned_loss=0.07928, over 21489.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3072, pruned_loss=0.08889, over 4270457.43 frames. ], batch size: 195, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:35:47,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=880734.0, ans=0.1 2023-06-21 04:36:20,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=880854.0, ans=0.125 2023-06-21 04:36:31,839 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.672e+02 2.950e+02 3.460e+02 6.225e+02, threshold=5.900e+02, percent-clipped=1.0 2023-06-21 04:36:57,254 INFO [train.py:996] (0/4) Epoch 5, batch 24850, loss[loss=0.2276, simple_loss=0.2807, pruned_loss=0.08724, over 21310.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3083, pruned_loss=0.09097, over 4280928.58 frames. ], batch size: 176, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:37:01,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=880974.0, ans=0.125 2023-06-21 04:38:36,652 INFO [train.py:996] (0/4) Epoch 5, batch 24900, loss[loss=0.2577, simple_loss=0.356, pruned_loss=0.07975, over 20799.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3115, pruned_loss=0.09164, over 4287792.82 frames. ], batch size: 608, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:39:31,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=881394.0, ans=0.125 2023-06-21 04:39:33,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.49 vs. 
limit=15.0 2023-06-21 04:39:35,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=881394.0, ans=0.0 2023-06-21 04:39:40,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=881394.0, ans=0.0 2023-06-21 04:39:51,281 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 3.066e+02 3.454e+02 4.012e+02 6.143e+02, threshold=6.909e+02, percent-clipped=1.0 2023-06-21 04:39:59,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=881514.0, ans=0.125 2023-06-21 04:40:06,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=881514.0, ans=0.125 2023-06-21 04:40:17,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=881514.0, ans=0.2 2023-06-21 04:40:22,256 INFO [train.py:996] (0/4) Epoch 5, batch 24950, loss[loss=0.3243, simple_loss=0.3757, pruned_loss=0.1364, over 21351.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3211, pruned_loss=0.09669, over 4286218.15 frames. ], batch size: 143, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:40:34,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=881574.0, ans=0.0 2023-06-21 04:41:07,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=881634.0, ans=15.0 2023-06-21 04:41:34,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=881754.0, ans=0.2 2023-06-21 04:41:42,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=881754.0, ans=0.2 2023-06-21 04:42:02,092 INFO [train.py:996] (0/4) Epoch 5, batch 25000, loss[loss=0.2431, simple_loss=0.3333, pruned_loss=0.07646, over 21947.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3281, pruned_loss=0.09855, over 4276871.50 frames. ], batch size: 317, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:42:52,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=881994.0, ans=0.125 2023-06-21 04:43:01,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=881994.0, ans=0.125 2023-06-21 04:43:10,525 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.862e+02 3.406e+02 4.060e+02 6.504e+02, threshold=6.812e+02, percent-clipped=0.0 2023-06-21 04:43:29,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-21 04:43:46,045 INFO [train.py:996] (0/4) Epoch 5, batch 25050, loss[loss=0.2369, simple_loss=0.2909, pruned_loss=0.09148, over 21290.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3207, pruned_loss=0.09651, over 4281150.09 frames. 
], batch size: 144, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:44:06,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=882234.0, ans=0.125 2023-06-21 04:44:08,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=15.0 2023-06-21 04:44:33,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=882294.0, ans=0.2 2023-06-21 04:44:49,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=882354.0, ans=0.125 2023-06-21 04:44:57,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=882354.0, ans=0.125 2023-06-21 04:45:20,807 INFO [train.py:996] (0/4) Epoch 5, batch 25100, loss[loss=0.2431, simple_loss=0.3184, pruned_loss=0.08388, over 21192.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3138, pruned_loss=0.09427, over 4278485.05 frames. ], batch size: 159, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:45:21,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=882474.0, ans=0.1 2023-06-21 04:45:28,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=882474.0, ans=0.0 2023-06-21 04:45:42,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=882474.0, ans=0.0 2023-06-21 04:46:23,982 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.27 vs. limit=10.0 2023-06-21 04:46:25,546 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-21 04:46:29,033 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.734e+02 3.137e+02 3.918e+02 6.199e+02, threshold=6.274e+02, percent-clipped=0.0 2023-06-21 04:46:45,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=882714.0, ans=0.125 2023-06-21 04:46:51,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=882714.0, ans=0.2 2023-06-21 04:46:54,754 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:46:59,088 INFO [train.py:996] (0/4) Epoch 5, batch 25150, loss[loss=0.2448, simple_loss=0.3266, pruned_loss=0.08152, over 21766.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3187, pruned_loss=0.09253, over 4274234.84 frames. ], batch size: 298, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:47:55,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=882894.0, ans=0.125 2023-06-21 04:48:04,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=882954.0, ans=0.0 2023-06-21 04:48:34,815 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. 
limit=15.0 2023-06-21 04:48:37,289 INFO [train.py:996] (0/4) Epoch 5, batch 25200, loss[loss=0.2462, simple_loss=0.3168, pruned_loss=0.08777, over 16292.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3169, pruned_loss=0.08942, over 4266588.51 frames. ], batch size: 62, lr: 6.02e-03, grad_scale: 32.0 2023-06-21 04:48:45,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=883074.0, ans=0.0 2023-06-21 04:49:19,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=883134.0, ans=0.125 2023-06-21 04:49:22,853 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:49:42,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=883254.0, ans=0.0 2023-06-21 04:49:44,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=883254.0, ans=0.0 2023-06-21 04:49:46,916 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.669e+02 3.257e+02 4.012e+02 7.318e+02, threshold=6.513e+02, percent-clipped=2.0 2023-06-21 04:49:59,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=883314.0, ans=0.2 2023-06-21 04:50:00,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-21 04:50:17,422 INFO [train.py:996] (0/4) Epoch 5, batch 25250, loss[loss=0.2093, simple_loss=0.2728, pruned_loss=0.07293, over 21194.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3143, pruned_loss=0.0873, over 4268031.06 frames. ], batch size: 548, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:50:29,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=883374.0, ans=0.0 2023-06-21 04:50:45,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=883434.0, ans=0.125 2023-06-21 04:50:58,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=883494.0, ans=0.125 2023-06-21 04:51:25,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=883554.0, ans=0.125 2023-06-21 04:51:57,250 INFO [train.py:996] (0/4) Epoch 5, batch 25300, loss[loss=0.3586, simple_loss=0.4021, pruned_loss=0.1575, over 21373.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3124, pruned_loss=0.08702, over 4260226.41 frames. 
], batch size: 508, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:52:35,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=883734.0, ans=0.1 2023-06-21 04:52:44,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=883794.0, ans=0.125 2023-06-21 04:53:02,448 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.770e+02 3.140e+02 3.813e+02 4.907e+02, threshold=6.281e+02, percent-clipped=0.0 2023-06-21 04:53:33,706 INFO [train.py:996] (0/4) Epoch 5, batch 25350, loss[loss=0.186, simple_loss=0.2602, pruned_loss=0.05584, over 21231.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3156, pruned_loss=0.08694, over 4255119.52 frames. ], batch size: 159, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:54:51,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.45 vs. limit=10.0 2023-06-21 04:54:55,180 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-21 04:55:07,855 INFO [train.py:996] (0/4) Epoch 5, batch 25400, loss[loss=0.2516, simple_loss=0.3056, pruned_loss=0.09875, over 21643.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3098, pruned_loss=0.0855, over 4259162.37 frames. ], batch size: 247, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:56:15,910 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.763e+02 3.058e+02 3.669e+02 6.374e+02, threshold=6.116e+02, percent-clipped=1.0 2023-06-21 04:56:17,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=884454.0, ans=0.125 2023-06-21 04:56:29,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=884514.0, ans=0.1 2023-06-21 04:56:46,813 INFO [train.py:996] (0/4) Epoch 5, batch 25450, loss[loss=0.2853, simple_loss=0.3718, pruned_loss=0.09944, over 21486.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3124, pruned_loss=0.088, over 4263415.24 frames. ], batch size: 471, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:57:46,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=884694.0, ans=0.125 2023-06-21 04:57:51,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=884754.0, ans=0.2 2023-06-21 04:58:08,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=884754.0, ans=0.125 2023-06-21 04:58:31,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=15.0 2023-06-21 04:58:31,848 INFO [train.py:996] (0/4) Epoch 5, batch 25500, loss[loss=0.2256, simple_loss=0.2857, pruned_loss=0.08275, over 15584.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3126, pruned_loss=0.08531, over 4244225.69 frames. 
], batch size: 62, lr: 6.01e-03, grad_scale: 16.0 2023-06-21 04:59:44,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.815e+02 3.207e+02 3.771e+02 6.756e+02, threshold=6.413e+02, percent-clipped=1.0 2023-06-21 04:59:46,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=22.5 2023-06-21 05:00:00,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=885114.0, ans=0.125 2023-06-21 05:00:13,107 INFO [train.py:996] (0/4) Epoch 5, batch 25550, loss[loss=0.2828, simple_loss=0.3915, pruned_loss=0.08706, over 20724.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3196, pruned_loss=0.08645, over 4243954.49 frames. ], batch size: 607, lr: 6.01e-03, grad_scale: 16.0 2023-06-21 05:00:41,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=885234.0, ans=0.125 2023-06-21 05:01:36,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=885414.0, ans=0.1 2023-06-21 05:02:02,418 INFO [train.py:996] (0/4) Epoch 5, batch 25600, loss[loss=0.2553, simple_loss=0.3326, pruned_loss=0.08907, over 21443.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3232, pruned_loss=0.08686, over 4249178.24 frames. ], batch size: 548, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 05:02:11,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=885474.0, ans=0.05 2023-06-21 05:02:12,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=885474.0, ans=0.125 2023-06-21 05:02:12,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=885474.0, ans=0.125 2023-06-21 05:02:37,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=885594.0, ans=0.125 2023-06-21 05:03:03,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.775e+02 3.286e+02 3.783e+02 5.833e+02, threshold=6.573e+02, percent-clipped=0.0 2023-06-21 05:03:41,883 INFO [train.py:996] (0/4) Epoch 5, batch 25650, loss[loss=0.2389, simple_loss=0.2995, pruned_loss=0.08917, over 21785.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3239, pruned_loss=0.0901, over 4243028.52 frames. ], batch size: 118, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 05:03:51,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=885774.0, ans=0.0 2023-06-21 05:04:00,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=885834.0, ans=0.125 2023-06-21 05:04:11,185 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=12.0 2023-06-21 05:05:08,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=886014.0, ans=0.1 2023-06-21 05:05:21,305 INFO [train.py:996] (0/4) Epoch 5, batch 25700, loss[loss=0.2839, simple_loss=0.3468, pruned_loss=0.1105, over 21403.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3217, pruned_loss=0.09165, over 4256790.79 frames. 
], batch size: 131, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 05:05:21,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=886074.0, ans=0.04949747468305833 2023-06-21 05:05:28,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=886074.0, ans=0.2 2023-06-21 05:06:19,668 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-21 05:06:23,796 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.865e+02 3.376e+02 4.055e+02 7.604e+02, threshold=6.752e+02, percent-clipped=2.0 2023-06-21 05:06:37,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=886314.0, ans=0.0 2023-06-21 05:06:37,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=886314.0, ans=0.0 2023-06-21 05:06:44,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=886314.0, ans=0.2 2023-06-21 05:06:47,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=886314.0, ans=0.0 2023-06-21 05:06:48,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=886314.0, ans=0.125 2023-06-21 05:06:56,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=886314.0, ans=0.1 2023-06-21 05:06:56,794 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-21 05:06:58,904 INFO [train.py:996] (0/4) Epoch 5, batch 25750, loss[loss=0.3362, simple_loss=0.4185, pruned_loss=0.1269, over 21846.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3265, pruned_loss=0.09378, over 4264625.10 frames. ], batch size: 371, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:07:28,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=886434.0, ans=0.125 2023-06-21 05:07:52,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=886494.0, ans=0.0 2023-06-21 05:08:14,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=886554.0, ans=0.0 2023-06-21 05:08:16,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=886554.0, ans=0.125 2023-06-21 05:08:46,844 INFO [train.py:996] (0/4) Epoch 5, batch 25800, loss[loss=0.2824, simple_loss=0.3471, pruned_loss=0.1088, over 21441.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3377, pruned_loss=0.09752, over 4272007.43 frames. 
], batch size: 211, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:09:58,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=886854.0, ans=0.0 2023-06-21 05:09:59,474 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.944e+02 3.581e+02 4.306e+02 8.254e+02, threshold=7.162e+02, percent-clipped=3.0 2023-06-21 05:10:26,656 INFO [train.py:996] (0/4) Epoch 5, batch 25850, loss[loss=0.2588, simple_loss=0.323, pruned_loss=0.09726, over 21470.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3385, pruned_loss=0.09644, over 4277667.42 frames. ], batch size: 194, lr: 6.00e-03, grad_scale: 16.0 2023-06-21 05:10:27,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=886974.0, ans=0.1 2023-06-21 05:10:50,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-06-21 05:11:12,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=887094.0, ans=0.125 2023-06-21 05:11:25,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=887094.0, ans=0.0 2023-06-21 05:11:55,372 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-21 05:12:07,873 INFO [train.py:996] (0/4) Epoch 5, batch 25900, loss[loss=0.3424, simple_loss=0.4239, pruned_loss=0.1305, over 21671.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3402, pruned_loss=0.09725, over 4280208.64 frames. ], batch size: 414, lr: 6.00e-03, grad_scale: 16.0 2023-06-21 05:13:22,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=887454.0, ans=0.125 2023-06-21 05:13:26,833 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 3.088e+02 3.549e+02 4.240e+02 5.933e+02, threshold=7.098e+02, percent-clipped=0.0 2023-06-21 05:13:44,432 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-06-21 05:13:58,665 INFO [train.py:996] (0/4) Epoch 5, batch 25950, loss[loss=0.2359, simple_loss=0.3102, pruned_loss=0.08076, over 20717.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3472, pruned_loss=0.1008, over 4279223.81 frames. 
], batch size: 607, lr: 6.00e-03, grad_scale: 16.0 2023-06-21 05:14:47,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=887694.0, ans=10.0 2023-06-21 05:14:48,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=887694.0, ans=0.0 2023-06-21 05:14:53,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=887694.0, ans=0.125 2023-06-21 05:14:55,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=887694.0, ans=0.95 2023-06-21 05:15:11,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=887754.0, ans=0.125 2023-06-21 05:15:40,352 INFO [train.py:996] (0/4) Epoch 5, batch 26000, loss[loss=0.2773, simple_loss=0.352, pruned_loss=0.1012, over 21652.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3452, pruned_loss=0.09838, over 4272482.95 frames. ], batch size: 263, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:16:25,288 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-148000.pt 2023-06-21 05:16:46,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=888054.0, ans=0.0 2023-06-21 05:16:48,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.51 vs. limit=10.0 2023-06-21 05:16:52,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.179e+02 2.994e+02 3.502e+02 4.127e+02 6.076e+02, threshold=7.004e+02, percent-clipped=0.0 2023-06-21 05:16:53,556 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.99 vs. limit=22.5 2023-06-21 05:17:19,597 INFO [train.py:996] (0/4) Epoch 5, batch 26050, loss[loss=0.3412, simple_loss=0.4363, pruned_loss=0.1231, over 19766.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3448, pruned_loss=0.09952, over 4275006.17 frames. ], batch size: 703, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:17:29,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=888174.0, ans=0.1 2023-06-21 05:17:30,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.36 vs. limit=12.0 2023-06-21 05:18:40,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-06-21 05:18:58,133 INFO [train.py:996] (0/4) Epoch 5, batch 26100, loss[loss=0.2678, simple_loss=0.33, pruned_loss=0.1028, over 21890.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3394, pruned_loss=0.09853, over 4282324.74 frames. ], batch size: 371, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:19:09,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.99 vs. 
limit=12.0 2023-06-21 05:19:15,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=888474.0, ans=0.125 2023-06-21 05:19:15,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=888474.0, ans=0.0 2023-06-21 05:19:21,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.79 vs. limit=8.0 2023-06-21 05:19:36,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=888534.0, ans=0.2 2023-06-21 05:20:01,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=888654.0, ans=0.0 2023-06-21 05:20:05,885 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.461e+02 2.983e+02 3.615e+02 4.836e+02 1.225e+03, threshold=7.230e+02, percent-clipped=7.0 2023-06-21 05:20:39,135 INFO [train.py:996] (0/4) Epoch 5, batch 26150, loss[loss=0.2777, simple_loss=0.3441, pruned_loss=0.1057, over 21935.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3372, pruned_loss=0.09973, over 4293087.98 frames. ], batch size: 372, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:21:45,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=888954.0, ans=0.0 2023-06-21 05:21:54,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=888954.0, ans=0.125 2023-06-21 05:22:20,402 INFO [train.py:996] (0/4) Epoch 5, batch 26200, loss[loss=0.2518, simple_loss=0.3452, pruned_loss=0.0792, over 21658.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3375, pruned_loss=0.09793, over 4289068.61 frames. ], batch size: 263, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:22:30,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=889074.0, ans=0.125 2023-06-21 05:22:56,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=889134.0, ans=0.125 2023-06-21 05:23:00,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.30 vs. limit=10.0 2023-06-21 05:23:12,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=889194.0, ans=0.0 2023-06-21 05:23:29,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=889254.0, ans=0.0 2023-06-21 05:23:33,376 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.909e+02 3.359e+02 4.257e+02 6.778e+02, threshold=6.718e+02, percent-clipped=0.0 2023-06-21 05:24:01,203 INFO [train.py:996] (0/4) Epoch 5, batch 26250, loss[loss=0.2821, simple_loss=0.3498, pruned_loss=0.1072, over 21758.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.34, pruned_loss=0.0964, over 4289702.75 frames. 
], batch size: 112, lr: 5.99e-03, grad_scale: 32.0 2023-06-21 05:24:03,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=889374.0, ans=0.0 2023-06-21 05:24:44,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=889494.0, ans=0.0 2023-06-21 05:24:58,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=889494.0, ans=0.1 2023-06-21 05:24:59,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=889494.0, ans=0.125 2023-06-21 05:25:39,650 INFO [train.py:996] (0/4) Epoch 5, batch 26300, loss[loss=0.2829, simple_loss=0.3473, pruned_loss=0.1093, over 21439.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3369, pruned_loss=0.09668, over 4295139.05 frames. ], batch size: 131, lr: 5.99e-03, grad_scale: 32.0 2023-06-21 05:25:51,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=889674.0, ans=0.125 2023-06-21 05:26:04,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=889734.0, ans=0.125 2023-06-21 05:26:58,703 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 2.888e+02 3.226e+02 3.870e+02 6.035e+02, threshold=6.451e+02, percent-clipped=0.0 2023-06-21 05:27:25,032 INFO [train.py:996] (0/4) Epoch 5, batch 26350, loss[loss=0.3779, simple_loss=0.414, pruned_loss=0.1709, over 21346.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3352, pruned_loss=0.09695, over 4285415.17 frames. ], batch size: 507, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:27:36,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=889974.0, ans=0.125 2023-06-21 05:27:45,192 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.61 vs. limit=15.0 2023-06-21 05:28:06,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=890094.0, ans=0.125 2023-06-21 05:28:59,024 INFO [train.py:996] (0/4) Epoch 5, batch 26400, loss[loss=0.228, simple_loss=0.2781, pruned_loss=0.0889, over 21372.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3293, pruned_loss=0.09688, over 4279362.98 frames. ], batch size: 194, lr: 5.99e-03, grad_scale: 32.0 2023-06-21 05:29:17,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=890334.0, ans=0.125 2023-06-21 05:29:21,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.16 vs. 
limit=15.0 2023-06-21 05:29:46,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=890394.0, ans=0.0 2023-06-21 05:30:10,883 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 2.951e+02 3.748e+02 4.421e+02 1.228e+03, threshold=7.496e+02, percent-clipped=6.0 2023-06-21 05:30:13,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=890454.0, ans=0.125 2023-06-21 05:30:22,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=890514.0, ans=0.0 2023-06-21 05:30:37,995 INFO [train.py:996] (0/4) Epoch 5, batch 26450, loss[loss=0.3018, simple_loss=0.3944, pruned_loss=0.1046, over 21719.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3273, pruned_loss=0.09601, over 4273555.97 frames. ], batch size: 332, lr: 5.99e-03, grad_scale: 32.0 2023-06-21 05:30:43,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=890574.0, ans=0.125 2023-06-21 05:31:10,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=890634.0, ans=0.125 2023-06-21 05:31:15,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=890694.0, ans=0.5 2023-06-21 05:31:59,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=890814.0, ans=0.125 2023-06-21 05:32:19,188 INFO [train.py:996] (0/4) Epoch 5, batch 26500, loss[loss=0.1805, simple_loss=0.2381, pruned_loss=0.06145, over 21838.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.329, pruned_loss=0.09391, over 4272808.46 frames. ], batch size: 107, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:32:19,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=890874.0, ans=0.125 2023-06-21 05:32:23,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=890874.0, ans=0.0 2023-06-21 05:33:07,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=890994.0, ans=0.125 2023-06-21 05:33:14,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=890994.0, ans=0.1 2023-06-21 05:33:15,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-21 05:33:27,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=891054.0, ans=15.0 2023-06-21 05:33:42,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.071e+02 3.801e+02 4.543e+02 1.004e+03, threshold=7.603e+02, percent-clipped=5.0 2023-06-21 05:33:59,223 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-21 05:34:01,428 INFO [train.py:996] (0/4) Epoch 5, batch 26550, loss[loss=0.2708, simple_loss=0.3601, pruned_loss=0.09077, over 21549.00 frames. 
], tot_loss[loss=0.253, simple_loss=0.3249, pruned_loss=0.09057, over 4263436.74 frames. ], batch size: 473, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:34:15,681 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2023-06-21 05:34:16,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=891174.0, ans=0.0 2023-06-21 05:34:38,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-21 05:34:50,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=891294.0, ans=0.0 2023-06-21 05:35:37,295 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=12.0 2023-06-21 05:35:40,880 INFO [train.py:996] (0/4) Epoch 5, batch 26600, loss[loss=0.2425, simple_loss=0.313, pruned_loss=0.08598, over 21144.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3245, pruned_loss=0.08769, over 4271671.03 frames. ], batch size: 548, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:36:21,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-21 05:36:27,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=891594.0, ans=0.1 2023-06-21 05:36:27,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=891594.0, ans=0.0 2023-06-21 05:36:41,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=891654.0, ans=0.0 2023-06-21 05:36:52,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=891654.0, ans=0.1 2023-06-21 05:36:59,734 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.044e+02 3.559e+02 4.505e+02 6.702e+02, threshold=7.118e+02, percent-clipped=0.0 2023-06-21 05:37:23,633 INFO [train.py:996] (0/4) Epoch 5, batch 26650, loss[loss=0.227, simple_loss=0.3048, pruned_loss=0.07463, over 21543.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3177, pruned_loss=0.08656, over 4260331.09 frames. ], batch size: 441, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:37:27,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=15.0 2023-06-21 05:37:30,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=891774.0, ans=0.04949747468305833 2023-06-21 05:37:48,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-21 05:38:02,612 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. 
limit=15.0 2023-06-21 05:38:11,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=891894.0, ans=0.125 2023-06-21 05:38:39,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=892014.0, ans=0.125 2023-06-21 05:39:01,922 INFO [train.py:996] (0/4) Epoch 5, batch 26700, loss[loss=0.2061, simple_loss=0.2754, pruned_loss=0.06839, over 21295.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3114, pruned_loss=0.0839, over 4262635.10 frames. ], batch size: 143, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:39:29,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=892134.0, ans=0.2 2023-06-21 05:39:36,739 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.78 vs. limit=15.0 2023-06-21 05:40:18,710 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 2.509e+02 2.895e+02 3.334e+02 4.980e+02, threshold=5.790e+02, percent-clipped=0.0 2023-06-21 05:40:47,754 INFO [train.py:996] (0/4) Epoch 5, batch 26750, loss[loss=0.2155, simple_loss=0.3106, pruned_loss=0.0602, over 21759.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3123, pruned_loss=0.08331, over 4265707.64 frames. ], batch size: 351, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:40:49,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=892374.0, ans=0.0 2023-06-21 05:41:04,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=892434.0, ans=0.125 2023-06-21 05:41:37,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=892494.0, ans=0.2 2023-06-21 05:41:54,030 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=15.0 2023-06-21 05:41:55,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=892554.0, ans=0.025 2023-06-21 05:41:58,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=892554.0, ans=0.0 2023-06-21 05:42:01,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=892614.0, ans=0.2 2023-06-21 05:42:15,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=892614.0, ans=0.125 2023-06-21 05:42:23,664 INFO [train.py:996] (0/4) Epoch 5, batch 26800, loss[loss=0.3097, simple_loss=0.3703, pruned_loss=0.1245, over 21364.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3218, pruned_loss=0.08917, over 4269297.32 frames. 
], batch size: 548, lr: 5.98e-03, grad_scale: 32.0 2023-06-21 05:43:16,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=892794.0, ans=0.0 2023-06-21 05:43:16,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=892794.0, ans=0.125 2023-06-21 05:43:25,685 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.90 vs. limit=12.0 2023-06-21 05:43:42,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=892854.0, ans=0.95 2023-06-21 05:43:45,175 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 3.040e+02 3.566e+02 4.548e+02 6.934e+02, threshold=7.132e+02, percent-clipped=4.0 2023-06-21 05:43:58,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=892914.0, ans=0.125 2023-06-21 05:44:03,878 INFO [train.py:996] (0/4) Epoch 5, batch 26850, loss[loss=0.2218, simple_loss=0.281, pruned_loss=0.08134, over 21535.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3245, pruned_loss=0.0926, over 4274418.95 frames. ], batch size: 230, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:44:09,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=892974.0, ans=0.2 2023-06-21 05:44:49,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=893094.0, ans=0.2 2023-06-21 05:45:05,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=893154.0, ans=0.04949747468305833 2023-06-21 05:45:25,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=893154.0, ans=0.0 2023-06-21 05:45:38,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=893214.0, ans=0.125 2023-06-21 05:45:43,446 INFO [train.py:996] (0/4) Epoch 5, batch 26900, loss[loss=0.2451, simple_loss=0.295, pruned_loss=0.0976, over 21694.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3155, pruned_loss=0.09096, over 4270075.83 frames. ], batch size: 417, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:47:04,567 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.809e+02 3.300e+02 3.699e+02 7.956e+02, threshold=6.601e+02, percent-clipped=1.0 2023-06-21 05:47:22,089 INFO [train.py:996] (0/4) Epoch 5, batch 26950, loss[loss=0.2373, simple_loss=0.329, pruned_loss=0.07276, over 19802.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3153, pruned_loss=0.09093, over 4271767.71 frames. ], batch size: 702, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:48:06,606 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2023-06-21 05:49:02,220 INFO [train.py:996] (0/4) Epoch 5, batch 27000, loss[loss=0.2026, simple_loss=0.2791, pruned_loss=0.06306, over 21517.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.316, pruned_loss=0.0886, over 4269250.92 frames. 
], batch size: 195, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:49:02,221 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 05:49:13,783 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6179, 3.0794, 3.0823, 2.8787], device='cuda:0') 2023-06-21 05:49:20,488 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2444, simple_loss=0.3449, pruned_loss=0.07195, over 1796401.00 frames. 2023-06-21 05:49:20,488 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 05:50:04,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=893994.0, ans=0.125 2023-06-21 05:50:12,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=893994.0, ans=0.0 2023-06-21 05:50:38,952 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.497e+02 2.990e+02 3.496e+02 4.876e+02, threshold=5.980e+02, percent-clipped=0.0 2023-06-21 05:51:01,172 INFO [train.py:996] (0/4) Epoch 5, batch 27050, loss[loss=0.2466, simple_loss=0.3179, pruned_loss=0.08771, over 21839.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3176, pruned_loss=0.08518, over 4266314.53 frames. ], batch size: 124, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:52:42,216 INFO [train.py:996] (0/4) Epoch 5, batch 27100, loss[loss=0.2966, simple_loss=0.3725, pruned_loss=0.1104, over 21689.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3205, pruned_loss=0.0862, over 4274917.46 frames. ], batch size: 441, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:52:49,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=894474.0, ans=0.1 2023-06-21 05:53:12,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=894534.0, ans=0.125 2023-06-21 05:53:14,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-21 05:53:25,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=894594.0, ans=0.0 2023-06-21 05:53:59,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=894654.0, ans=0.125 2023-06-21 05:54:00,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.188e+02 3.922e+02 5.852e+02 9.183e+02, threshold=7.845e+02, percent-clipped=23.0 2023-06-21 05:54:02,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=894714.0, ans=0.0 2023-06-21 05:54:14,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=894714.0, ans=0.125 2023-06-21 05:54:18,407 INFO [train.py:996] (0/4) Epoch 5, batch 27150, loss[loss=0.3454, simple_loss=0.4222, pruned_loss=0.1343, over 21642.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3296, pruned_loss=0.08864, over 4277615.76 frames. 
], batch size: 389, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:54:27,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=894774.0, ans=0.1 2023-06-21 05:54:35,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=894774.0, ans=0.95 2023-06-21 05:55:58,216 INFO [train.py:996] (0/4) Epoch 5, batch 27200, loss[loss=0.2906, simple_loss=0.3672, pruned_loss=0.107, over 21460.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3353, pruned_loss=0.09102, over 4276474.39 frames. ], batch size: 131, lr: 5.98e-03, grad_scale: 32.0 2023-06-21 05:56:21,772 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=12.0 2023-06-21 05:56:35,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=895134.0, ans=0.125 2023-06-21 05:57:23,626 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.260e+02 3.703e+02 4.553e+02 9.386e+02, threshold=7.407e+02, percent-clipped=2.0 2023-06-21 05:57:25,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=895314.0, ans=0.125 2023-06-21 05:57:52,290 INFO [train.py:996] (0/4) Epoch 5, batch 27250, loss[loss=0.2998, simple_loss=0.3571, pruned_loss=0.1213, over 21390.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3405, pruned_loss=0.09607, over 4280351.24 frames. ], batch size: 549, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 05:58:16,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-21 05:58:30,203 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2023-06-21 05:59:23,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=895614.0, ans=0.125 2023-06-21 05:59:27,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=895614.0, ans=0.125 2023-06-21 05:59:33,996 INFO [train.py:996] (0/4) Epoch 5, batch 27300, loss[loss=0.2401, simple_loss=0.3255, pruned_loss=0.07742, over 21786.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.341, pruned_loss=0.09663, over 4278522.97 frames. ], batch size: 332, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 05:59:49,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=895674.0, ans=0.125 2023-06-21 06:00:22,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=895794.0, ans=0.2 2023-06-21 06:00:44,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=895854.0, ans=0.0 2023-06-21 06:00:52,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.86 vs. 
limit=15.0 2023-06-21 06:00:57,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 2.999e+02 3.424e+02 4.068e+02 6.879e+02, threshold=6.848e+02, percent-clipped=0.0 2023-06-21 06:01:01,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=895914.0, ans=0.0 2023-06-21 06:01:15,086 INFO [train.py:996] (0/4) Epoch 5, batch 27350, loss[loss=0.2618, simple_loss=0.3383, pruned_loss=0.09266, over 21908.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3454, pruned_loss=0.09833, over 4276327.91 frames. ], batch size: 316, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:01:49,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=896034.0, ans=0.2 2023-06-21 06:01:49,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=896034.0, ans=0.125 2023-06-21 06:02:49,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=896274.0, ans=0.2 2023-06-21 06:02:54,654 INFO [train.py:996] (0/4) Epoch 5, batch 27400, loss[loss=0.2245, simple_loss=0.2887, pruned_loss=0.08019, over 21682.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3401, pruned_loss=0.09791, over 4280768.23 frames. ], batch size: 230, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:02:55,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-21 06:03:19,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=896334.0, ans=0.125 2023-06-21 06:03:22,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=896334.0, ans=0.125 2023-06-21 06:04:08,149 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.765e+02 3.152e+02 3.980e+02 5.730e+02, threshold=6.304e+02, percent-clipped=0.0 2023-06-21 06:04:14,251 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-21 06:04:34,764 INFO [train.py:996] (0/4) Epoch 5, batch 27450, loss[loss=0.2648, simple_loss=0.3477, pruned_loss=0.09094, over 21733.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3345, pruned_loss=0.09588, over 4274576.86 frames. ], batch size: 351, lr: 5.97e-03, grad_scale: 16.0 2023-06-21 06:05:51,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=896814.0, ans=0.0 2023-06-21 06:05:55,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.19 vs. limit=10.0 2023-06-21 06:06:14,869 INFO [train.py:996] (0/4) Epoch 5, batch 27500, loss[loss=0.2428, simple_loss=0.3078, pruned_loss=0.08894, over 21873.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3333, pruned_loss=0.09695, over 4277295.25 frames. 
], batch size: 298, lr: 5.97e-03, grad_scale: 16.0 2023-06-21 06:07:29,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.898e+02 3.228e+02 3.815e+02 7.854e+02, threshold=6.456e+02, percent-clipped=2.0 2023-06-21 06:07:33,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=897114.0, ans=0.2 2023-06-21 06:07:54,349 INFO [train.py:996] (0/4) Epoch 5, batch 27550, loss[loss=0.332, simple_loss=0.4299, pruned_loss=0.1171, over 19858.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3296, pruned_loss=0.09382, over 4272788.27 frames. ], batch size: 702, lr: 5.97e-03, grad_scale: 16.0 2023-06-21 06:08:19,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=897234.0, ans=0.1 2023-06-21 06:08:43,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=897294.0, ans=0.1 2023-06-21 06:09:29,813 INFO [train.py:996] (0/4) Epoch 5, batch 27600, loss[loss=0.2056, simple_loss=0.2757, pruned_loss=0.06779, over 21202.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3219, pruned_loss=0.09219, over 4275844.81 frames. ], batch size: 176, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:09:53,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=897534.0, ans=0.1 2023-06-21 06:10:43,736 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 2.760e+02 3.130e+02 3.904e+02 5.692e+02, threshold=6.260e+02, percent-clipped=0.0 2023-06-21 06:10:50,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=897714.0, ans=0.0 2023-06-21 06:11:08,645 INFO [train.py:996] (0/4) Epoch 5, batch 27650, loss[loss=0.2535, simple_loss=0.3113, pruned_loss=0.09785, over 21672.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3156, pruned_loss=0.09092, over 4279581.38 frames. ], batch size: 391, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:11:13,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=897774.0, ans=0.0 2023-06-21 06:12:24,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=898014.0, ans=0.0 2023-06-21 06:12:25,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=898014.0, ans=0.0 2023-06-21 06:12:49,025 INFO [train.py:996] (0/4) Epoch 5, batch 27700, loss[loss=0.2154, simple_loss=0.288, pruned_loss=0.07138, over 21274.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3157, pruned_loss=0.08888, over 4279286.18 frames. ], batch size: 159, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:13:09,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-06-21 06:13:57,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. 
limit=15.0 2023-06-21 06:14:00,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=898254.0, ans=0.0 2023-06-21 06:14:07,894 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 3.066e+02 3.761e+02 4.326e+02 8.310e+02, threshold=7.523e+02, percent-clipped=4.0 2023-06-21 06:14:28,411 INFO [train.py:996] (0/4) Epoch 5, batch 27750, loss[loss=0.3218, simple_loss=0.3963, pruned_loss=0.1237, over 21564.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.318, pruned_loss=0.08831, over 4268806.14 frames. ], batch size: 473, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:14:29,699 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.32 vs. limit=15.0 2023-06-21 06:14:53,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=898434.0, ans=0.125 2023-06-21 06:15:29,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=898554.0, ans=0.125 2023-06-21 06:15:59,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=898614.0, ans=0.0 2023-06-21 06:16:06,811 INFO [train.py:996] (0/4) Epoch 5, batch 27800, loss[loss=0.2508, simple_loss=0.3059, pruned_loss=0.09782, over 21443.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3178, pruned_loss=0.0896, over 4280292.69 frames. ], batch size: 159, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:17:26,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.753e+02 3.252e+02 3.951e+02 6.290e+02, threshold=6.504e+02, percent-clipped=0.0 2023-06-21 06:17:45,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=898914.0, ans=0.2 2023-06-21 06:17:48,419 INFO [train.py:996] (0/4) Epoch 5, batch 27850, loss[loss=0.2626, simple_loss=0.3485, pruned_loss=0.08834, over 21773.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3178, pruned_loss=0.09104, over 4285894.29 frames. ], batch size: 414, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:19:05,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=899154.0, ans=0.125 2023-06-21 06:19:17,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=899214.0, ans=0.125 2023-06-21 06:19:31,897 INFO [train.py:996] (0/4) Epoch 5, batch 27900, loss[loss=0.3219, simple_loss=0.4273, pruned_loss=0.1082, over 20856.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3271, pruned_loss=0.09215, over 4282013.57 frames. 
], batch size: 607, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:19:51,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=899274.0, ans=0.1 2023-06-21 06:20:41,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=899454.0, ans=0.125 2023-06-21 06:20:57,633 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 2.911e+02 3.342e+02 3.967e+02 6.742e+02, threshold=6.683e+02, percent-clipped=1.0 2023-06-21 06:21:16,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=899514.0, ans=0.125 2023-06-21 06:21:19,121 INFO [train.py:996] (0/4) Epoch 5, batch 27950, loss[loss=0.2236, simple_loss=0.3143, pruned_loss=0.06643, over 21870.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3278, pruned_loss=0.08884, over 4284505.81 frames. ], batch size: 316, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:21:24,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=899574.0, ans=0.125 2023-06-21 06:21:52,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=899634.0, ans=0.025 2023-06-21 06:22:59,485 INFO [train.py:996] (0/4) Epoch 5, batch 28000, loss[loss=0.2201, simple_loss=0.3106, pruned_loss=0.06478, over 21349.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3253, pruned_loss=0.08626, over 4281123.22 frames. ], batch size: 548, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:23:04,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.21 vs. limit=15.0 2023-06-21 06:23:57,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2023-06-21 06:24:19,328 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.872e+02 3.186e+02 3.800e+02 5.572e+02, threshold=6.373e+02, percent-clipped=0.0 2023-06-21 06:24:39,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=900174.0, ans=0.0 2023-06-21 06:24:40,561 INFO [train.py:996] (0/4) Epoch 5, batch 28050, loss[loss=0.2063, simple_loss=0.2658, pruned_loss=0.07338, over 21463.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.323, pruned_loss=0.08759, over 4288220.50 frames. ], batch size: 211, lr: 5.96e-03, grad_scale: 16.0 2023-06-21 06:24:57,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=900174.0, ans=0.1 2023-06-21 06:25:02,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=900234.0, ans=0.125 2023-06-21 06:25:44,056 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.03 vs. limit=22.5 2023-06-21 06:26:08,203 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:26:20,968 INFO [train.py:996] (0/4) Epoch 5, batch 28100, loss[loss=0.2691, simple_loss=0.3173, pruned_loss=0.1104, over 21496.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.319, pruned_loss=0.08696, over 4271896.43 frames. 
], batch size: 441, lr: 5.96e-03, grad_scale: 16.0 2023-06-21 06:27:47,386 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.042e+02 3.636e+02 4.421e+02 1.163e+03, threshold=7.272e+02, percent-clipped=7.0 2023-06-21 06:27:48,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=900714.0, ans=0.1 2023-06-21 06:27:59,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-21 06:28:07,020 INFO [train.py:996] (0/4) Epoch 5, batch 28150, loss[loss=0.2793, simple_loss=0.4232, pruned_loss=0.06772, over 19760.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3135, pruned_loss=0.08703, over 4273831.26 frames. ], batch size: 702, lr: 5.96e-03, grad_scale: 16.0 2023-06-21 06:28:07,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=900774.0, ans=0.125 2023-06-21 06:28:09,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=900774.0, ans=0.125 2023-06-21 06:29:43,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=901014.0, ans=0.125 2023-06-21 06:29:48,175 INFO [train.py:996] (0/4) Epoch 5, batch 28200, loss[loss=0.2463, simple_loss=0.3138, pruned_loss=0.08945, over 21758.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3109, pruned_loss=0.08834, over 4276779.75 frames. ], batch size: 247, lr: 5.96e-03, grad_scale: 16.0 2023-06-21 06:30:14,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=901134.0, ans=0.125 2023-06-21 06:30:22,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=901134.0, ans=0.125 2023-06-21 06:30:30,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=901194.0, ans=0.05 2023-06-21 06:30:54,494 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-21 06:31:14,330 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.109e+02 3.691e+02 4.482e+02 7.045e+02, threshold=7.382e+02, percent-clipped=0.0 2023-06-21 06:31:33,992 INFO [train.py:996] (0/4) Epoch 5, batch 28250, loss[loss=0.2333, simple_loss=0.291, pruned_loss=0.0878, over 21203.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3149, pruned_loss=0.09174, over 4272423.86 frames. ], batch size: 159, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:33:00,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-06-21 06:33:15,403 INFO [train.py:996] (0/4) Epoch 5, batch 28300, loss[loss=0.198, simple_loss=0.2885, pruned_loss=0.05374, over 21574.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3129, pruned_loss=0.08961, over 4266023.65 frames. 
], batch size: 230, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:33:24,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=901674.0, ans=0.125 2023-06-21 06:34:08,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=901794.0, ans=0.125 2023-06-21 06:34:18,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=901854.0, ans=0.125 2023-06-21 06:34:41,915 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.744e+02 3.366e+02 4.135e+02 8.525e+02, threshold=6.731e+02, percent-clipped=3.0 2023-06-21 06:34:56,284 INFO [train.py:996] (0/4) Epoch 5, batch 28350, loss[loss=0.2168, simple_loss=0.2717, pruned_loss=0.08093, over 21792.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3107, pruned_loss=0.08408, over 4271042.89 frames. ], batch size: 124, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:35:05,252 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.36 vs. limit=15.0 2023-06-21 06:35:31,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=902094.0, ans=0.05 2023-06-21 06:35:38,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=902094.0, ans=0.125 2023-06-21 06:35:45,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.36 vs. limit=10.0 2023-06-21 06:36:00,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-21 06:36:04,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=902154.0, ans=0.1 2023-06-21 06:36:17,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=902154.0, ans=0.125 2023-06-21 06:36:34,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=902214.0, ans=0.2 2023-06-21 06:36:40,531 INFO [train.py:996] (0/4) Epoch 5, batch 28400, loss[loss=0.3237, simple_loss=0.3725, pruned_loss=0.1375, over 21382.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3078, pruned_loss=0.08387, over 4269307.69 frames. ], batch size: 471, lr: 5.95e-03, grad_scale: 32.0 2023-06-21 06:36:48,406 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.53 vs. limit=22.5 2023-06-21 06:36:57,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=902334.0, ans=0.2 2023-06-21 06:37:33,333 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.657e-03 2023-06-21 06:37:54,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=902454.0, ans=0.035 2023-06-21 06:37:56,539 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.49 vs. 
limit=15.0 2023-06-21 06:38:03,537 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 3.069e+02 3.636e+02 4.494e+02 7.236e+02, threshold=7.272e+02, percent-clipped=3.0 2023-06-21 06:38:20,759 INFO [train.py:996] (0/4) Epoch 5, batch 28450, loss[loss=0.2481, simple_loss=0.3161, pruned_loss=0.09, over 21423.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3131, pruned_loss=0.08705, over 4261015.48 frames. ], batch size: 194, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:38:24,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=902574.0, ans=0.125 2023-06-21 06:38:27,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=902574.0, ans=0.0 2023-06-21 06:38:56,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=902694.0, ans=0.0 2023-06-21 06:39:35,375 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:39:46,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=902814.0, ans=0.0 2023-06-21 06:39:59,736 INFO [train.py:996] (0/4) Epoch 5, batch 28500, loss[loss=0.262, simple_loss=0.3273, pruned_loss=0.0984, over 21596.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3143, pruned_loss=0.08887, over 4267931.41 frames. ], batch size: 263, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:40:07,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=902874.0, ans=0.0 2023-06-21 06:40:18,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=902934.0, ans=0.1 2023-06-21 06:40:45,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=902994.0, ans=0.0 2023-06-21 06:41:25,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-21 06:41:28,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 2.871e+02 3.406e+02 3.870e+02 6.038e+02, threshold=6.812e+02, percent-clipped=0.0 2023-06-21 06:41:41,921 INFO [train.py:996] (0/4) Epoch 5, batch 28550, loss[loss=0.2882, simple_loss=0.3508, pruned_loss=0.1128, over 20741.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3236, pruned_loss=0.09331, over 4272617.45 frames. ], batch size: 608, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:42:16,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=903234.0, ans=0.0 2023-06-21 06:42:29,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=903294.0, ans=0.0 2023-06-21 06:42:53,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=15.0 2023-06-21 06:43:10,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=903414.0, ans=0.04949747468305833 2023-06-21 06:43:24,813 INFO [train.py:996] (0/4) Epoch 5, batch 28600, loss[loss=0.3302, simple_loss=0.3849, pruned_loss=0.1377, over 21325.00 frames. 
], tot_loss[loss=0.2612, simple_loss=0.3308, pruned_loss=0.09582, over 4272497.47 frames. ], batch size: 507, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:43:27,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=903474.0, ans=0.1 2023-06-21 06:44:54,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 2.937e+02 3.367e+02 4.039e+02 6.744e+02, threshold=6.734e+02, percent-clipped=0.0 2023-06-21 06:45:12,144 INFO [train.py:996] (0/4) Epoch 5, batch 28650, loss[loss=0.2247, simple_loss=0.2846, pruned_loss=0.0824, over 21510.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3251, pruned_loss=0.0949, over 4276237.60 frames. ], batch size: 391, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:45:43,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=903834.0, ans=0.0 2023-06-21 06:46:27,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=903954.0, ans=0.0 2023-06-21 06:46:56,189 INFO [train.py:996] (0/4) Epoch 5, batch 28700, loss[loss=0.2201, simple_loss=0.2544, pruned_loss=0.09294, over 20085.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3231, pruned_loss=0.09582, over 4276557.54 frames. ], batch size: 704, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:47:59,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=904254.0, ans=0.125 2023-06-21 06:48:13,795 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 2.955e+02 3.205e+02 3.884e+02 6.833e+02, threshold=6.409e+02, percent-clipped=1.0 2023-06-21 06:48:16,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=904314.0, ans=0.2 2023-06-21 06:48:37,257 INFO [train.py:996] (0/4) Epoch 5, batch 28750, loss[loss=0.2642, simple_loss=0.3398, pruned_loss=0.0943, over 21799.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3244, pruned_loss=0.09684, over 4279599.42 frames. ], batch size: 414, lr: 5.94e-03, grad_scale: 16.0 2023-06-21 06:48:53,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=904374.0, ans=0.125 2023-06-21 06:49:09,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=904434.0, ans=0.0 2023-06-21 06:49:18,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=904494.0, ans=0.125 2023-06-21 06:49:24,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=904494.0, ans=0.125 2023-06-21 06:49:37,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=904554.0, ans=0.125 2023-06-21 06:49:49,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=904554.0, ans=0.125 2023-06-21 06:50:00,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=904614.0, ans=0.125 2023-06-21 06:50:17,685 INFO [train.py:996] (0/4) Epoch 5, batch 28800, loss[loss=0.3369, simple_loss=0.3882, pruned_loss=0.1428, over 21437.00 frames. 
], tot_loss[loss=0.2612, simple_loss=0.328, pruned_loss=0.0972, over 4279137.66 frames. ], batch size: 471, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:50:24,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=904674.0, ans=0.2 2023-06-21 06:50:37,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=904674.0, ans=0.2 2023-06-21 06:50:49,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.22 vs. limit=10.0 2023-06-21 06:51:10,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=904794.0, ans=0.95 2023-06-21 06:51:11,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=904794.0, ans=0.2 2023-06-21 06:51:36,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=904914.0, ans=0.1 2023-06-21 06:51:45,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 2.918e+02 3.315e+02 4.122e+02 9.599e+02, threshold=6.630e+02, percent-clipped=10.0 2023-06-21 06:51:46,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=904914.0, ans=0.0 2023-06-21 06:52:06,845 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:52:08,213 INFO [train.py:996] (0/4) Epoch 5, batch 28850, loss[loss=0.2698, simple_loss=0.3257, pruned_loss=0.107, over 21820.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3302, pruned_loss=0.09901, over 4283734.59 frames. ], batch size: 441, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:52:27,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=15.0 2023-06-21 06:53:21,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=905154.0, ans=0.125 2023-06-21 06:53:48,817 INFO [train.py:996] (0/4) Epoch 5, batch 28900, loss[loss=0.2436, simple_loss=0.303, pruned_loss=0.09205, over 21313.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3317, pruned_loss=0.1004, over 4285976.53 frames. ], batch size: 176, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:53:58,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.12 vs. 
limit=15.0 2023-06-21 06:54:09,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=905334.0, ans=0.125 2023-06-21 06:55:05,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=905454.0, ans=0.0 2023-06-21 06:55:16,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=905514.0, ans=0.2 2023-06-21 06:55:17,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.129e+02 3.502e+02 4.010e+02 6.253e+02, threshold=7.003e+02, percent-clipped=0.0 2023-06-21 06:55:31,413 INFO [train.py:996] (0/4) Epoch 5, batch 28950, loss[loss=0.2181, simple_loss=0.2811, pruned_loss=0.07756, over 21274.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3313, pruned_loss=0.09885, over 4281170.90 frames. ], batch size: 176, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:56:34,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=905754.0, ans=0.125 2023-06-21 06:56:58,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=905814.0, ans=0.2 2023-06-21 06:57:12,967 INFO [train.py:996] (0/4) Epoch 5, batch 29000, loss[loss=0.2953, simple_loss=0.366, pruned_loss=0.1123, over 21787.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3362, pruned_loss=0.09798, over 4279122.82 frames. ], batch size: 124, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:58:39,408 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.974e+02 3.473e+02 4.065e+02 5.758e+02, threshold=6.947e+02, percent-clipped=0.0 2023-06-21 06:58:39,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=906114.0, ans=0.125 2023-06-21 06:58:41,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=906114.0, ans=10.0 2023-06-21 06:58:52,029 INFO [train.py:996] (0/4) Epoch 5, batch 29050, loss[loss=0.258, simple_loss=0.3219, pruned_loss=0.09705, over 21811.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3349, pruned_loss=0.09898, over 4278624.91 frames. ], batch size: 441, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:59:44,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-21 07:00:00,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=906354.0, ans=0.125 2023-06-21 07:00:24,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=906414.0, ans=0.1 2023-06-21 07:00:26,694 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-21 07:00:27,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=906414.0, ans=0.04949747468305833 2023-06-21 07:00:32,321 INFO [train.py:996] (0/4) Epoch 5, batch 29100, loss[loss=0.2231, simple_loss=0.2823, pruned_loss=0.08192, over 21493.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3256, pruned_loss=0.09542, over 4269017.33 frames. 
], batch size: 132, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 07:01:10,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=906534.0, ans=0.125 2023-06-21 07:01:15,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=906594.0, ans=0.0 2023-06-21 07:01:23,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=906594.0, ans=0.1 2023-06-21 07:01:34,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=906594.0, ans=0.2 2023-06-21 07:01:35,116 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=22.5 2023-06-21 07:01:53,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=906654.0, ans=0.1 2023-06-21 07:02:00,030 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.771e+02 3.124e+02 3.774e+02 6.095e+02, threshold=6.248e+02, percent-clipped=0.0 2023-06-21 07:02:13,092 INFO [train.py:996] (0/4) Epoch 5, batch 29150, loss[loss=0.2624, simple_loss=0.3693, pruned_loss=0.07772, over 20738.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3262, pruned_loss=0.0941, over 4260760.37 frames. ], batch size: 607, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 07:02:13,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=906774.0, ans=0.125 2023-06-21 07:02:29,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=906774.0, ans=0.125 2023-06-21 07:02:32,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=906834.0, ans=0.125 2023-06-21 07:03:00,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=906894.0, ans=0.0 2023-06-21 07:03:16,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=906894.0, ans=0.0 2023-06-21 07:03:21,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-21 07:03:49,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=12.0 2023-06-21 07:03:53,193 INFO [train.py:996] (0/4) Epoch 5, batch 29200, loss[loss=0.2395, simple_loss=0.2959, pruned_loss=0.0916, over 21550.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.322, pruned_loss=0.09346, over 4252362.88 frames. ], batch size: 247, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 07:05:02,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=22.5 2023-06-21 07:05:11,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=907254.0, ans=0.0 2023-06-21 07:05:18,656 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.97 vs. 
limit=15.0 2023-06-21 07:05:22,448 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.832e+02 3.192e+02 3.760e+02 6.246e+02, threshold=6.385e+02, percent-clipped=0.0 2023-06-21 07:05:31,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=907314.0, ans=0.0 2023-06-21 07:05:40,672 INFO [train.py:996] (0/4) Epoch 5, batch 29250, loss[loss=0.2042, simple_loss=0.2814, pruned_loss=0.06354, over 21174.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3195, pruned_loss=0.09114, over 4244344.66 frames. ], batch size: 176, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:05:44,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=907374.0, ans=0.0 2023-06-21 07:06:38,851 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=12.0 2023-06-21 07:06:52,570 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:06:54,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=907554.0, ans=0.0 2023-06-21 07:07:15,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=907614.0, ans=0.1 2023-06-21 07:07:21,327 INFO [train.py:996] (0/4) Epoch 5, batch 29300, loss[loss=0.2129, simple_loss=0.2816, pruned_loss=0.07208, over 19981.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3203, pruned_loss=0.08937, over 4254917.70 frames. ], batch size: 703, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:07:51,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=907734.0, ans=0.2 2023-06-21 07:07:59,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=907734.0, ans=0.125 2023-06-21 07:08:05,778 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-21 07:08:18,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=907794.0, ans=0.1 2023-06-21 07:08:23,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=907794.0, ans=0.125 2023-06-21 07:08:27,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=907854.0, ans=0.0 2023-06-21 07:08:46,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.808e+02 3.408e+02 4.024e+02 6.878e+02, threshold=6.816e+02, percent-clipped=1.0 2023-06-21 07:09:05,119 INFO [train.py:996] (0/4) Epoch 5, batch 29350, loss[loss=0.2127, simple_loss=0.2753, pruned_loss=0.0751, over 21157.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3156, pruned_loss=0.08877, over 4247320.21 frames. ], batch size: 176, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:09:57,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.88 vs. 
limit=12.0 2023-06-21 07:10:15,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=908154.0, ans=0.0 2023-06-21 07:10:22,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-21 07:10:24,080 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-21 07:10:53,137 INFO [train.py:996] (0/4) Epoch 5, batch 29400, loss[loss=0.2111, simple_loss=0.2891, pruned_loss=0.06654, over 21712.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3161, pruned_loss=0.08697, over 4252903.28 frames. ], batch size: 298, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:11:11,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=908274.0, ans=0.125 2023-06-21 07:11:24,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=908334.0, ans=0.0 2023-06-21 07:11:47,794 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-21 07:12:00,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=908454.0, ans=0.0 2023-06-21 07:12:23,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-21 07:12:26,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.910e+02 3.307e+02 3.988e+02 6.309e+02, threshold=6.614e+02, percent-clipped=0.0 2023-06-21 07:12:34,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-21 07:12:41,840 INFO [train.py:996] (0/4) Epoch 5, batch 29450, loss[loss=0.2554, simple_loss=0.3274, pruned_loss=0.0917, over 20765.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3149, pruned_loss=0.08622, over 4251469.22 frames. ], batch size: 609, lr: 5.93e-03, grad_scale: 16.0 2023-06-21 07:13:07,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=908634.0, ans=0.125 2023-06-21 07:13:24,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5 2023-06-21 07:13:25,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=908694.0, ans=0.125 2023-06-21 07:14:11,613 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.38 vs. limit=15.0 2023-06-21 07:14:21,527 INFO [train.py:996] (0/4) Epoch 5, batch 29500, loss[loss=0.2325, simple_loss=0.2954, pruned_loss=0.08479, over 21938.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3212, pruned_loss=0.09062, over 4262592.33 frames. 
], batch size: 333, lr: 5.93e-03, grad_scale: 16.0 2023-06-21 07:15:23,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=909054.0, ans=0.125 2023-06-21 07:15:45,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.324e+02 2.991e+02 3.477e+02 4.425e+02 6.921e+02, threshold=6.954e+02, percent-clipped=2.0 2023-06-21 07:15:57,602 INFO [train.py:996] (0/4) Epoch 5, batch 29550, loss[loss=0.272, simple_loss=0.3334, pruned_loss=0.1053, over 21934.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3213, pruned_loss=0.09333, over 4276507.63 frames. ], batch size: 351, lr: 5.93e-03, grad_scale: 16.0 2023-06-21 07:16:20,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=909174.0, ans=0.125 2023-06-21 07:16:47,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=909294.0, ans=0.2 2023-06-21 07:17:02,859 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-21 07:17:43,949 INFO [train.py:996] (0/4) Epoch 5, batch 29600, loss[loss=0.295, simple_loss=0.3768, pruned_loss=0.1066, over 21819.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3283, pruned_loss=0.09625, over 4279253.56 frames. ], batch size: 282, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:18:02,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=909474.0, ans=0.95 2023-06-21 07:18:20,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=909594.0, ans=0.07 2023-06-21 07:18:43,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=15.0 2023-06-21 07:18:57,253 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-21 07:19:08,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.750e+02 3.191e+02 3.922e+02 6.335e+02, threshold=6.382e+02, percent-clipped=0.0 2023-06-21 07:19:28,028 INFO [train.py:996] (0/4) Epoch 5, batch 29650, loss[loss=0.2931, simple_loss=0.3503, pruned_loss=0.118, over 21713.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3249, pruned_loss=0.09208, over 4286702.23 frames. ], batch size: 441, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:20:14,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=909894.0, ans=0.95 2023-06-21 07:20:16,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=909894.0, ans=0.1 2023-06-21 07:20:50,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=910014.0, ans=0.0 2023-06-21 07:20:57,299 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=22.5 2023-06-21 07:21:09,381 INFO [train.py:996] (0/4) Epoch 5, batch 29700, loss[loss=0.3207, simple_loss=0.4173, pruned_loss=0.112, over 21698.00 frames. 
], tot_loss[loss=0.2557, simple_loss=0.3266, pruned_loss=0.0924, over 4283232.47 frames. ], batch size: 389, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:21:16,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=910074.0, ans=0.125 2023-06-21 07:21:38,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=910134.0, ans=0.0 2023-06-21 07:21:52,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=910194.0, ans=0.0 2023-06-21 07:21:58,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=22.5 2023-06-21 07:22:21,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=910254.0, ans=0.1 2023-06-21 07:22:29,348 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.944e+02 3.350e+02 4.483e+02 9.156e+02, threshold=6.700e+02, percent-clipped=7.0 2023-06-21 07:22:43,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=910314.0, ans=0.125 2023-06-21 07:22:50,097 INFO [train.py:996] (0/4) Epoch 5, batch 29750, loss[loss=0.2977, simple_loss=0.3889, pruned_loss=0.1033, over 21704.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3323, pruned_loss=0.09249, over 4277698.51 frames. ], batch size: 441, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:23:31,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=910494.0, ans=0.125 2023-06-21 07:23:35,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-21 07:23:40,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=15.0 2023-06-21 07:24:31,258 INFO [train.py:996] (0/4) Epoch 5, batch 29800, loss[loss=0.3098, simple_loss=0.3583, pruned_loss=0.1307, over 21637.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3322, pruned_loss=0.09264, over 4276962.59 frames. ], batch size: 471, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:25:28,712 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=22.5 2023-06-21 07:25:56,581 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.642e+02 3.044e+02 3.598e+02 6.041e+02, threshold=6.089e+02, percent-clipped=0.0 2023-06-21 07:26:11,952 INFO [train.py:996] (0/4) Epoch 5, batch 29850, loss[loss=0.2389, simple_loss=0.3071, pruned_loss=0.08538, over 21562.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3261, pruned_loss=0.08868, over 4273758.96 frames. ], batch size: 212, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:26:38,779 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.11 vs. 
limit=22.5 2023-06-21 07:26:51,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=911094.0, ans=10.0 2023-06-21 07:27:04,155 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.42 vs. limit=22.5 2023-06-21 07:27:41,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=911214.0, ans=0.0 2023-06-21 07:27:47,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=911214.0, ans=0.125 2023-06-21 07:27:52,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=911274.0, ans=0.0 2023-06-21 07:27:53,670 INFO [train.py:996] (0/4) Epoch 5, batch 29900, loss[loss=0.3748, simple_loss=0.405, pruned_loss=0.1723, over 21480.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3233, pruned_loss=0.08964, over 4284405.24 frames. ], batch size: 471, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:28:18,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=911334.0, ans=0.125 2023-06-21 07:28:27,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-21 07:29:22,828 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.918e+02 3.287e+02 3.972e+02 7.501e+02, threshold=6.574e+02, percent-clipped=2.0 2023-06-21 07:29:29,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=12.0 2023-06-21 07:29:34,454 INFO [train.py:996] (0/4) Epoch 5, batch 29950, loss[loss=0.3027, simple_loss=0.3609, pruned_loss=0.1223, over 21348.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3278, pruned_loss=0.09385, over 4283403.13 frames. ], batch size: 549, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:29:57,304 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-06-21 07:30:07,156 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.30 vs. limit=15.0 2023-06-21 07:30:48,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=911754.0, ans=0.2 2023-06-21 07:31:01,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.00 vs. limit=15.0 2023-06-21 07:31:16,742 INFO [train.py:996] (0/4) Epoch 5, batch 30000, loss[loss=0.2364, simple_loss=0.3139, pruned_loss=0.07948, over 21298.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3292, pruned_loss=0.09418, over 4281235.02 frames. 
], batch size: 159, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:31:16,744 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 07:31:27,381 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7169, 3.6799, 2.1737, 1.7087], device='cuda:0') 2023-06-21 07:31:38,133 INFO [train.py:1028] (0/4) Epoch 5, validation: loss=0.2485, simple_loss=0.3493, pruned_loss=0.0739, over 1796401.00 frames. 2023-06-21 07:31:38,134 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 07:31:40,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=911874.0, ans=0.0 2023-06-21 07:31:44,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-21 07:31:49,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=911874.0, ans=0.2 2023-06-21 07:32:28,692 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-152000.pt 2023-06-21 07:32:33,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=911994.0, ans=0.0 2023-06-21 07:33:14,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 3.033e+02 3.680e+02 4.795e+02 8.556e+02, threshold=7.360e+02, percent-clipped=8.0 2023-06-21 07:33:36,618 INFO [train.py:996] (0/4) Epoch 5, batch 30050, loss[loss=0.2385, simple_loss=0.3358, pruned_loss=0.07057, over 21721.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3345, pruned_loss=0.09185, over 4281696.09 frames. ], batch size: 298, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:33:41,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=912174.0, ans=0.0 2023-06-21 07:35:16,545 INFO [train.py:996] (0/4) Epoch 5, batch 30100, loss[loss=0.2131, simple_loss=0.2758, pruned_loss=0.07518, over 21767.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3339, pruned_loss=0.09174, over 4285432.86 frames. ], batch size: 118, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:35:36,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=912534.0, ans=0.2 2023-06-21 07:35:52,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=912594.0, ans=0.125 2023-06-21 07:36:40,819 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.174e+02 3.797e+02 4.451e+02 9.370e+02, threshold=7.593e+02, percent-clipped=1.0 2023-06-21 07:37:01,366 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:37:02,707 INFO [train.py:996] (0/4) Epoch 5, batch 30150, loss[loss=0.2672, simple_loss=0.3266, pruned_loss=0.1039, over 21751.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3308, pruned_loss=0.09425, over 4283717.30 frames. 
], batch size: 282, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:37:22,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=912834.0, ans=0.125 2023-06-21 07:37:31,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=912834.0, ans=0.125 2023-06-21 07:37:31,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=912834.0, ans=0.125 2023-06-21 07:37:49,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=912894.0, ans=0.1 2023-06-21 07:38:04,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=912954.0, ans=0.0 2023-06-21 07:38:45,880 INFO [train.py:996] (0/4) Epoch 5, batch 30200, loss[loss=0.2228, simple_loss=0.3215, pruned_loss=0.06206, over 21757.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.332, pruned_loss=0.09379, over 4278992.51 frames. ], batch size: 282, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:39:02,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-21 07:39:30,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-21 07:40:17,009 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.970e+02 3.553e+02 4.438e+02 6.781e+02, threshold=7.107e+02, percent-clipped=0.0 2023-06-21 07:40:28,579 INFO [train.py:996] (0/4) Epoch 5, batch 30250, loss[loss=0.2929, simple_loss=0.3954, pruned_loss=0.0952, over 21724.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3382, pruned_loss=0.09486, over 4280000.10 frames. ], batch size: 298, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:41:07,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=913434.0, ans=0.125 2023-06-21 07:42:08,266 INFO [train.py:996] (0/4) Epoch 5, batch 30300, loss[loss=0.2219, simple_loss=0.2795, pruned_loss=0.08218, over 21240.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3349, pruned_loss=0.09453, over 4283147.93 frames. ], batch size: 144, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:42:17,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=913674.0, ans=0.125 2023-06-21 07:43:19,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=913854.0, ans=0.2 2023-06-21 07:43:34,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 3.375e+02 4.066e+02 5.117e+02 7.478e+02, threshold=8.132e+02, percent-clipped=2.0 2023-06-21 07:43:38,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=913914.0, ans=0.1 2023-06-21 07:43:51,199 INFO [train.py:996] (0/4) Epoch 5, batch 30350, loss[loss=0.3312, simple_loss=0.416, pruned_loss=0.1232, over 21501.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3348, pruned_loss=0.09548, over 4280219.70 frames. 
], batch size: 473, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:44:43,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=914094.0, ans=0.125 2023-06-21 07:45:02,765 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-21 07:45:06,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=914214.0, ans=0.1 2023-06-21 07:45:20,110 INFO [train.py:996] (0/4) Epoch 5, batch 30400, loss[loss=0.2206, simple_loss=0.2756, pruned_loss=0.08279, over 20228.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3295, pruned_loss=0.09349, over 4270787.65 frames. ], batch size: 703, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:45:40,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=914334.0, ans=0.0 2023-06-21 07:46:35,429 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.883e+02 3.783e+02 4.866e+02 6.156e+02 1.756e+03, threshold=9.731e+02, percent-clipped=9.0 2023-06-21 07:46:37,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=914514.0, ans=0.0 2023-06-21 07:46:40,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=914514.0, ans=0.125 2023-06-21 07:46:46,045 INFO [train.py:996] (0/4) Epoch 5, batch 30450, loss[loss=0.3132, simple_loss=0.434, pruned_loss=0.09623, over 19827.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3312, pruned_loss=0.0931, over 4209318.88 frames. ], batch size: 702, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:47:39,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=914754.0, ans=0.2 2023-06-21 07:47:40,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=914754.0, ans=0.2 2023-06-21 07:47:55,590 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/epoch-5.pt 2023-06-21 07:49:37,070 INFO [train.py:996] (0/4) Epoch 6, batch 0, loss[loss=0.219, simple_loss=0.2788, pruned_loss=0.07961, over 21294.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2788, pruned_loss=0.07961, over 21294.00 frames. ], batch size: 177, lr: 5.35e-03, grad_scale: 32.0 2023-06-21 07:49:37,072 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 07:49:52,705 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2457, simple_loss=0.3531, pruned_loss=0.06922, over 1796401.00 frames. 2023-06-21 07:49:52,706 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 07:50:07,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=914838.0, ans=0.1 2023-06-21 07:50:38,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=914958.0, ans=0.125 2023-06-21 07:51:27,049 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 3.626e+02 5.784e+02 9.951e+02 2.861e+03, threshold=1.157e+03, percent-clipped=26.0 2023-06-21 07:51:28,600 INFO [train.py:996] (0/4) Epoch 6, batch 50, loss[loss=0.3628, simple_loss=0.4287, pruned_loss=0.1484, over 21489.00 frames. 
], tot_loss[loss=0.2598, simple_loss=0.3317, pruned_loss=0.09393, over 962993.18 frames. ], batch size: 471, lr: 5.35e-03, grad_scale: 32.0 2023-06-21 07:51:47,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=915198.0, ans=0.125 2023-06-21 07:53:05,449 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=15.0 2023-06-21 07:53:05,806 INFO [train.py:996] (0/4) Epoch 6, batch 100, loss[loss=0.2526, simple_loss=0.3579, pruned_loss=0.0736, over 21443.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3412, pruned_loss=0.09361, over 1686667.08 frames. ], batch size: 211, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 07:53:12,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=915438.0, ans=0.09899494936611666 2023-06-21 07:53:26,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=915498.0, ans=0.125 2023-06-21 07:53:39,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=915498.0, ans=0.125 2023-06-21 07:54:02,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=915558.0, ans=0.0 2023-06-21 07:54:41,353 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.736e+02 3.116e+02 3.564e+02 7.052e+02, threshold=6.231e+02, percent-clipped=0.0 2023-06-21 07:54:42,895 INFO [train.py:996] (0/4) Epoch 6, batch 150, loss[loss=0.2995, simple_loss=0.3652, pruned_loss=0.1169, over 21241.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3489, pruned_loss=0.09496, over 2267920.37 frames. ], batch size: 143, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 07:55:35,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=915858.0, ans=0.125 2023-06-21 07:55:49,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0 2023-06-21 07:56:03,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-21 07:56:22,172 INFO [train.py:996] (0/4) Epoch 6, batch 200, loss[loss=0.3174, simple_loss=0.3769, pruned_loss=0.1289, over 21387.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3452, pruned_loss=0.0948, over 2700422.84 frames. ], batch size: 471, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 07:56:24,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=916038.0, ans=0.125 2023-06-21 07:57:20,592 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:57:27,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-21 07:57:47,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.77 vs. 
limit=22.5 2023-06-21 07:58:01,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.028e+02 3.538e+02 4.112e+02 1.174e+03, threshold=7.076e+02, percent-clipped=8.0 2023-06-21 07:58:01,386 INFO [train.py:996] (0/4) Epoch 6, batch 250, loss[loss=0.2546, simple_loss=0.3352, pruned_loss=0.08702, over 21594.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3436, pruned_loss=0.09516, over 3051094.56 frames. ], batch size: 389, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 07:58:10,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-21 07:58:25,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=916398.0, ans=0.125 2023-06-21 07:58:30,675 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:58:54,583 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:59:39,963 INFO [train.py:996] (0/4) Epoch 6, batch 300, loss[loss=0.2137, simple_loss=0.2739, pruned_loss=0.07681, over 21369.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3391, pruned_loss=0.09462, over 3317940.67 frames. ], batch size: 131, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:00:28,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=916758.0, ans=0.1 2023-06-21 08:01:07,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=916878.0, ans=0.125 2023-06-21 08:01:12,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=916878.0, ans=0.1 2023-06-21 08:01:20,513 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 3.010e+02 3.563e+02 4.495e+02 6.815e+02, threshold=7.126e+02, percent-clipped=0.0 2023-06-21 08:01:20,542 INFO [train.py:996] (0/4) Epoch 6, batch 350, loss[loss=0.2183, simple_loss=0.3165, pruned_loss=0.06011, over 21740.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3319, pruned_loss=0.09243, over 3536830.02 frames. ], batch size: 124, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:01:25,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=916938.0, ans=0.0 2023-06-21 08:01:51,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=916998.0, ans=0.1 2023-06-21 08:02:28,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=917118.0, ans=0.125 2023-06-21 08:02:58,476 INFO [train.py:996] (0/4) Epoch 6, batch 400, loss[loss=0.2477, simple_loss=0.3678, pruned_loss=0.06383, over 20803.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3257, pruned_loss=0.08974, over 3698708.31 frames. ], batch size: 608, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 08:03:13,093 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:03:15,076 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.16 vs. 
limit=22.5 2023-06-21 08:04:28,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=15.0 2023-06-21 08:04:36,499 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.827e+02 3.421e+02 4.074e+02 6.754e+02, threshold=6.843e+02, percent-clipped=0.0 2023-06-21 08:04:36,529 INFO [train.py:996] (0/4) Epoch 6, batch 450, loss[loss=0.1891, simple_loss=0.2651, pruned_loss=0.05657, over 21299.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3198, pruned_loss=0.08804, over 3831426.26 frames. ], batch size: 176, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 08:05:04,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-21 08:05:17,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=917598.0, ans=0.125 2023-06-21 08:05:22,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=917658.0, ans=0.125 2023-06-21 08:05:54,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=917718.0, ans=0.95 2023-06-21 08:06:18,069 INFO [train.py:996] (0/4) Epoch 6, batch 500, loss[loss=0.2233, simple_loss=0.2783, pruned_loss=0.08417, over 21703.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3215, pruned_loss=0.08711, over 3938991.12 frames. ], batch size: 112, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 08:06:38,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=917898.0, ans=0.2 2023-06-21 08:06:50,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=917898.0, ans=0.125 2023-06-21 08:07:00,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-21 08:07:51,200 INFO [train.py:996] (0/4) Epoch 6, batch 550, loss[loss=0.2342, simple_loss=0.3094, pruned_loss=0.07953, over 19935.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3246, pruned_loss=0.08706, over 4021145.86 frames. ], batch size: 704, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:07:57,429 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.995e+02 3.563e+02 4.699e+02 8.861e+02, threshold=7.125e+02, percent-clipped=10.0 2023-06-21 08:09:01,877 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-21 08:09:31,332 INFO [train.py:996] (0/4) Epoch 6, batch 600, loss[loss=0.2445, simple_loss=0.3126, pruned_loss=0.08816, over 21366.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3246, pruned_loss=0.08627, over 4076144.14 frames. ], batch size: 176, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:09:57,822 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:10:19,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=918558.0, ans=0.035 2023-06-21 08:11:09,687 INFO [train.py:996] (0/4) Epoch 6, batch 650, loss[loss=0.2486, simple_loss=0.3158, pruned_loss=0.09077, over 21909.00 frames. 
], tot_loss[loss=0.2477, simple_loss=0.3242, pruned_loss=0.08564, over 4128943.77 frames. ], batch size: 414, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:11:10,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=918738.0, ans=0.125 2023-06-21 08:11:11,269 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.881e+02 3.396e+02 3.907e+02 7.469e+02, threshold=6.792e+02, percent-clipped=1.0 2023-06-21 08:11:48,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=918798.0, ans=0.1 2023-06-21 08:12:17,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=918918.0, ans=0.0 2023-06-21 08:12:19,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=918918.0, ans=0.95 2023-06-21 08:12:21,316 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=15.0 2023-06-21 08:12:41,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=919038.0, ans=0.2 2023-06-21 08:12:42,322 INFO [train.py:996] (0/4) Epoch 6, batch 700, loss[loss=0.252, simple_loss=0.3885, pruned_loss=0.0577, over 19743.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3257, pruned_loss=0.08705, over 4160546.27 frames. ], batch size: 703, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:13:03,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=919038.0, ans=0.125 2023-06-21 08:13:11,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=919098.0, ans=0.125 2023-06-21 08:13:57,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=919218.0, ans=0.0 2023-06-21 08:14:20,571 INFO [train.py:996] (0/4) Epoch 6, batch 750, loss[loss=0.2634, simple_loss=0.3419, pruned_loss=0.09246, over 21651.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3247, pruned_loss=0.08824, over 4188235.38 frames. ], batch size: 230, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:14:26,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.276e+02 4.088e+02 4.962e+02 1.159e+03, threshold=8.176e+02, percent-clipped=5.0 2023-06-21 08:14:27,701 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=22.5 2023-06-21 08:14:52,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=919398.0, ans=0.125 2023-06-21 08:15:05,833 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-06-21 08:15:43,689 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-21 08:15:58,058 INFO [train.py:996] (0/4) Epoch 6, batch 800, loss[loss=0.2257, simple_loss=0.3009, pruned_loss=0.07528, over 21707.00 frames. 
], tot_loss[loss=0.2483, simple_loss=0.3207, pruned_loss=0.08794, over 4203972.19 frames. ], batch size: 298, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:16:48,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=919758.0, ans=0.0 2023-06-21 08:16:56,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=919758.0, ans=0.1 2023-06-21 08:17:21,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=919878.0, ans=0.0 2023-06-21 08:17:22,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=919878.0, ans=0.125 2023-06-21 08:17:35,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=919878.0, ans=0.2 2023-06-21 08:17:38,709 INFO [train.py:996] (0/4) Epoch 6, batch 850, loss[loss=0.262, simple_loss=0.3863, pruned_loss=0.06885, over 20798.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.32, pruned_loss=0.08804, over 4217470.01 frames. ], batch size: 608, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:17:40,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 2.947e+02 3.491e+02 3.933e+02 7.622e+02, threshold=6.983e+02, percent-clipped=0.0 2023-06-21 08:18:09,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=919998.0, ans=0.125 2023-06-21 08:18:33,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=920058.0, ans=0.5 2023-06-21 08:18:55,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=920118.0, ans=0.125 2023-06-21 08:19:03,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=920178.0, ans=0.125 2023-06-21 08:19:07,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-21 08:19:21,812 INFO [train.py:996] (0/4) Epoch 6, batch 900, loss[loss=0.2681, simple_loss=0.3358, pruned_loss=0.1003, over 21851.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3178, pruned_loss=0.08735, over 4232570.79 frames. ], batch size: 371, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:19:57,759 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:20:28,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=920418.0, ans=0.125 2023-06-21 08:21:05,304 INFO [train.py:996] (0/4) Epoch 6, batch 950, loss[loss=0.1977, simple_loss=0.2717, pruned_loss=0.06188, over 21290.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3151, pruned_loss=0.08701, over 4249526.52 frames. 
], batch size: 176, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:21:06,939 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.884e+02 3.289e+02 4.152e+02 6.570e+02, threshold=6.579e+02, percent-clipped=0.0 2023-06-21 08:21:21,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=920598.0, ans=0.1 2023-06-21 08:21:46,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=920658.0, ans=0.125 2023-06-21 08:22:39,444 INFO [train.py:996] (0/4) Epoch 6, batch 1000, loss[loss=0.225, simple_loss=0.2835, pruned_loss=0.08327, over 21193.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3142, pruned_loss=0.08685, over 4260156.68 frames. ], batch size: 548, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:22:52,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=920838.0, ans=0.125 2023-06-21 08:23:06,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-21 08:23:28,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=920958.0, ans=0.125 2023-06-21 08:24:13,775 INFO [train.py:996] (0/4) Epoch 6, batch 1050, loss[loss=0.243, simple_loss=0.3343, pruned_loss=0.07584, over 21793.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3155, pruned_loss=0.08723, over 4271689.27 frames. ], batch size: 371, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:24:15,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.022e+02 3.396e+02 3.710e+02 5.985e+02, threshold=6.792e+02, percent-clipped=0.0 2023-06-21 08:24:44,823 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=10.0 2023-06-21 08:25:16,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=921318.0, ans=0.2 2023-06-21 08:25:42,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=22.5 2023-06-21 08:25:48,845 INFO [train.py:996] (0/4) Epoch 6, batch 1100, loss[loss=0.2124, simple_loss=0.2945, pruned_loss=0.06516, over 21639.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3149, pruned_loss=0.0859, over 4270801.60 frames. ], batch size: 263, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:25:50,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-21 08:25:57,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=921438.0, ans=0.125 2023-06-21 08:26:25,967 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.50 vs. limit=15.0 2023-06-21 08:27:25,431 INFO [train.py:996] (0/4) Epoch 6, batch 1150, loss[loss=0.2348, simple_loss=0.2928, pruned_loss=0.08841, over 16664.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3172, pruned_loss=0.08667, over 4275063.01 frames. 
], batch size: 60, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:27:28,760 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 3.136e+02 3.809e+02 5.209e+02 8.344e+02, threshold=7.619e+02, percent-clipped=5.0 2023-06-21 08:29:05,595 INFO [train.py:996] (0/4) Epoch 6, batch 1200, loss[loss=0.238, simple_loss=0.2955, pruned_loss=0.09024, over 21188.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3185, pruned_loss=0.08762, over 4278730.80 frames. ], batch size: 608, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:29:07,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=922038.0, ans=0.125 2023-06-21 08:29:20,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=922038.0, ans=0.0 2023-06-21 08:29:35,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=922098.0, ans=0.0 2023-06-21 08:30:34,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=22.5 2023-06-21 08:30:37,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=922278.0, ans=0.2 2023-06-21 08:30:44,979 INFO [train.py:996] (0/4) Epoch 6, batch 1250, loss[loss=0.2551, simple_loss=0.3204, pruned_loss=0.09497, over 21841.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3217, pruned_loss=0.08962, over 4276588.38 frames. ], batch size: 107, lr: 5.32e-03, grad_scale: 32.0 2023-06-21 08:30:47,959 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.807e+02 3.082e+02 3.703e+02 6.160e+02, threshold=6.164e+02, percent-clipped=0.0 2023-06-21 08:30:53,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=922338.0, ans=0.125 2023-06-21 08:31:25,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=922458.0, ans=0.125 2023-06-21 08:32:00,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=922518.0, ans=0.0 2023-06-21 08:32:19,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=922578.0, ans=0.2 2023-06-21 08:32:25,635 INFO [train.py:996] (0/4) Epoch 6, batch 1300, loss[loss=0.2393, simple_loss=0.3019, pruned_loss=0.08841, over 21435.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3234, pruned_loss=0.09006, over 4277904.17 frames. ], batch size: 131, lr: 5.32e-03, grad_scale: 32.0 2023-06-21 08:32:35,027 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.77 vs. 
limit=6.0 2023-06-21 08:32:51,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=922698.0, ans=0.0 2023-06-21 08:32:52,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=922698.0, ans=0.0 2023-06-21 08:33:19,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=922758.0, ans=0.125 2023-06-21 08:34:12,139 INFO [train.py:996] (0/4) Epoch 6, batch 1350, loss[loss=0.2452, simple_loss=0.3252, pruned_loss=0.08261, over 21636.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3241, pruned_loss=0.08962, over 4274083.09 frames. ], batch size: 230, lr: 5.32e-03, grad_scale: 32.0 2023-06-21 08:34:15,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 2.950e+02 3.402e+02 4.327e+02 7.422e+02, threshold=6.804e+02, percent-clipped=3.0 2023-06-21 08:34:20,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=922938.0, ans=0.0 2023-06-21 08:34:38,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=922998.0, ans=0.125 2023-06-21 08:34:51,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=923058.0, ans=0.05 2023-06-21 08:35:48,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=923178.0, ans=0.125 2023-06-21 08:35:50,724 INFO [train.py:996] (0/4) Epoch 6, batch 1400, loss[loss=0.2376, simple_loss=0.3007, pruned_loss=0.08723, over 21800.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3219, pruned_loss=0.09008, over 4279696.86 frames. ], batch size: 98, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:36:00,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=923238.0, ans=0.1 2023-06-21 08:36:30,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=923358.0, ans=0.125 2023-06-21 08:36:52,359 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.35 vs. limit=15.0 2023-06-21 08:37:06,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=923418.0, ans=0.125 2023-06-21 08:37:31,197 INFO [train.py:996] (0/4) Epoch 6, batch 1450, loss[loss=0.2261, simple_loss=0.2846, pruned_loss=0.08374, over 21650.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3227, pruned_loss=0.09103, over 4278258.51 frames. 
], batch size: 415, lr: 5.32e-03, grad_scale: 8.0 2023-06-21 08:37:37,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.855e+02 3.384e+02 3.937e+02 6.877e+02, threshold=6.768e+02, percent-clipped=1.0 2023-06-21 08:37:48,933 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:38:26,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=923658.0, ans=0.05 2023-06-21 08:38:51,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=923778.0, ans=0.125 2023-06-21 08:39:11,674 INFO [train.py:996] (0/4) Epoch 6, batch 1500, loss[loss=0.2648, simple_loss=0.3236, pruned_loss=0.1031, over 21336.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3255, pruned_loss=0.09233, over 4279661.87 frames. ], batch size: 159, lr: 5.32e-03, grad_scale: 8.0 2023-06-21 08:40:14,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.65 vs. limit=12.0 2023-06-21 08:40:17,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=924018.0, ans=0.1 2023-06-21 08:40:20,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=924018.0, ans=0.0 2023-06-21 08:40:53,827 INFO [train.py:996] (0/4) Epoch 6, batch 1550, loss[loss=0.2363, simple_loss=0.3011, pruned_loss=0.08577, over 21494.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3235, pruned_loss=0.09077, over 4279548.44 frames. ], batch size: 131, lr: 5.32e-03, grad_scale: 8.0 2023-06-21 08:41:00,463 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.808e+02 3.171e+02 3.740e+02 6.860e+02, threshold=6.342e+02, percent-clipped=1.0 2023-06-21 08:41:23,509 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.11 vs. limit=12.0 2023-06-21 08:41:36,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.43 vs. limit=10.0 2023-06-21 08:42:36,175 INFO [train.py:996] (0/4) Epoch 6, batch 1600, loss[loss=0.243, simple_loss=0.3119, pruned_loss=0.0871, over 21801.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3231, pruned_loss=0.09138, over 4275914.44 frames. ], batch size: 316, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:44:19,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=924678.0, ans=0.125 2023-06-21 08:44:25,214 INFO [train.py:996] (0/4) Epoch 6, batch 1650, loss[loss=0.2453, simple_loss=0.3117, pruned_loss=0.08945, over 21746.00 frames. ], tot_loss[loss=0.252, simple_loss=0.322, pruned_loss=0.09103, over 4274710.27 frames. 
], batch size: 389, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:44:31,641 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.183e+02 3.962e+02 4.475e+02 7.912e+02, threshold=7.925e+02, percent-clipped=6.0 2023-06-21 08:44:41,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=924738.0, ans=0.125 2023-06-21 08:44:59,722 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:45:06,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=924858.0, ans=0.0 2023-06-21 08:45:10,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.06 vs. limit=15.0 2023-06-21 08:45:50,093 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-21 08:46:07,222 INFO [train.py:996] (0/4) Epoch 6, batch 1700, loss[loss=0.2672, simple_loss=0.3396, pruned_loss=0.09738, over 21694.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3254, pruned_loss=0.09212, over 4278557.51 frames. ], batch size: 351, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:46:38,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=925098.0, ans=0.0 2023-06-21 08:46:46,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=925098.0, ans=0.2 2023-06-21 08:46:58,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-21 08:47:14,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=925218.0, ans=0.125 2023-06-21 08:47:22,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=925218.0, ans=0.2 2023-06-21 08:47:41,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5 2023-06-21 08:47:54,721 INFO [train.py:996] (0/4) Epoch 6, batch 1750, loss[loss=0.1799, simple_loss=0.2768, pruned_loss=0.04155, over 21710.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.324, pruned_loss=0.08922, over 4275355.88 frames. 
], batch size: 332, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:48:05,879 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.133e+02 3.705e+02 4.363e+02 7.096e+02, threshold=7.410e+02, percent-clipped=0.0 2023-06-21 08:48:09,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=925338.0, ans=0.2 2023-06-21 08:48:12,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=925338.0, ans=0.125 2023-06-21 08:48:52,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=925458.0, ans=0.125 2023-06-21 08:49:24,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=925578.0, ans=0.1 2023-06-21 08:49:26,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=925578.0, ans=15.0 2023-06-21 08:49:43,018 INFO [train.py:996] (0/4) Epoch 6, batch 1800, loss[loss=0.206, simple_loss=0.2976, pruned_loss=0.05719, over 21290.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3224, pruned_loss=0.08621, over 4274737.51 frames. ], batch size: 176, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:49:50,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=925638.0, ans=0.125 2023-06-21 08:49:50,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=925638.0, ans=0.2 2023-06-21 08:50:05,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=925698.0, ans=0.125 2023-06-21 08:51:22,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=925938.0, ans=0.0 2023-06-21 08:51:23,703 INFO [train.py:996] (0/4) Epoch 6, batch 1850, loss[loss=0.2479, simple_loss=0.3224, pruned_loss=0.0867, over 21512.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3227, pruned_loss=0.08428, over 4274942.94 frames. ], batch size: 441, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 08:51:30,093 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.862e+02 3.405e+02 4.274e+02 8.543e+02, threshold=6.809e+02, percent-clipped=2.0 2023-06-21 08:51:54,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=925998.0, ans=0.1 2023-06-21 08:52:07,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2023-06-21 08:52:25,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.16 vs. limit=15.0 2023-06-21 08:52:33,886 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.34 vs. 
limit=22.5 2023-06-21 08:52:51,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=926178.0, ans=0.125 2023-06-21 08:52:55,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=926178.0, ans=0.0 2023-06-21 08:53:04,636 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-21 08:53:05,036 INFO [train.py:996] (0/4) Epoch 6, batch 1900, loss[loss=0.2097, simple_loss=0.2849, pruned_loss=0.06722, over 21763.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.324, pruned_loss=0.08613, over 4276696.34 frames. ], batch size: 112, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 08:53:21,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=926238.0, ans=0.125 2023-06-21 08:53:45,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=926358.0, ans=0.125 2023-06-21 08:53:53,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=926358.0, ans=0.125 2023-06-21 08:54:39,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=926478.0, ans=0.125 2023-06-21 08:54:48,434 INFO [train.py:996] (0/4) Epoch 6, batch 1950, loss[loss=0.2364, simple_loss=0.2886, pruned_loss=0.09206, over 21582.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3206, pruned_loss=0.08551, over 4253215.13 frames. ], batch size: 415, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 08:54:55,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.902e+02 3.428e+02 4.161e+02 7.529e+02, threshold=6.855e+02, percent-clipped=4.0 2023-06-21 08:55:08,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=926598.0, ans=0.125 2023-06-21 08:55:50,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-21 08:56:21,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=926778.0, ans=0.0 2023-06-21 08:56:27,464 INFO [train.py:996] (0/4) Epoch 6, batch 2000, loss[loss=0.2032, simple_loss=0.2756, pruned_loss=0.06537, over 21644.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3144, pruned_loss=0.08326, over 4264166.08 frames. ], batch size: 247, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 08:56:31,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=926838.0, ans=0.04949747468305833 2023-06-21 08:56:31,964 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. 
limit=22.5 2023-06-21 08:57:06,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=926958.0, ans=0.125 2023-06-21 08:57:46,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=927018.0, ans=0.125 2023-06-21 08:58:07,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=927138.0, ans=0.0 2023-06-21 08:58:08,473 INFO [train.py:996] (0/4) Epoch 6, batch 2050, loss[loss=0.2102, simple_loss=0.2819, pruned_loss=0.06921, over 21630.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3164, pruned_loss=0.08356, over 4267307.70 frames. ], batch size: 298, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 08:58:19,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.151e+02 3.657e+02 4.300e+02 8.922e+02, threshold=7.314e+02, percent-clipped=4.0 2023-06-21 08:58:27,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=927138.0, ans=0.125 2023-06-21 08:59:06,923 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-06-21 08:59:27,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=927318.0, ans=0.125 2023-06-21 08:59:45,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=927378.0, ans=0.125 2023-06-21 08:59:49,728 INFO [train.py:996] (0/4) Epoch 6, batch 2100, loss[loss=0.2436, simple_loss=0.3513, pruned_loss=0.06799, over 21171.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3174, pruned_loss=0.08554, over 4272964.19 frames. ], batch size: 548, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 08:59:58,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.89 vs. limit=8.0 2023-06-21 09:00:15,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=927498.0, ans=0.125 2023-06-21 09:01:31,406 INFO [train.py:996] (0/4) Epoch 6, batch 2150, loss[loss=0.2767, simple_loss=0.3302, pruned_loss=0.1116, over 21598.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3191, pruned_loss=0.08765, over 4279208.06 frames. 
], batch size: 441, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 09:01:39,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=927738.0, ans=0.125 2023-06-21 09:01:42,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=927738.0, ans=0.0 2023-06-21 09:01:43,430 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 2.848e+02 3.301e+02 4.038e+02 6.672e+02, threshold=6.603e+02, percent-clipped=0.0 2023-06-21 09:01:48,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=927738.0, ans=0.05 2023-06-21 09:02:01,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=927798.0, ans=0.0 2023-06-21 09:02:03,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=927798.0, ans=0.125 2023-06-21 09:02:56,876 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-21 09:03:04,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=927978.0, ans=0.0 2023-06-21 09:03:13,683 INFO [train.py:996] (0/4) Epoch 6, batch 2200, loss[loss=0.2826, simple_loss=0.3549, pruned_loss=0.1051, over 21459.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3224, pruned_loss=0.08905, over 4279287.19 frames. ], batch size: 211, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:03:21,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=928038.0, ans=0.0 2023-06-21 09:03:22,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=928038.0, ans=0.125 2023-06-21 09:04:05,733 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.02 vs. limit=22.5 2023-06-21 09:04:06,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=928158.0, ans=0.125 2023-06-21 09:04:08,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.70 vs. limit=15.0 2023-06-21 09:04:14,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-21 09:04:36,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=15.0 2023-06-21 09:04:53,488 INFO [train.py:996] (0/4) Epoch 6, batch 2250, loss[loss=0.188, simple_loss=0.2527, pruned_loss=0.06167, over 21403.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3207, pruned_loss=0.08746, over 4281440.19 frames. 
], batch size: 131, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:05:06,971 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.750e+02 3.169e+02 3.694e+02 5.600e+02, threshold=6.338e+02, percent-clipped=0.0 2023-06-21 09:05:23,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=928398.0, ans=0.04949747468305833 2023-06-21 09:05:30,925 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.28 vs. limit=15.0 2023-06-21 09:06:35,972 INFO [train.py:996] (0/4) Epoch 6, batch 2300, loss[loss=0.263, simple_loss=0.3167, pruned_loss=0.1046, over 21818.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3165, pruned_loss=0.08665, over 4278731.14 frames. ], batch size: 352, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:07:12,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=928758.0, ans=0.05 2023-06-21 09:07:41,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=928818.0, ans=0.125 2023-06-21 09:08:17,152 INFO [train.py:996] (0/4) Epoch 6, batch 2350, loss[loss=0.2677, simple_loss=0.3253, pruned_loss=0.1051, over 21254.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3128, pruned_loss=0.0867, over 4280845.57 frames. ], batch size: 159, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:08:25,731 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.345e+02 4.237e+02 6.014e+02 1.096e+03, threshold=8.474e+02, percent-clipped=18.0 2023-06-21 09:08:42,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=928998.0, ans=0.1 2023-06-21 09:08:53,325 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=22.5 2023-06-21 09:09:55,380 INFO [train.py:996] (0/4) Epoch 6, batch 2400, loss[loss=0.2654, simple_loss=0.3376, pruned_loss=0.0966, over 21718.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3186, pruned_loss=0.08937, over 4278924.52 frames. ], batch size: 332, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 09:10:18,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=929298.0, ans=0.2 2023-06-21 09:10:18,960 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-21 09:10:22,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=929298.0, ans=0.125 2023-06-21 09:10:48,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=929358.0, ans=0.1 2023-06-21 09:10:53,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=929418.0, ans=0.2 2023-06-21 09:11:07,465 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.95 vs. 
limit=22.5 2023-06-21 09:11:23,205 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-06-21 09:11:33,354 INFO [train.py:996] (0/4) Epoch 6, batch 2450, loss[loss=0.2263, simple_loss=0.2969, pruned_loss=0.07792, over 15213.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3229, pruned_loss=0.09261, over 4269949.82 frames. ], batch size: 60, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:11:41,394 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.479e+02 3.127e+02 3.688e+02 4.498e+02 8.076e+02, threshold=7.375e+02, percent-clipped=0.0 2023-06-21 09:12:57,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=929778.0, ans=0.0 2023-06-21 09:13:13,626 INFO [train.py:996] (0/4) Epoch 6, batch 2500, loss[loss=0.2609, simple_loss=0.3024, pruned_loss=0.1097, over 21366.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3228, pruned_loss=0.09229, over 4270375.93 frames. ], batch size: 508, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:14:30,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=930078.0, ans=0.125 2023-06-21 09:14:31,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=22.5 2023-06-21 09:14:40,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=930078.0, ans=0.125 2023-06-21 09:14:47,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=930078.0, ans=0.2 2023-06-21 09:14:47,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=930078.0, ans=0.125 2023-06-21 09:14:49,968 INFO [train.py:996] (0/4) Epoch 6, batch 2550, loss[loss=0.2367, simple_loss=0.3019, pruned_loss=0.08578, over 15108.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3212, pruned_loss=0.09032, over 4262186.32 frames. ], batch size: 60, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:14:58,268 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.790e+02 3.218e+02 3.631e+02 5.360e+02, threshold=6.436e+02, percent-clipped=0.0 2023-06-21 09:15:11,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=930198.0, ans=0.125 2023-06-21 09:15:39,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=930258.0, ans=0.2 2023-06-21 09:15:46,147 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2023-06-21 09:15:53,299 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-21 09:16:31,242 INFO [train.py:996] (0/4) Epoch 6, batch 2600, loss[loss=0.214, simple_loss=0.2833, pruned_loss=0.07239, over 21587.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3223, pruned_loss=0.09104, over 4256822.09 frames. 
], batch size: 263, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:17:27,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=930558.0, ans=0.2 2023-06-21 09:17:40,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=930618.0, ans=0.2 2023-06-21 09:17:42,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=930618.0, ans=0.125 2023-06-21 09:17:47,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=930678.0, ans=0.125 2023-06-21 09:17:47,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=930678.0, ans=0.2 2023-06-21 09:17:59,956 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-21 09:18:02,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=930678.0, ans=0.125 2023-06-21 09:18:09,182 INFO [train.py:996] (0/4) Epoch 6, batch 2650, loss[loss=0.2469, simple_loss=0.3086, pruned_loss=0.09259, over 21392.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3244, pruned_loss=0.09205, over 4267935.05 frames. ], batch size: 143, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:18:09,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=930738.0, ans=0.035 2023-06-21 09:18:16,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.474e+02 3.039e+02 3.537e+02 4.396e+02 7.352e+02, threshold=7.074e+02, percent-clipped=6.0 2023-06-21 09:18:17,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=930738.0, ans=0.125 2023-06-21 09:18:57,961 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=22.5 2023-06-21 09:19:52,605 INFO [train.py:996] (0/4) Epoch 6, batch 2700, loss[loss=0.272, simple_loss=0.3457, pruned_loss=0.09918, over 21621.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3231, pruned_loss=0.09181, over 4269862.98 frames. ], batch size: 389, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:20:05,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=931038.0, ans=0.1 2023-06-21 09:20:11,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.10 vs. 
limit=15.0 2023-06-21 09:20:40,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=931158.0, ans=0.125 2023-06-21 09:20:49,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931158.0, ans=0.1 2023-06-21 09:21:27,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931278.0, ans=0.1 2023-06-21 09:21:30,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=931278.0, ans=0.125 2023-06-21 09:21:34,648 INFO [train.py:996] (0/4) Epoch 6, batch 2750, loss[loss=0.2868, simple_loss=0.345, pruned_loss=0.1143, over 21741.00 frames. ], tot_loss[loss=0.253, simple_loss=0.322, pruned_loss=0.09201, over 4274731.89 frames. ], batch size: 112, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:21:38,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=931338.0, ans=0.05 2023-06-21 09:21:42,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.421e+02 2.939e+02 3.495e+02 4.251e+02 6.748e+02, threshold=6.989e+02, percent-clipped=0.0 2023-06-21 09:21:44,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-21 09:21:45,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=931338.0, ans=0.0 2023-06-21 09:23:15,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=931578.0, ans=0.125 2023-06-21 09:23:20,473 INFO [train.py:996] (0/4) Epoch 6, batch 2800, loss[loss=0.2599, simple_loss=0.3497, pruned_loss=0.08511, over 21404.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3267, pruned_loss=0.09248, over 4272173.38 frames. ], batch size: 211, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:23:26,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=931638.0, ans=0.125 2023-06-21 09:24:26,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=931758.0, ans=0.0 2023-06-21 09:25:03,051 INFO [train.py:996] (0/4) Epoch 6, batch 2850, loss[loss=0.2782, simple_loss=0.3371, pruned_loss=0.1097, over 21728.00 frames. ], tot_loss[loss=0.258, simple_loss=0.328, pruned_loss=0.09401, over 4272960.23 frames. 
], batch size: 298, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:25:05,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=931938.0, ans=0.125 2023-06-21 09:25:06,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=931938.0, ans=0.2 2023-06-21 09:25:23,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.345e+02 3.230e+02 3.892e+02 4.894e+02 8.283e+02, threshold=7.785e+02, percent-clipped=6.0 2023-06-21 09:26:35,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=932178.0, ans=0.2 2023-06-21 09:26:45,936 INFO [train.py:996] (0/4) Epoch 6, batch 2900, loss[loss=0.2848, simple_loss=0.3339, pruned_loss=0.1179, over 21733.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3252, pruned_loss=0.09348, over 4281723.83 frames. ], batch size: 473, lr: 5.30e-03, grad_scale: 16.0 2023-06-21 09:27:47,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=932358.0, ans=0.025 2023-06-21 09:28:12,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932478.0, ans=0.1 2023-06-21 09:28:26,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=932538.0, ans=0.125 2023-06-21 09:28:28,481 INFO [train.py:996] (0/4) Epoch 6, batch 2950, loss[loss=0.242, simple_loss=0.3139, pruned_loss=0.08509, over 21868.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3262, pruned_loss=0.09379, over 4290942.09 frames. ], batch size: 118, lr: 5.30e-03, grad_scale: 16.0 2023-06-21 09:28:42,916 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.364e+02 2.961e+02 3.319e+02 4.000e+02 7.696e+02, threshold=6.638e+02, percent-clipped=0.0 2023-06-21 09:29:04,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932598.0, ans=0.1 2023-06-21 09:29:17,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=932658.0, ans=0.07 2023-06-21 09:29:19,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=932658.0, ans=0.05 2023-06-21 09:29:39,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=932718.0, ans=0.125 2023-06-21 09:30:11,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=932778.0, ans=0.125 2023-06-21 09:30:14,720 INFO [train.py:996] (0/4) Epoch 6, batch 3000, loss[loss=0.3093, simple_loss=0.3707, pruned_loss=0.1239, over 21787.00 frames. ], tot_loss[loss=0.259, simple_loss=0.33, pruned_loss=0.094, over 4295370.74 frames. ], batch size: 441, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:30:14,722 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 09:30:34,695 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.255, simple_loss=0.3481, pruned_loss=0.08099, over 1796401.00 frames. 
2023-06-21 09:30:34,696 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 09:30:47,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=932838.0, ans=0.0 2023-06-21 09:30:57,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=932898.0, ans=0.125 2023-06-21 09:31:25,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=932958.0, ans=0.0 2023-06-21 09:31:35,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=933018.0, ans=0.1 2023-06-21 09:32:16,775 INFO [train.py:996] (0/4) Epoch 6, batch 3050, loss[loss=0.2387, simple_loss=0.3176, pruned_loss=0.07989, over 21725.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3292, pruned_loss=0.09141, over 4290141.45 frames. ], batch size: 414, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:32:26,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 2.903e+02 3.413e+02 4.363e+02 7.333e+02, threshold=6.826e+02, percent-clipped=2.0 2023-06-21 09:33:11,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=933318.0, ans=0.0 2023-06-21 09:33:59,325 INFO [train.py:996] (0/4) Epoch 6, batch 3100, loss[loss=0.2405, simple_loss=0.3353, pruned_loss=0.0729, over 21588.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3293, pruned_loss=0.09102, over 4296587.46 frames. ], batch size: 389, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:34:41,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=933558.0, ans=0.125 2023-06-21 09:35:40,907 INFO [train.py:996] (0/4) Epoch 6, batch 3150, loss[loss=0.2792, simple_loss=0.348, pruned_loss=0.1052, over 21237.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3302, pruned_loss=0.09106, over 4291445.66 frames. ], batch size: 176, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:35:41,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=933738.0, ans=0.125 2023-06-21 09:35:55,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 3.015e+02 3.533e+02 4.107e+02 6.510e+02, threshold=7.067e+02, percent-clipped=0.0 2023-06-21 09:36:29,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=933858.0, ans=0.2 2023-06-21 09:37:13,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=933978.0, ans=0.0 2023-06-21 09:37:22,869 INFO [train.py:996] (0/4) Epoch 6, batch 3200, loss[loss=0.2271, simple_loss=0.3119, pruned_loss=0.07116, over 21712.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3308, pruned_loss=0.09078, over 4286846.65 frames. 
], batch size: 298, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:37:36,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=934038.0, ans=0.125 2023-06-21 09:37:42,985 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:37:44,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=934098.0, ans=0.0 2023-06-21 09:37:59,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-21 09:38:37,450 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-21 09:38:49,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=934278.0, ans=0.125 2023-06-21 09:38:54,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=934278.0, ans=0.125 2023-06-21 09:39:02,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=934278.0, ans=0.1 2023-06-21 09:39:08,413 INFO [train.py:996] (0/4) Epoch 6, batch 3250, loss[loss=0.1967, simple_loss=0.2508, pruned_loss=0.07124, over 20782.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3314, pruned_loss=0.0919, over 4283395.82 frames. ], batch size: 609, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:39:17,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=934338.0, ans=0.125 2023-06-21 09:39:18,214 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.861e+02 3.274e+02 3.932e+02 7.956e+02, threshold=6.547e+02, percent-clipped=1.0 2023-06-21 09:39:18,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=934338.0, ans=0.2 2023-06-21 09:39:23,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=934398.0, ans=0.125 2023-06-21 09:39:38,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=934398.0, ans=0.125 2023-06-21 09:40:49,802 INFO [train.py:996] (0/4) Epoch 6, batch 3300, loss[loss=0.2202, simple_loss=0.2872, pruned_loss=0.07657, over 21628.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3248, pruned_loss=0.09152, over 4277944.81 frames. ], batch size: 282, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:41:03,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=934638.0, ans=0.0 2023-06-21 09:41:18,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=934698.0, ans=0.125 2023-06-21 09:41:22,500 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-21 09:41:27,214 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.98 vs. 
limit=5.0 2023-06-21 09:42:05,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=22.5 2023-06-21 09:42:17,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-21 09:42:30,703 INFO [train.py:996] (0/4) Epoch 6, batch 3350, loss[loss=0.248, simple_loss=0.32, pruned_loss=0.08797, over 21371.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3266, pruned_loss=0.09159, over 4272173.69 frames. ], batch size: 176, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:42:45,023 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.921e+02 3.413e+02 3.921e+02 6.338e+02, threshold=6.826e+02, percent-clipped=0.0 2023-06-21 09:42:45,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=934938.0, ans=0.1 2023-06-21 09:42:45,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=934938.0, ans=0.0 2023-06-21 09:42:47,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=934938.0, ans=0.125 2023-06-21 09:42:54,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=934938.0, ans=0.1 2023-06-21 09:43:07,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=934998.0, ans=0.125 2023-06-21 09:43:20,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=935058.0, ans=0.125 2023-06-21 09:43:53,432 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:44:17,670 INFO [train.py:996] (0/4) Epoch 6, batch 3400, loss[loss=0.2443, simple_loss=0.3125, pruned_loss=0.08804, over 21536.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3268, pruned_loss=0.09238, over 4273751.05 frames. ], batch size: 195, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:44:53,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=935298.0, ans=0.1 2023-06-21 09:45:17,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=935418.0, ans=0.125 2023-06-21 09:45:17,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=935418.0, ans=0.125 2023-06-21 09:45:29,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=935418.0, ans=0.1 2023-06-21 09:45:54,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-21 09:46:04,893 INFO [train.py:996] (0/4) Epoch 6, batch 3450, loss[loss=0.2033, simple_loss=0.2593, pruned_loss=0.07368, over 21469.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.321, pruned_loss=0.09068, over 4271987.55 frames. 
], batch size: 212, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:46:16,695 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.951e+02 3.358e+02 4.026e+02 6.824e+02, threshold=6.715e+02, percent-clipped=0.0 2023-06-21 09:47:22,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=935718.0, ans=0.0 2023-06-21 09:47:23,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=15.0 2023-06-21 09:47:47,077 INFO [train.py:996] (0/4) Epoch 6, batch 3500, loss[loss=0.2599, simple_loss=0.3381, pruned_loss=0.09081, over 21373.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3312, pruned_loss=0.09522, over 4277026.46 frames. ], batch size: 549, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:48:14,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=935898.0, ans=0.0 2023-06-21 09:48:33,630 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-156000.pt 2023-06-21 09:48:45,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=936018.0, ans=0.125 2023-06-21 09:48:46,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=936018.0, ans=0.125 2023-06-21 09:49:24,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=936078.0, ans=0.125 2023-06-21 09:49:28,524 INFO [train.py:996] (0/4) Epoch 6, batch 3550, loss[loss=0.2308, simple_loss=0.2967, pruned_loss=0.08243, over 21687.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3341, pruned_loss=0.09706, over 4281281.81 frames. ], batch size: 282, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:49:28,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=936138.0, ans=0.125 2023-06-21 09:49:30,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=936138.0, ans=0.2 2023-06-21 09:49:30,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=936138.0, ans=0.1 2023-06-21 09:49:44,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.567e+02 3.119e+02 3.460e+02 4.086e+02 7.821e+02, threshold=6.921e+02, percent-clipped=5.0 2023-06-21 09:49:49,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=936198.0, ans=0.125 2023-06-21 09:49:59,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=936198.0, ans=0.2 2023-06-21 09:50:12,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=936258.0, ans=0.0 2023-06-21 09:50:31,074 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.66 vs. 
limit=22.5 2023-06-21 09:50:38,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=936318.0, ans=0.2 2023-06-21 09:51:06,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=936378.0, ans=0.125 2023-06-21 09:51:13,823 INFO [train.py:996] (0/4) Epoch 6, batch 3600, loss[loss=0.2772, simple_loss=0.3394, pruned_loss=0.1074, over 21854.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.329, pruned_loss=0.09611, over 4274901.41 frames. ], batch size: 118, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:51:53,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=936558.0, ans=0.0 2023-06-21 09:52:33,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=936678.0, ans=15.0 2023-06-21 09:52:55,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=936738.0, ans=0.125 2023-06-21 09:52:56,110 INFO [train.py:996] (0/4) Epoch 6, batch 3650, loss[loss=0.3023, simple_loss=0.3821, pruned_loss=0.1112, over 21544.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3299, pruned_loss=0.09594, over 4273701.08 frames. ], batch size: 508, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:53:08,887 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.288e+02 3.039e+02 3.609e+02 4.641e+02 6.973e+02, threshold=7.218e+02, percent-clipped=1.0 2023-06-21 09:53:39,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=936858.0, ans=0.09899494936611666 2023-06-21 09:54:05,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=936918.0, ans=0.0 2023-06-21 09:54:12,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-21 09:54:32,641 INFO [train.py:996] (0/4) Epoch 6, batch 3700, loss[loss=0.2711, simple_loss=0.3392, pruned_loss=0.1015, over 21792.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3278, pruned_loss=0.09464, over 4285382.59 frames. ], batch size: 247, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:55:39,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=937218.0, ans=0.125 2023-06-21 09:55:59,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=937278.0, ans=0.125 2023-06-21 09:56:15,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=937278.0, ans=0.125 2023-06-21 09:56:18,871 INFO [train.py:996] (0/4) Epoch 6, batch 3750, loss[loss=0.2406, simple_loss=0.301, pruned_loss=0.09013, over 21845.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3276, pruned_loss=0.09483, over 4286100.95 frames. 
], batch size: 107, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:56:31,751 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 2.989e+02 3.545e+02 4.107e+02 7.890e+02, threshold=7.090e+02, percent-clipped=2.0 2023-06-21 09:57:10,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=937458.0, ans=0.0 2023-06-21 09:58:01,411 INFO [train.py:996] (0/4) Epoch 6, batch 3800, loss[loss=0.2586, simple_loss=0.3299, pruned_loss=0.09363, over 21118.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3254, pruned_loss=0.09268, over 4284533.78 frames. ], batch size: 608, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:58:02,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=937638.0, ans=0.0 2023-06-21 09:59:42,402 INFO [train.py:996] (0/4) Epoch 6, batch 3850, loss[loss=0.2498, simple_loss=0.2963, pruned_loss=0.1016, over 21599.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.323, pruned_loss=0.09278, over 4289343.32 frames. ], batch size: 298, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:59:51,412 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-21 09:59:52,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=937938.0, ans=0.125 2023-06-21 09:59:55,382 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 3.412e+02 4.254e+02 5.791e+02 1.316e+03, threshold=8.507e+02, percent-clipped=12.0 2023-06-21 10:01:15,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=938178.0, ans=0.0 2023-06-21 10:01:23,223 INFO [train.py:996] (0/4) Epoch 6, batch 3900, loss[loss=0.2598, simple_loss=0.3155, pruned_loss=0.102, over 21847.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3179, pruned_loss=0.09225, over 4288524.64 frames. ], batch size: 371, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 10:01:37,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=938238.0, ans=0.125 2023-06-21 10:01:42,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=938298.0, ans=0.0 2023-06-21 10:02:07,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-21 10:02:30,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=938418.0, ans=0.1 2023-06-21 10:03:04,366 INFO [train.py:996] (0/4) Epoch 6, batch 3950, loss[loss=0.1902, simple_loss=0.2644, pruned_loss=0.05797, over 21138.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3198, pruned_loss=0.0919, over 4291230.09 frames. 
], batch size: 143, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 10:03:17,126 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.886e+02 3.404e+02 4.103e+02 5.613e+02, threshold=6.809e+02, percent-clipped=0.0 2023-06-21 10:03:41,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=938598.0, ans=0.125 2023-06-21 10:04:15,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=938718.0, ans=0.125 2023-06-21 10:04:17,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=938718.0, ans=0.2 2023-06-21 10:04:45,834 INFO [train.py:996] (0/4) Epoch 6, batch 4000, loss[loss=0.2333, simple_loss=0.2915, pruned_loss=0.08756, over 21778.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3144, pruned_loss=0.08839, over 4283798.53 frames. ], batch size: 351, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:05:06,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=938898.0, ans=0.05 2023-06-21 10:05:08,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=938898.0, ans=0.125 2023-06-21 10:05:08,695 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.53 vs. limit=10.0 2023-06-21 10:05:39,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=938958.0, ans=0.1 2023-06-21 10:06:12,965 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=15.0 2023-06-21 10:06:14,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=939078.0, ans=0.1 2023-06-21 10:06:14,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=22.5 2023-06-21 10:06:20,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=939078.0, ans=0.125 2023-06-21 10:06:26,131 INFO [train.py:996] (0/4) Epoch 6, batch 4050, loss[loss=0.2593, simple_loss=0.3682, pruned_loss=0.07521, over 21271.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3127, pruned_loss=0.08696, over 4286255.25 frames. ], batch size: 548, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:06:27,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.80 vs. 
limit=15.0 2023-06-21 10:06:43,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.907e+02 3.499e+02 4.123e+02 8.601e+02, threshold=6.998e+02, percent-clipped=5.0 2023-06-21 10:07:01,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=939198.0, ans=0.1 2023-06-21 10:07:15,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=939258.0, ans=0.125 2023-06-21 10:07:47,056 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-21 10:08:13,481 INFO [train.py:996] (0/4) Epoch 6, batch 4100, loss[loss=0.2838, simple_loss=0.3567, pruned_loss=0.1054, over 21707.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3142, pruned_loss=0.08709, over 4290048.88 frames. ], batch size: 389, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:09:11,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=939558.0, ans=0.125 2023-06-21 10:09:29,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=939618.0, ans=0.2 2023-06-21 10:09:52,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=939678.0, ans=0.125 2023-06-21 10:09:54,975 INFO [train.py:996] (0/4) Epoch 6, batch 4150, loss[loss=0.1958, simple_loss=0.2895, pruned_loss=0.05106, over 21148.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3144, pruned_loss=0.08438, over 4284997.32 frames. ], batch size: 159, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:10:10,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=939738.0, ans=10.0 2023-06-21 10:10:17,700 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 3.037e+02 3.666e+02 4.331e+02 9.059e+02, threshold=7.332e+02, percent-clipped=3.0 2023-06-21 10:10:50,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=939858.0, ans=0.125 2023-06-21 10:11:00,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.04 vs. limit=8.0 2023-06-21 10:11:03,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0 2023-06-21 10:11:19,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=939918.0, ans=0.125 2023-06-21 10:11:44,080 INFO [train.py:996] (0/4) Epoch 6, batch 4200, loss[loss=0.3015, simple_loss=0.3571, pruned_loss=0.123, over 21450.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3147, pruned_loss=0.08325, over 4279129.61 frames. ], batch size: 473, lr: 5.27e-03, grad_scale: 32.0 2023-06-21 10:11:46,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=940038.0, ans=0.1 2023-06-21 10:11:50,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.16 vs. 
limit=15.0 2023-06-21 10:12:26,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=940158.0, ans=0.0 2023-06-21 10:13:19,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.17 vs. limit=15.0 2023-06-21 10:13:20,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=940278.0, ans=0.2 2023-06-21 10:13:33,272 INFO [train.py:996] (0/4) Epoch 6, batch 4250, loss[loss=0.2627, simple_loss=0.3605, pruned_loss=0.08243, over 21854.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3203, pruned_loss=0.08534, over 4273611.64 frames. ], batch size: 317, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:13:52,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 3.226e+02 3.853e+02 4.783e+02 9.792e+02, threshold=7.707e+02, percent-clipped=2.0 2023-06-21 10:13:53,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=940398.0, ans=0.125 2023-06-21 10:14:04,912 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-21 10:14:39,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=940518.0, ans=0.1 2023-06-21 10:14:49,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=940578.0, ans=0.1 2023-06-21 10:14:57,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=940578.0, ans=0.125 2023-06-21 10:15:15,867 INFO [train.py:996] (0/4) Epoch 6, batch 4300, loss[loss=0.2246, simple_loss=0.3166, pruned_loss=0.06628, over 21830.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3288, pruned_loss=0.08856, over 4270342.99 frames. ], batch size: 282, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:15:30,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=940638.0, ans=0.125 2023-06-21 10:15:31,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=12.0 2023-06-21 10:16:47,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=940878.0, ans=0.125 2023-06-21 10:17:01,942 INFO [train.py:996] (0/4) Epoch 6, batch 4350, loss[loss=0.2791, simple_loss=0.3901, pruned_loss=0.08403, over 21271.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3287, pruned_loss=0.08745, over 4266525.54 frames. ], batch size: 548, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:17:06,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.63 vs. 
limit=22.5 2023-06-21 10:17:07,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=940938.0, ans=0.125 2023-06-21 10:17:16,334 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 2.990e+02 3.501e+02 4.556e+02 7.699e+02, threshold=7.002e+02, percent-clipped=0.0 2023-06-21 10:17:32,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.78 vs. limit=6.0 2023-06-21 10:17:34,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=940998.0, ans=0.0 2023-06-21 10:17:45,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=941058.0, ans=0.125 2023-06-21 10:18:21,648 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=10.0 2023-06-21 10:18:22,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=941178.0, ans=0.2 2023-06-21 10:18:35,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=941178.0, ans=0.125 2023-06-21 10:18:43,535 INFO [train.py:996] (0/4) Epoch 6, batch 4400, loss[loss=0.278, simple_loss=0.3982, pruned_loss=0.07891, over 19859.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3248, pruned_loss=0.08734, over 4256519.61 frames. ], batch size: 702, lr: 5.27e-03, grad_scale: 32.0 2023-06-21 10:18:47,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=941238.0, ans=0.0 2023-06-21 10:18:52,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=941238.0, ans=0.125 2023-06-21 10:19:15,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-21 10:19:19,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=941298.0, ans=0.125 2023-06-21 10:19:35,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=941358.0, ans=0.125 2023-06-21 10:20:20,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-21 10:20:26,310 INFO [train.py:996] (0/4) Epoch 6, batch 4450, loss[loss=0.2641, simple_loss=0.3258, pruned_loss=0.1011, over 21449.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3322, pruned_loss=0.08884, over 4267223.56 frames. ], batch size: 131, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:20:34,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=941538.0, ans=0.125 2023-06-21 10:20:47,412 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.365e+02 2.781e+02 3.285e+02 3.917e+02 7.316e+02, threshold=6.570e+02, percent-clipped=2.0 2023-06-21 10:21:01,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.64 vs. 
limit=22.5 2023-06-21 10:21:13,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=941658.0, ans=0.125 2023-06-21 10:21:32,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=941718.0, ans=0.125 2023-06-21 10:22:06,951 INFO [train.py:996] (0/4) Epoch 6, batch 4500, loss[loss=0.3019, simple_loss=0.3812, pruned_loss=0.1113, over 21893.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3339, pruned_loss=0.09147, over 4276425.18 frames. ], batch size: 371, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:23:10,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=942018.0, ans=0.1 2023-06-21 10:23:44,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2023-06-21 10:23:54,934 INFO [train.py:996] (0/4) Epoch 6, batch 4550, loss[loss=0.264, simple_loss=0.338, pruned_loss=0.09494, over 21314.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3357, pruned_loss=0.09136, over 4275116.53 frames. ], batch size: 548, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:24:04,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=942138.0, ans=0.0 2023-06-21 10:24:06,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=942138.0, ans=0.0 2023-06-21 10:24:16,146 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.720e+02 3.042e+02 3.501e+02 7.303e+02, threshold=6.084e+02, percent-clipped=1.0 2023-06-21 10:24:18,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=942198.0, ans=0.0 2023-06-21 10:24:29,557 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-21 10:24:48,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=942258.0, ans=0.125 2023-06-21 10:24:59,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=942318.0, ans=0.125 2023-06-21 10:25:08,948 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=22.5 2023-06-21 10:25:37,668 INFO [train.py:996] (0/4) Epoch 6, batch 4600, loss[loss=0.1976, simple_loss=0.2784, pruned_loss=0.05843, over 21656.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3376, pruned_loss=0.09309, over 4274057.18 frames. 
], batch size: 230, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:25:46,267 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:26:15,747 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:26:17,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=942498.0, ans=0.1 2023-06-21 10:26:26,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=942558.0, ans=0.1 2023-06-21 10:26:41,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=942618.0, ans=0.0 2023-06-21 10:26:51,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=942618.0, ans=0.125 2023-06-21 10:27:02,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=942678.0, ans=0.0 2023-06-21 10:27:10,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=942678.0, ans=0.2 2023-06-21 10:27:18,167 INFO [train.py:996] (0/4) Epoch 6, batch 4650, loss[loss=0.1999, simple_loss=0.2644, pruned_loss=0.06775, over 21251.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3316, pruned_loss=0.09207, over 4269924.74 frames. ], batch size: 159, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:27:44,643 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.693e+02 3.104e+02 3.574e+02 6.080e+02, threshold=6.208e+02, percent-clipped=0.0 2023-06-21 10:28:30,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-21 10:28:58,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=943038.0, ans=0.125 2023-06-21 10:28:59,782 INFO [train.py:996] (0/4) Epoch 6, batch 4700, loss[loss=0.2055, simple_loss=0.2714, pruned_loss=0.06975, over 21664.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3218, pruned_loss=0.08938, over 4270517.69 frames. ], batch size: 282, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:29:12,484 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.61 vs. limit=15.0 2023-06-21 10:29:23,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=943038.0, ans=0.125 2023-06-21 10:30:08,496 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.26 vs. limit=15.0 2023-06-21 10:30:19,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=943278.0, ans=0.1 2023-06-21 10:30:34,056 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:30:39,995 INFO [train.py:996] (0/4) Epoch 6, batch 4750, loss[loss=0.2511, simple_loss=0.3055, pruned_loss=0.09837, over 20252.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3163, pruned_loss=0.08988, over 4270483.19 frames. 
], batch size: 707, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:30:42,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-06-21 10:31:00,526 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.759e+02 3.391e+02 4.138e+02 8.179e+02, threshold=6.782e+02, percent-clipped=2.0 2023-06-21 10:32:04,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=943578.0, ans=0.0 2023-06-21 10:32:20,466 INFO [train.py:996] (0/4) Epoch 6, batch 4800, loss[loss=0.2325, simple_loss=0.3445, pruned_loss=0.06021, over 19793.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3162, pruned_loss=0.09003, over 4275427.81 frames. ], batch size: 703, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:33:06,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=943758.0, ans=0.0 2023-06-21 10:33:34,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=943818.0, ans=0.5 2023-06-21 10:34:00,644 INFO [train.py:996] (0/4) Epoch 6, batch 4850, loss[loss=0.2348, simple_loss=0.3084, pruned_loss=0.08066, over 21657.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3159, pruned_loss=0.08898, over 4277037.00 frames. ], batch size: 298, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:34:13,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=943938.0, ans=0.125 2023-06-21 10:34:28,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 3.010e+02 3.635e+02 4.678e+02 6.819e+02, threshold=7.270e+02, percent-clipped=1.0 2023-06-21 10:34:30,968 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.32 vs. limit=22.5 2023-06-21 10:34:34,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=943998.0, ans=0.125 2023-06-21 10:35:16,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=944118.0, ans=0.125 2023-06-21 10:35:33,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=944178.0, ans=0.125 2023-06-21 10:35:36,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=944178.0, ans=0.1 2023-06-21 10:35:42,686 INFO [train.py:996] (0/4) Epoch 6, batch 4900, loss[loss=0.2118, simple_loss=0.2941, pruned_loss=0.06473, over 20108.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3165, pruned_loss=0.08969, over 4274240.35 frames. ], batch size: 703, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:35:44,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=944238.0, ans=0.125 2023-06-21 10:36:17,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=944298.0, ans=0.125 2023-06-21 10:36:46,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. 
limit=15.0 2023-06-21 10:36:47,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=944418.0, ans=0.025 2023-06-21 10:37:10,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=944478.0, ans=0.2 2023-06-21 10:37:36,983 INFO [train.py:996] (0/4) Epoch 6, batch 4950, loss[loss=0.1847, simple_loss=0.2695, pruned_loss=0.04996, over 21314.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3211, pruned_loss=0.0877, over 4275867.07 frames. ], batch size: 211, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:37:53,943 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 2.748e+02 3.392e+02 4.071e+02 6.752e+02, threshold=6.784e+02, percent-clipped=0.0 2023-06-21 10:37:56,648 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.28 vs. limit=15.0 2023-06-21 10:38:03,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=944598.0, ans=0.2 2023-06-21 10:38:23,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=944658.0, ans=0.0 2023-06-21 10:39:10,147 INFO [train.py:996] (0/4) Epoch 6, batch 5000, loss[loss=0.2383, simple_loss=0.3093, pruned_loss=0.08361, over 21455.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3204, pruned_loss=0.08459, over 4279043.52 frames. ], batch size: 194, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:39:10,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=944838.0, ans=0.125 2023-06-21 10:39:28,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=944898.0, ans=0.0 2023-06-21 10:40:02,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=945018.0, ans=0.125 2023-06-21 10:40:43,884 INFO [train.py:996] (0/4) Epoch 6, batch 5050, loss[loss=0.2371, simple_loss=0.3074, pruned_loss=0.08337, over 21933.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3202, pruned_loss=0.08696, over 4291084.94 frames. ], batch size: 333, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:40:51,858 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:41:00,728 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.826e+02 3.138e+02 3.685e+02 6.329e+02, threshold=6.276e+02, percent-clipped=0.0 2023-06-21 10:41:56,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=945378.0, ans=0.1 2023-06-21 10:41:56,157 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:41:58,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=945378.0, ans=0.0 2023-06-21 10:42:16,828 INFO [train.py:996] (0/4) Epoch 6, batch 5100, loss[loss=0.234, simple_loss=0.2996, pruned_loss=0.08419, over 21861.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3188, pruned_loss=0.08684, over 4285776.85 frames. 
], batch size: 124, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:42:19,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-21 10:42:49,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=945498.0, ans=0.2 2023-06-21 10:43:21,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=945618.0, ans=0.0 2023-06-21 10:43:56,821 INFO [train.py:996] (0/4) Epoch 6, batch 5150, loss[loss=0.2627, simple_loss=0.3392, pruned_loss=0.09315, over 21400.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3202, pruned_loss=0.0884, over 4289213.27 frames. ], batch size: 548, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:44:19,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.377e+02 2.990e+02 3.465e+02 4.317e+02 6.616e+02, threshold=6.931e+02, percent-clipped=1.0 2023-06-21 10:45:00,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=945918.0, ans=0.125 2023-06-21 10:45:42,242 INFO [train.py:996] (0/4) Epoch 6, batch 5200, loss[loss=0.2208, simple_loss=0.2804, pruned_loss=0.08055, over 21335.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3214, pruned_loss=0.08926, over 4286283.13 frames. ], batch size: 176, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:45:58,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=946098.0, ans=0.2 2023-06-21 10:46:02,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=22.5 2023-06-21 10:46:16,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=946098.0, ans=0.125 2023-06-21 10:46:19,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=946158.0, ans=0.1 2023-06-21 10:46:27,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=946158.0, ans=0.2 2023-06-21 10:47:03,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=946278.0, ans=0.125 2023-06-21 10:47:12,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=946278.0, ans=0.1 2023-06-21 10:47:21,857 INFO [train.py:996] (0/4) Epoch 6, batch 5250, loss[loss=0.1935, simple_loss=0.2555, pruned_loss=0.06573, over 21821.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3247, pruned_loss=0.08769, over 4284267.07 frames. ], batch size: 102, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:47:39,386 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.923e+02 3.658e+02 4.333e+02 7.638e+02, threshold=7.316e+02, percent-clipped=1.0 2023-06-21 10:48:11,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-21 10:49:00,850 INFO [train.py:996] (0/4) Epoch 6, batch 5300, loss[loss=0.2591, simple_loss=0.32, pruned_loss=0.09907, over 21893.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3241, pruned_loss=0.089, over 4289844.31 frames. 
], batch size: 351, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:49:11,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-21 10:49:21,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=946698.0, ans=0.125 2023-06-21 10:50:14,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=946818.0, ans=0.2 2023-06-21 10:50:28,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=946878.0, ans=0.0 2023-06-21 10:50:39,183 INFO [train.py:996] (0/4) Epoch 6, batch 5350, loss[loss=0.2481, simple_loss=0.3115, pruned_loss=0.09232, over 21903.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3225, pruned_loss=0.09015, over 4294804.39 frames. ], batch size: 316, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:50:58,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.908e+02 3.182e+02 3.564e+02 5.714e+02, threshold=6.365e+02, percent-clipped=0.0 2023-06-21 10:51:16,609 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=12.0 2023-06-21 10:51:21,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.24 vs. limit=6.0 2023-06-21 10:51:40,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=947118.0, ans=0.125 2023-06-21 10:51:51,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=947118.0, ans=0.0 2023-06-21 10:52:18,106 INFO [train.py:996] (0/4) Epoch 6, batch 5400, loss[loss=0.2209, simple_loss=0.2778, pruned_loss=0.08203, over 21651.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3218, pruned_loss=0.08973, over 4283663.36 frames. ], batch size: 263, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:52:31,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=947238.0, ans=0.0 2023-06-21 10:52:45,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=947298.0, ans=0.125 2023-06-21 10:52:49,639 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.19 vs. limit=15.0 2023-06-21 10:53:15,604 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-06-21 10:53:26,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=947418.0, ans=0.125 2023-06-21 10:53:41,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=947478.0, ans=0.0 2023-06-21 10:53:53,500 INFO [train.py:996] (0/4) Epoch 6, batch 5450, loss[loss=0.2645, simple_loss=0.3219, pruned_loss=0.1036, over 21181.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.321, pruned_loss=0.08758, over 4281732.75 frames. 
], batch size: 608, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:54:17,112 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.387e+02 2.977e+02 3.575e+02 4.401e+02 6.671e+02, threshold=7.149e+02, percent-clipped=1.0 2023-06-21 10:55:04,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-21 10:55:34,831 INFO [train.py:996] (0/4) Epoch 6, batch 5500, loss[loss=0.2048, simple_loss=0.3085, pruned_loss=0.05052, over 21658.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3256, pruned_loss=0.08422, over 4279133.58 frames. ], batch size: 263, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:55:49,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=947838.0, ans=0.125 2023-06-21 10:56:06,456 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.25 vs. limit=10.0 2023-06-21 10:56:35,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=947958.0, ans=0.1 2023-06-21 10:56:52,207 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.59 vs. limit=12.0 2023-06-21 10:57:00,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-21 10:57:20,582 INFO [train.py:996] (0/4) Epoch 6, batch 5550, loss[loss=0.2614, simple_loss=0.3339, pruned_loss=0.09442, over 21016.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3243, pruned_loss=0.08035, over 4272008.10 frames. ], batch size: 607, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:57:36,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=948138.0, ans=0.125 2023-06-21 10:57:39,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=948138.0, ans=0.125 2023-06-21 10:57:45,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.700e+02 3.215e+02 3.984e+02 5.956e+02, threshold=6.431e+02, percent-clipped=0.0 2023-06-21 10:58:05,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=948198.0, ans=0.025 2023-06-21 10:58:11,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=948258.0, ans=0.0 2023-06-21 10:58:14,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=948258.0, ans=0.125 2023-06-21 10:58:15,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-21 10:59:07,572 INFO [train.py:996] (0/4) Epoch 6, batch 5600, loss[loss=0.2345, simple_loss=0.3071, pruned_loss=0.08093, over 21149.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3204, pruned_loss=0.07706, over 4278984.24 frames. 
], batch size: 143, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 10:59:19,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=948438.0, ans=0.125 2023-06-21 10:59:37,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=948498.0, ans=0.125 2023-06-21 10:59:37,877 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.57 vs. limit=15.0 2023-06-21 10:59:40,604 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-21 10:59:41,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=948498.0, ans=0.125 2023-06-21 11:00:46,532 INFO [train.py:996] (0/4) Epoch 6, batch 5650, loss[loss=0.2514, simple_loss=0.3206, pruned_loss=0.0911, over 21888.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.324, pruned_loss=0.07927, over 4274985.64 frames. ], batch size: 351, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:01:09,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=948798.0, ans=0.0 2023-06-21 11:01:10,945 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.904e+02 3.593e+02 4.833e+02 7.419e+02, threshold=7.185e+02, percent-clipped=8.0 2023-06-21 11:01:17,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=948798.0, ans=0.125 2023-06-21 11:02:28,412 INFO [train.py:996] (0/4) Epoch 6, batch 5700, loss[loss=0.2979, simple_loss=0.3486, pruned_loss=0.1236, over 21607.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3254, pruned_loss=0.08176, over 4275456.23 frames. ], batch size: 548, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:02:30,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=949038.0, ans=0.0 2023-06-21 11:02:31,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=22.5 2023-06-21 11:03:05,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=949158.0, ans=0.125 2023-06-21 11:03:28,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=22.5 2023-06-21 11:03:59,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=949278.0, ans=0.1 2023-06-21 11:04:14,750 INFO [train.py:996] (0/4) Epoch 6, batch 5750, loss[loss=0.1884, simple_loss=0.2837, pruned_loss=0.04656, over 21778.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3221, pruned_loss=0.0803, over 4282032.71 frames. 
], batch size: 332, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:04:34,866 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.640e+02 3.214e+02 3.808e+02 7.764e+02, threshold=6.428e+02, percent-clipped=2.0 2023-06-21 11:04:55,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=949458.0, ans=0.2 2023-06-21 11:05:00,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=949458.0, ans=0.1 2023-06-21 11:05:10,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=949458.0, ans=0.125 2023-06-21 11:05:13,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=949518.0, ans=0.1 2023-06-21 11:05:26,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=949518.0, ans=0.125 2023-06-21 11:05:31,501 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:05:36,202 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:05:56,178 INFO [train.py:996] (0/4) Epoch 6, batch 5800, loss[loss=0.3738, simple_loss=0.4416, pruned_loss=0.153, over 21501.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3218, pruned_loss=0.07887, over 4283499.97 frames. ], batch size: 508, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:06:09,147 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.89 vs. limit=15.0 2023-06-21 11:06:51,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=949758.0, ans=0.2 2023-06-21 11:07:16,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=949818.0, ans=0.1 2023-06-21 11:07:17,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=949818.0, ans=0.0 2023-06-21 11:07:29,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=949878.0, ans=0.0 2023-06-21 11:07:39,080 INFO [train.py:996] (0/4) Epoch 6, batch 5850, loss[loss=0.1652, simple_loss=0.2428, pruned_loss=0.04381, over 21900.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3196, pruned_loss=0.07484, over 4287118.37 frames. ], batch size: 107, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:08:03,597 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 2.386e+02 2.763e+02 3.432e+02 5.220e+02, threshold=5.525e+02, percent-clipped=0.0 2023-06-21 11:09:18,227 INFO [train.py:996] (0/4) Epoch 6, batch 5900, loss[loss=0.2148, simple_loss=0.2864, pruned_loss=0.07161, over 21779.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3122, pruned_loss=0.06956, over 4280987.76 frames. 
], batch size: 298, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:09:20,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=950238.0, ans=0.125 2023-06-21 11:09:41,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=950298.0, ans=0.0 2023-06-21 11:09:51,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=950298.0, ans=0.0 2023-06-21 11:10:36,940 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:10:40,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=950478.0, ans=0.1 2023-06-21 11:10:57,254 INFO [train.py:996] (0/4) Epoch 6, batch 5950, loss[loss=0.2268, simple_loss=0.2924, pruned_loss=0.08058, over 21746.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3109, pruned_loss=0.07317, over 4278768.47 frames. ], batch size: 333, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 11:11:19,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=950598.0, ans=0.125 2023-06-21 11:11:22,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 2.586e+02 3.198e+02 3.905e+02 6.345e+02, threshold=6.395e+02, percent-clipped=3.0 2023-06-21 11:12:15,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=950718.0, ans=0.125 2023-06-21 11:12:31,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=950778.0, ans=0.125 2023-06-21 11:12:40,623 INFO [train.py:996] (0/4) Epoch 6, batch 6000, loss[loss=0.223, simple_loss=0.2768, pruned_loss=0.08464, over 21753.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3066, pruned_loss=0.07699, over 4285071.16 frames. ], batch size: 112, lr: 5.24e-03, grad_scale: 32.0 2023-06-21 11:12:40,624 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 11:12:57,292 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2656, simple_loss=0.3626, pruned_loss=0.08426, over 1796401.00 frames. 2023-06-21 11:12:57,293 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 11:13:17,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=950898.0, ans=0.125 2023-06-21 11:14:26,006 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0 2023-06-21 11:14:43,506 INFO [train.py:996] (0/4) Epoch 6, batch 6050, loss[loss=0.1933, simple_loss=0.2649, pruned_loss=0.06088, over 21696.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3016, pruned_loss=0.07801, over 4278807.25 frames. 
], batch size: 247, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:14:50,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=951138.0, ans=0.0 2023-06-21 11:15:16,957 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.804e+02 3.230e+02 3.761e+02 6.873e+02, threshold=6.459e+02, percent-clipped=1.0 2023-06-21 11:15:17,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=951198.0, ans=0.125 2023-06-21 11:15:20,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=951198.0, ans=0.1 2023-06-21 11:15:34,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=951258.0, ans=0.0 2023-06-21 11:15:34,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=951258.0, ans=0.125 2023-06-21 11:16:16,222 INFO [train.py:996] (0/4) Epoch 6, batch 6100, loss[loss=0.2028, simple_loss=0.3003, pruned_loss=0.0526, over 21803.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2995, pruned_loss=0.07689, over 4282692.61 frames. ], batch size: 371, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:16:17,465 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=12.0 2023-06-21 11:16:57,274 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:17:04,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.13 vs. limit=15.0 2023-06-21 11:17:35,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=951618.0, ans=0.0 2023-06-21 11:17:54,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=951738.0, ans=0.125 2023-06-21 11:17:55,300 INFO [train.py:996] (0/4) Epoch 6, batch 6150, loss[loss=0.2201, simple_loss=0.2859, pruned_loss=0.07715, over 21527.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3023, pruned_loss=0.07961, over 4285975.11 frames. ], batch size: 195, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:18:08,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=951738.0, ans=0.125 2023-06-21 11:18:15,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=951798.0, ans=0.125 2023-06-21 11:18:33,350 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.602e+02 3.011e+02 3.655e+02 5.167e+02, threshold=6.022e+02, percent-clipped=0.0 2023-06-21 11:19:07,900 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:19:39,850 INFO [train.py:996] (0/4) Epoch 6, batch 6200, loss[loss=0.2154, simple_loss=0.2891, pruned_loss=0.0709, over 21381.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3054, pruned_loss=0.07969, over 4277167.59 frames. ], batch size: 159, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:19:44,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. 
limit=10.0 2023-06-21 11:19:57,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=952038.0, ans=0.2 2023-06-21 11:20:29,300 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:20:47,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=952218.0, ans=0.0 2023-06-21 11:20:51,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=952218.0, ans=0.2 2023-06-21 11:21:11,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-21 11:21:20,442 INFO [train.py:996] (0/4) Epoch 6, batch 6250, loss[loss=0.227, simple_loss=0.3293, pruned_loss=0.06239, over 21784.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3115, pruned_loss=0.07903, over 4273342.17 frames. ], batch size: 332, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:21:53,730 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 3.114e+02 4.042e+02 5.400e+02 9.374e+02, threshold=8.084e+02, percent-clipped=17.0 2023-06-21 11:21:56,207 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-21 11:22:13,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=952458.0, ans=0.125 2023-06-21 11:22:22,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=952518.0, ans=0.0 2023-06-21 11:22:37,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=952578.0, ans=0.2 2023-06-21 11:22:39,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=952578.0, ans=0.125 2023-06-21 11:22:57,956 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.28 vs. limit=10.0 2023-06-21 11:22:58,343 INFO [train.py:996] (0/4) Epoch 6, batch 6300, loss[loss=0.3125, simple_loss=0.4329, pruned_loss=0.09603, over 20816.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3156, pruned_loss=0.07898, over 4267915.19 frames. ], batch size: 607, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:23:55,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=952758.0, ans=0.1 2023-06-21 11:24:48,258 INFO [train.py:996] (0/4) Epoch 6, batch 6350, loss[loss=0.2505, simple_loss=0.3248, pruned_loss=0.08812, over 21807.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3204, pruned_loss=0.08477, over 4276822.91 frames. 
], batch size: 282, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:25:17,182 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 3.069e+02 3.635e+02 4.276e+02 7.885e+02, threshold=7.269e+02, percent-clipped=0.0 2023-06-21 11:25:54,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=953118.0, ans=0.125 2023-06-21 11:26:20,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=953178.0, ans=0.125 2023-06-21 11:26:25,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953178.0, ans=0.1 2023-06-21 11:26:28,479 INFO [train.py:996] (0/4) Epoch 6, batch 6400, loss[loss=0.3066, simple_loss=0.3695, pruned_loss=0.1219, over 21821.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3282, pruned_loss=0.09074, over 4276213.58 frames. ], batch size: 441, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:27:07,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=953358.0, ans=0.125 2023-06-21 11:27:20,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=953358.0, ans=0.0 2023-06-21 11:27:30,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=953418.0, ans=0.0 2023-06-21 11:27:41,809 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=22.5 2023-06-21 11:28:12,567 INFO [train.py:996] (0/4) Epoch 6, batch 6450, loss[loss=0.2156, simple_loss=0.2838, pruned_loss=0.07374, over 21815.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3301, pruned_loss=0.08914, over 4277995.63 frames. ], batch size: 124, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:28:19,985 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.58 vs. limit=22.5 2023-06-21 11:28:29,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=22.5 2023-06-21 11:28:36,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.859e+02 3.374e+02 4.192e+02 6.332e+02, threshold=6.748e+02, percent-clipped=0.0 2023-06-21 11:29:06,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=953718.0, ans=0.0 2023-06-21 11:29:33,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=953778.0, ans=0.0 2023-06-21 11:29:54,950 INFO [train.py:996] (0/4) Epoch 6, batch 6500, loss[loss=0.2032, simple_loss=0.2736, pruned_loss=0.06643, over 21533.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3239, pruned_loss=0.08721, over 4272708.12 frames. 
], batch size: 230, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:30:37,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=953958.0, ans=0.125 2023-06-21 11:31:14,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=954078.0, ans=0.125 2023-06-21 11:31:33,505 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-21 11:31:35,563 INFO [train.py:996] (0/4) Epoch 6, batch 6550, loss[loss=0.2098, simple_loss=0.2887, pruned_loss=0.06549, over 21589.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3219, pruned_loss=0.08549, over 4261659.27 frames. ], batch size: 230, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:31:59,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.779e+02 3.082e+02 3.818e+02 7.032e+02, threshold=6.164e+02, percent-clipped=1.0 2023-06-21 11:31:59,656 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:32:23,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=954258.0, ans=0.125 2023-06-21 11:32:34,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=15.0 2023-06-21 11:32:50,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=954318.0, ans=0.0 2023-06-21 11:33:14,290 INFO [train.py:996] (0/4) Epoch 6, batch 6600, loss[loss=0.2139, simple_loss=0.2752, pruned_loss=0.07628, over 21799.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3165, pruned_loss=0.08519, over 4273367.52 frames. ], batch size: 98, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:33:14,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=954438.0, ans=0.125 2023-06-21 11:33:26,432 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-21 11:33:55,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=954558.0, ans=0.2 2023-06-21 11:34:13,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=954618.0, ans=0.2 2023-06-21 11:34:27,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=954618.0, ans=0.1 2023-06-21 11:34:33,409 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-21 11:34:52,655 INFO [train.py:996] (0/4) Epoch 6, batch 6650, loss[loss=0.2236, simple_loss=0.2692, pruned_loss=0.08902, over 20111.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3092, pruned_loss=0.0826, over 4271586.69 frames. 
], batch size: 703, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:34:54,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=954738.0, ans=0.5 2023-06-21 11:34:59,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=954738.0, ans=0.125 2023-06-21 11:35:21,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.585e+02 3.021e+02 3.677e+02 6.066e+02, threshold=6.041e+02, percent-clipped=0.0 2023-06-21 11:35:31,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=954858.0, ans=0.2 2023-06-21 11:35:36,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=954858.0, ans=0.0 2023-06-21 11:36:30,494 INFO [train.py:996] (0/4) Epoch 6, batch 6700, loss[loss=0.2096, simple_loss=0.2802, pruned_loss=0.0695, over 21817.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3029, pruned_loss=0.08237, over 4267554.25 frames. ], batch size: 118, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:36:39,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-21 11:37:45,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=955218.0, ans=0.0 2023-06-21 11:38:09,245 INFO [train.py:996] (0/4) Epoch 6, batch 6750, loss[loss=0.2065, simple_loss=0.2707, pruned_loss=0.07114, over 21263.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3005, pruned_loss=0.08282, over 4263227.58 frames. ], batch size: 176, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:38:19,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=955338.0, ans=0.0 2023-06-21 11:38:20,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=955338.0, ans=0.1 2023-06-21 11:38:24,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=955398.0, ans=0.025 2023-06-21 11:38:32,823 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.820e+02 3.249e+02 3.969e+02 6.943e+02, threshold=6.498e+02, percent-clipped=2.0 2023-06-21 11:38:46,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=955458.0, ans=0.07 2023-06-21 11:38:47,578 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:39:14,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=955518.0, ans=0.125 2023-06-21 11:39:47,099 INFO [train.py:996] (0/4) Epoch 6, batch 6800, loss[loss=0.2668, simple_loss=0.321, pruned_loss=0.1063, over 21883.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3027, pruned_loss=0.08531, over 4275243.52 frames. 
], batch size: 98, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:41:03,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=955878.0, ans=0.0 2023-06-21 11:41:04,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=955878.0, ans=0.125 2023-06-21 11:41:13,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=955878.0, ans=0.035 2023-06-21 11:41:24,282 INFO [train.py:996] (0/4) Epoch 6, batch 6850, loss[loss=0.2554, simple_loss=0.2984, pruned_loss=0.1062, over 21463.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3015, pruned_loss=0.08656, over 4270976.27 frames. ], batch size: 509, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:41:34,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.80 vs. limit=8.0 2023-06-21 11:41:38,218 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.02 vs. limit=15.0 2023-06-21 11:41:48,463 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 2.884e+02 3.443e+02 4.128e+02 6.086e+02, threshold=6.887e+02, percent-clipped=0.0 2023-06-21 11:42:12,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=956058.0, ans=0.0 2023-06-21 11:42:57,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=12.0 2023-06-21 11:43:04,883 INFO [train.py:996] (0/4) Epoch 6, batch 6900, loss[loss=0.3075, simple_loss=0.3736, pruned_loss=0.1207, over 21622.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3039, pruned_loss=0.08668, over 4280534.15 frames. ], batch size: 508, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:43:39,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-21 11:44:42,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=956478.0, ans=0.1 2023-06-21 11:44:45,083 INFO [train.py:996] (0/4) Epoch 6, batch 6950, loss[loss=0.2812, simple_loss=0.3542, pruned_loss=0.1041, over 21720.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3058, pruned_loss=0.08335, over 4275034.80 frames. ], batch size: 332, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:45:09,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=956598.0, ans=0.0 2023-06-21 11:45:13,817 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.660e+02 3.171e+02 3.598e+02 5.873e+02, threshold=6.343e+02, percent-clipped=0.0 2023-06-21 11:45:54,039 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:46:24,079 INFO [train.py:996] (0/4) Epoch 6, batch 7000, loss[loss=0.261, simple_loss=0.3166, pruned_loss=0.1027, over 21447.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.309, pruned_loss=0.08602, over 4279768.30 frames. 
], batch size: 389, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:46:55,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=956898.0, ans=0.125 2023-06-21 11:47:39,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=957018.0, ans=0.1 2023-06-21 11:47:39,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=957018.0, ans=0.07 2023-06-21 11:48:04,836 INFO [train.py:996] (0/4) Epoch 6, batch 7050, loss[loss=0.2323, simple_loss=0.3143, pruned_loss=0.07509, over 21730.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3058, pruned_loss=0.08441, over 4282568.73 frames. ], batch size: 351, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:48:38,900 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.010e+02 3.429e+02 4.410e+02 6.547e+02, threshold=6.858e+02, percent-clipped=1.0 2023-06-21 11:48:41,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=957198.0, ans=0.125 2023-06-21 11:48:53,557 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-21 11:49:02,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=957258.0, ans=0.0 2023-06-21 11:49:38,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=957378.0, ans=0.0 2023-06-21 11:49:45,429 INFO [train.py:996] (0/4) Epoch 6, batch 7100, loss[loss=0.2015, simple_loss=0.2803, pruned_loss=0.06138, over 21301.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3117, pruned_loss=0.0863, over 4285950.11 frames. ], batch size: 176, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:49:57,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=957438.0, ans=0.125 2023-06-21 11:50:23,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=957498.0, ans=0.125 2023-06-21 11:50:34,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.40 vs. limit=10.0 2023-06-21 11:50:37,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=957558.0, ans=0.0 2023-06-21 11:50:58,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=957618.0, ans=0.125 2023-06-21 11:51:22,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=957678.0, ans=0.0 2023-06-21 11:51:25,109 INFO [train.py:996] (0/4) Epoch 6, batch 7150, loss[loss=0.1895, simple_loss=0.2763, pruned_loss=0.05135, over 21763.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3078, pruned_loss=0.0826, over 4274269.36 frames. 
], batch size: 332, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:51:58,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=957798.0, ans=0.0 2023-06-21 11:52:08,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.713e+02 3.100e+02 3.583e+02 6.411e+02, threshold=6.200e+02, percent-clipped=0.0 2023-06-21 11:52:12,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=957798.0, ans=0.125 2023-06-21 11:53:10,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=957978.0, ans=0.0 2023-06-21 11:53:11,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=957978.0, ans=0.1 2023-06-21 11:53:13,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=958038.0, ans=0.0 2023-06-21 11:53:14,735 INFO [train.py:996] (0/4) Epoch 6, batch 7200, loss[loss=0.2446, simple_loss=0.3546, pruned_loss=0.0673, over 19707.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3112, pruned_loss=0.08481, over 4269199.81 frames. ], batch size: 703, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:53:41,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-21 11:53:45,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=958098.0, ans=0.1 2023-06-21 11:54:54,090 INFO [train.py:996] (0/4) Epoch 6, batch 7250, loss[loss=0.2681, simple_loss=0.3739, pruned_loss=0.08117, over 19813.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3084, pruned_loss=0.08509, over 4260735.25 frames. ], batch size: 703, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 11:54:54,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=958338.0, ans=0.1 2023-06-21 11:55:28,083 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.745e+02 3.061e+02 4.034e+02 7.842e+02, threshold=6.122e+02, percent-clipped=5.0 2023-06-21 11:55:54,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=958518.0, ans=0.125 2023-06-21 11:56:33,721 INFO [train.py:996] (0/4) Epoch 6, batch 7300, loss[loss=0.1923, simple_loss=0.2927, pruned_loss=0.04597, over 20781.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3032, pruned_loss=0.0843, over 4267853.53 frames. 
], batch size: 609, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 11:57:05,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=958698.0, ans=0.2 2023-06-21 11:57:29,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=958758.0, ans=0.0 2023-06-21 11:57:53,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=958878.0, ans=0.125 2023-06-21 11:58:15,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=958938.0, ans=0.125 2023-06-21 11:58:21,353 INFO [train.py:996] (0/4) Epoch 6, batch 7350, loss[loss=0.2511, simple_loss=0.3181, pruned_loss=0.09205, over 21550.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3017, pruned_loss=0.08552, over 4262474.11 frames. ], batch size: 389, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 11:58:49,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0 2023-06-21 11:58:52,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.849e+02 3.277e+02 4.064e+02 7.126e+02, threshold=6.555e+02, percent-clipped=2.0 2023-06-21 11:59:41,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=959178.0, ans=0.0 2023-06-21 12:00:06,536 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:00:06,566 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:00:07,541 INFO [train.py:996] (0/4) Epoch 6, batch 7400, loss[loss=0.2237, simple_loss=0.3045, pruned_loss=0.07144, over 21690.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3085, pruned_loss=0.08864, over 4265775.09 frames. ], batch size: 247, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:00:10,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.57 vs. limit=22.5 2023-06-21 12:00:17,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=959238.0, ans=0.125 2023-06-21 12:00:21,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=959238.0, ans=0.125 2023-06-21 12:00:22,688 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:00:35,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=959298.0, ans=0.2 2023-06-21 12:00:37,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=959298.0, ans=0.125 2023-06-21 12:00:40,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=959358.0, ans=0.0 2023-06-21 12:00:44,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=959358.0, ans=0.0 2023-06-21 12:01:48,250 INFO [train.py:996] (0/4) Epoch 6, batch 7450, loss[loss=0.2747, simple_loss=0.3215, pruned_loss=0.1139, over 21371.00 frames. 
], tot_loss[loss=0.24, simple_loss=0.3063, pruned_loss=0.08688, over 4269184.82 frames. ], batch size: 473, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:02:01,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=959538.0, ans=0.0 2023-06-21 12:02:02,959 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:02:13,863 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.784e+02 3.192e+02 3.792e+02 7.564e+02, threshold=6.383e+02, percent-clipped=1.0 2023-06-21 12:02:14,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=959598.0, ans=0.0 2023-06-21 12:02:41,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=959718.0, ans=0.1 2023-06-21 12:03:19,231 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.87 vs. limit=12.0 2023-06-21 12:03:24,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=959778.0, ans=0.125 2023-06-21 12:03:31,324 INFO [train.py:996] (0/4) Epoch 6, batch 7500, loss[loss=0.3285, simple_loss=0.4124, pruned_loss=0.1223, over 21655.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3127, pruned_loss=0.08901, over 4270739.78 frames. ], batch size: 441, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:04:12,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=959958.0, ans=0.2 2023-06-21 12:04:18,757 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-160000.pt 2023-06-21 12:04:55,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=960018.0, ans=0.125 2023-06-21 12:05:12,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=960078.0, ans=0.125 2023-06-21 12:05:16,823 INFO [train.py:996] (0/4) Epoch 6, batch 7550, loss[loss=0.2284, simple_loss=0.3192, pruned_loss=0.06885, over 21639.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.32, pruned_loss=0.08673, over 4280527.67 frames. 
], batch size: 230, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:05:29,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=960138.0, ans=0.0 2023-06-21 12:05:41,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 3.161e+02 3.642e+02 4.665e+02 7.611e+02, threshold=7.284e+02, percent-clipped=6.0 2023-06-21 12:06:18,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=960318.0, ans=0.1 2023-06-21 12:06:38,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=960378.0, ans=0.0 2023-06-21 12:06:50,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=960378.0, ans=0.0 2023-06-21 12:06:51,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=960378.0, ans=0.1 2023-06-21 12:06:53,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=960378.0, ans=0.125 2023-06-21 12:06:56,349 INFO [train.py:996] (0/4) Epoch 6, batch 7600, loss[loss=0.2487, simple_loss=0.3111, pruned_loss=0.09312, over 21359.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3192, pruned_loss=0.08576, over 4286280.91 frames. ], batch size: 143, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:07:38,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=960558.0, ans=0.0 2023-06-21 12:07:56,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=960618.0, ans=0.2 2023-06-21 12:08:34,770 INFO [train.py:996] (0/4) Epoch 6, batch 7650, loss[loss=0.2562, simple_loss=0.3236, pruned_loss=0.09442, over 21882.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3172, pruned_loss=0.08691, over 4283081.49 frames. ], batch size: 118, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:08:45,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=960738.0, ans=0.0 2023-06-21 12:08:58,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=960798.0, ans=0.2 2023-06-21 12:09:01,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 3.034e+02 3.412e+02 4.046e+02 6.566e+02, threshold=6.823e+02, percent-clipped=0.0 2023-06-21 12:09:14,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=960858.0, ans=0.1 2023-06-21 12:10:11,149 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:10:17,243 INFO [train.py:996] (0/4) Epoch 6, batch 7700, loss[loss=0.3054, simple_loss=0.3458, pruned_loss=0.1325, over 21817.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3204, pruned_loss=0.09024, over 4289119.91 frames. ], batch size: 441, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:10:30,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.14 vs. 
limit=22.5 2023-06-21 12:10:32,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=961098.0, ans=0.125 2023-06-21 12:10:48,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=961098.0, ans=0.125 2023-06-21 12:11:04,851 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2023-06-21 12:11:06,316 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0 2023-06-21 12:11:20,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=961158.0, ans=0.125 2023-06-21 12:11:33,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=961218.0, ans=0.125 2023-06-21 12:11:58,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=961338.0, ans=0.2 2023-06-21 12:11:59,181 INFO [train.py:996] (0/4) Epoch 6, batch 7750, loss[loss=0.3072, simple_loss=0.3963, pruned_loss=0.109, over 21748.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3277, pruned_loss=0.09087, over 4288290.91 frames. ], batch size: 351, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:12:06,578 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=22.5 2023-06-21 12:12:12,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=961338.0, ans=0.0 2023-06-21 12:12:34,950 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 3.114e+02 3.576e+02 4.204e+02 7.368e+02, threshold=7.152e+02, percent-clipped=1.0 2023-06-21 12:12:54,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=961458.0, ans=0.2 2023-06-21 12:12:54,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.78 vs. limit=10.0 2023-06-21 12:13:02,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=961458.0, ans=0.0 2023-06-21 12:13:39,901 INFO [train.py:996] (0/4) Epoch 6, batch 7800, loss[loss=0.2473, simple_loss=0.3005, pruned_loss=0.09703, over 21570.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3296, pruned_loss=0.09128, over 4289802.89 frames. ], batch size: 230, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:14:10,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=961698.0, ans=0.0 2023-06-21 12:14:35,685 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. 
limit=15.0 2023-06-21 12:14:41,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=961758.0, ans=0.0 2023-06-21 12:14:42,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=961758.0, ans=0.125 2023-06-21 12:15:18,201 INFO [train.py:996] (0/4) Epoch 6, batch 7850, loss[loss=0.2263, simple_loss=0.2784, pruned_loss=0.08713, over 20317.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3212, pruned_loss=0.09009, over 4269140.16 frames. ], batch size: 703, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:15:49,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=961998.0, ans=0.125 2023-06-21 12:16:02,018 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.425e+02 2.926e+02 3.453e+02 4.214e+02 9.317e+02, threshold=6.905e+02, percent-clipped=1.0 2023-06-21 12:16:44,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=962178.0, ans=0.125 2023-06-21 12:16:54,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0 2023-06-21 12:17:01,340 INFO [train.py:996] (0/4) Epoch 6, batch 7900, loss[loss=0.134, simple_loss=0.1757, pruned_loss=0.04611, over 16214.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3166, pruned_loss=0.08995, over 4255170.75 frames. ], batch size: 61, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:17:01,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=962238.0, ans=0.0 2023-06-21 12:17:04,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.61 vs. limit=10.0 2023-06-21 12:18:10,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=962418.0, ans=0.2 2023-06-21 12:18:37,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-21 12:18:51,645 INFO [train.py:996] (0/4) Epoch 6, batch 7950, loss[loss=0.2052, simple_loss=0.2762, pruned_loss=0.06712, over 20773.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3209, pruned_loss=0.08854, over 4251787.08 frames. 
], batch size: 609, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:19:18,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=962598.0, ans=10.0 2023-06-21 12:19:20,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=962598.0, ans=0.125 2023-06-21 12:19:26,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=962598.0, ans=0.0 2023-06-21 12:19:29,547 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.499e+02 3.498e+02 4.372e+02 5.089e+02 1.068e+03, threshold=8.743e+02, percent-clipped=8.0 2023-06-21 12:19:44,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=962658.0, ans=0.07 2023-06-21 12:20:44,524 INFO [train.py:996] (0/4) Epoch 6, batch 8000, loss[loss=0.295, simple_loss=0.3998, pruned_loss=0.09508, over 20769.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3267, pruned_loss=0.0902, over 4253849.04 frames. ], batch size: 607, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:20:53,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=962838.0, ans=0.0 2023-06-21 12:21:10,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=962898.0, ans=0.0 2023-06-21 12:22:00,570 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=12.0 2023-06-21 12:22:13,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=22.5 2023-06-21 12:22:19,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-21 12:22:28,321 INFO [train.py:996] (0/4) Epoch 6, batch 8050, loss[loss=0.2953, simple_loss=0.3821, pruned_loss=0.1042, over 21613.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3324, pruned_loss=0.09173, over 4251239.62 frames. ], batch size: 441, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:22:34,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5 2023-06-21 12:22:55,408 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.280e+02 2.992e+02 3.416e+02 4.104e+02 8.130e+02, threshold=6.832e+02, percent-clipped=0.0 2023-06-21 12:23:37,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=963318.0, ans=0.0 2023-06-21 12:24:07,626 INFO [train.py:996] (0/4) Epoch 6, batch 8100, loss[loss=0.2578, simple_loss=0.3228, pruned_loss=0.09641, over 21015.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3298, pruned_loss=0.09218, over 4260667.23 frames. ], batch size: 608, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:25:23,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=963618.0, ans=0.125 2023-06-21 12:25:50,226 INFO [train.py:996] (0/4) Epoch 6, batch 8150, loss[loss=0.256, simple_loss=0.3627, pruned_loss=0.07464, over 21779.00 frames. 
], tot_loss[loss=0.2618, simple_loss=0.337, pruned_loss=0.09324, over 4261007.69 frames. ], batch size: 352, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:26:36,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.091e+02 3.490e+02 4.370e+02 7.436e+02, threshold=6.980e+02, percent-clipped=1.0 2023-06-21 12:27:23,663 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-21 12:27:25,162 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.62 vs. limit=12.0 2023-06-21 12:27:27,558 INFO [train.py:996] (0/4) Epoch 6, batch 8200, loss[loss=0.2664, simple_loss=0.3128, pruned_loss=0.1099, over 21415.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3278, pruned_loss=0.0905, over 4256106.80 frames. ], batch size: 509, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:27:28,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=964038.0, ans=0.1 2023-06-21 12:29:07,657 INFO [train.py:996] (0/4) Epoch 6, batch 8250, loss[loss=0.2235, simple_loss=0.3045, pruned_loss=0.07124, over 21431.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3244, pruned_loss=0.09046, over 4251519.06 frames. ], batch size: 131, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:29:56,753 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.891e+02 3.432e+02 4.145e+02 7.025e+02, threshold=6.865e+02, percent-clipped=1.0 2023-06-21 12:30:46,922 INFO [train.py:996] (0/4) Epoch 6, batch 8300, loss[loss=0.2376, simple_loss=0.3116, pruned_loss=0.08184, over 21245.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3226, pruned_loss=0.08722, over 4251054.89 frames. ], batch size: 176, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:31:40,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=964758.0, ans=0.0 2023-06-21 12:32:15,310 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=12.0 2023-06-21 12:32:33,476 INFO [train.py:996] (0/4) Epoch 6, batch 8350, loss[loss=0.2293, simple_loss=0.3045, pruned_loss=0.07707, over 21784.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3212, pruned_loss=0.08429, over 4250627.78 frames. ], batch size: 372, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:33:17,191 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.748e+02 3.092e+02 3.699e+02 5.409e+02, threshold=6.184e+02, percent-clipped=0.0 2023-06-21 12:33:27,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=965058.0, ans=0.125 2023-06-21 12:33:31,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=965058.0, ans=0.125 2023-06-21 12:34:14,505 INFO [train.py:996] (0/4) Epoch 6, batch 8400, loss[loss=0.2296, simple_loss=0.2763, pruned_loss=0.09145, over 20004.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3179, pruned_loss=0.08234, over 4241487.16 frames. 
], batch size: 703, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:34:34,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=965238.0, ans=0.125 2023-06-21 12:34:55,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=965358.0, ans=0.125 2023-06-21 12:35:42,903 INFO [train.py:996] (0/4) Epoch 6, batch 8450, loss[loss=0.2752, simple_loss=0.3328, pruned_loss=0.1088, over 21235.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3165, pruned_loss=0.08294, over 4254255.74 frames. ], batch size: 143, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:36:18,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=965598.0, ans=0.125 2023-06-21 12:36:27,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=965598.0, ans=0.125 2023-06-21 12:36:28,779 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.529e+02 3.064e+02 3.775e+02 6.261e+02, threshold=6.127e+02, percent-clipped=1.0 2023-06-21 12:36:29,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=965658.0, ans=0.0 2023-06-21 12:36:41,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=965658.0, ans=0.1 2023-06-21 12:36:55,976 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:37:09,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=965778.0, ans=0.0 2023-06-21 12:37:17,973 INFO [train.py:996] (0/4) Epoch 6, batch 8500, loss[loss=0.246, simple_loss=0.2967, pruned_loss=0.09761, over 21209.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.313, pruned_loss=0.0851, over 4253785.39 frames. ], batch size: 159, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:37:28,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=965838.0, ans=0.2 2023-06-21 12:37:36,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0 2023-06-21 12:37:47,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=965898.0, ans=0.125 2023-06-21 12:38:57,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=966138.0, ans=0.0 2023-06-21 12:38:58,639 INFO [train.py:996] (0/4) Epoch 6, batch 8550, loss[loss=0.2473, simple_loss=0.3318, pruned_loss=0.08138, over 21616.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3174, pruned_loss=0.08779, over 4256289.31 frames. 
], batch size: 263, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:39:47,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.008e+02 3.313e+02 4.045e+02 7.159e+02, threshold=6.625e+02, percent-clipped=3.0 2023-06-21 12:39:49,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=966258.0, ans=0.125 2023-06-21 12:40:35,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=966378.0, ans=0.0 2023-06-21 12:41:11,026 INFO [train.py:996] (0/4) Epoch 6, batch 8600, loss[loss=0.1917, simple_loss=0.2377, pruned_loss=0.07286, over 20018.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.326, pruned_loss=0.09037, over 4263817.97 frames. ], batch size: 704, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:41:20,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=966438.0, ans=0.1 2023-06-21 12:42:55,506 INFO [train.py:996] (0/4) Epoch 6, batch 8650, loss[loss=0.2441, simple_loss=0.3199, pruned_loss=0.08415, over 20812.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.329, pruned_loss=0.0912, over 4258792.01 frames. ], batch size: 607, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:43:23,705 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.940e+02 3.541e+02 4.015e+02 7.663e+02, threshold=7.081e+02, percent-clipped=3.0 2023-06-21 12:43:44,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=966918.0, ans=0.125 2023-06-21 12:44:29,629 INFO [train.py:996] (0/4) Epoch 6, batch 8700, loss[loss=0.2456, simple_loss=0.3043, pruned_loss=0.0935, over 21793.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3198, pruned_loss=0.08633, over 4253489.55 frames. ], batch size: 98, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:44:33,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-21 12:45:10,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=967158.0, ans=0.125 2023-06-21 12:45:33,641 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:45:36,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=967278.0, ans=0.0 2023-06-21 12:46:04,011 INFO [train.py:996] (0/4) Epoch 6, batch 8750, loss[loss=0.2535, simple_loss=0.309, pruned_loss=0.09904, over 21472.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3168, pruned_loss=0.08679, over 4256975.05 frames. ], batch size: 194, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:46:14,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=967338.0, ans=0.125 2023-06-21 12:46:34,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.061e+02 3.811e+02 4.792e+02 9.884e+02, threshold=7.621e+02, percent-clipped=4.0 2023-06-21 12:46:40,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.06 vs. 
limit=10.0 2023-06-21 12:46:58,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=967518.0, ans=0.125 2023-06-21 12:47:41,882 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:47:42,825 INFO [train.py:996] (0/4) Epoch 6, batch 8800, loss[loss=0.3009, simple_loss=0.3767, pruned_loss=0.1125, over 21847.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.326, pruned_loss=0.08987, over 4260023.92 frames. ], batch size: 118, lr: 5.20e-03, grad_scale: 32.0 2023-06-21 12:47:59,430 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-21 12:48:07,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-21 12:49:18,430 INFO [train.py:996] (0/4) Epoch 6, batch 8850, loss[loss=0.3099, simple_loss=0.3814, pruned_loss=0.1192, over 21388.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3347, pruned_loss=0.09222, over 4256028.23 frames. ], batch size: 131, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:49:48,833 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.854e+02 3.378e+02 4.143e+02 7.151e+02, threshold=6.757e+02, percent-clipped=0.0 2023-06-21 12:50:54,528 INFO [train.py:996] (0/4) Epoch 6, batch 8900, loss[loss=0.2358, simple_loss=0.3046, pruned_loss=0.08351, over 21795.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3287, pruned_loss=0.09085, over 4250766.06 frames. ], batch size: 102, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:50:57,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2023-06-21 12:51:48,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=968358.0, ans=0.0 2023-06-21 12:52:10,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=968418.0, ans=0.0 2023-06-21 12:52:19,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=968478.0, ans=0.1 2023-06-21 12:52:27,953 INFO [train.py:996] (0/4) Epoch 6, batch 8950, loss[loss=0.2326, simple_loss=0.2849, pruned_loss=0.09016, over 21212.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3284, pruned_loss=0.09029, over 4255764.80 frames. 
], batch size: 176, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:53:12,535 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.104e+02 3.637e+02 4.159e+02 7.258e+02, threshold=7.275e+02, percent-clipped=2.0 2023-06-21 12:53:22,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=968658.0, ans=0.125 2023-06-21 12:53:22,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=968658.0, ans=0.125 2023-06-21 12:53:23,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=968658.0, ans=0.1 2023-06-21 12:53:37,018 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.34 vs. limit=6.0 2023-06-21 12:54:02,956 INFO [train.py:996] (0/4) Epoch 6, batch 9000, loss[loss=0.2058, simple_loss=0.2663, pruned_loss=0.07264, over 21815.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3214, pruned_loss=0.08975, over 4261896.27 frames. ], batch size: 124, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:54:02,957 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 12:54:22,933 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.3184, 4.3598, 4.0181, 4.0611], device='cuda:0') 2023-06-21 12:54:25,117 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2624, simple_loss=0.3599, pruned_loss=0.08239, over 1796401.00 frames. 2023-06-21 12:54:25,118 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 12:56:01,358 INFO [train.py:996] (0/4) Epoch 6, batch 9050, loss[loss=0.3012, simple_loss=0.3658, pruned_loss=0.1183, over 21754.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3175, pruned_loss=0.08681, over 4261206.90 frames. ], batch size: 124, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:56:25,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=969198.0, ans=0.0 2023-06-21 12:56:33,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=969198.0, ans=0.125 2023-06-21 12:56:41,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.936e+02 3.440e+02 3.853e+02 8.730e+02, threshold=6.881e+02, percent-clipped=1.0 2023-06-21 12:57:43,025 INFO [train.py:996] (0/4) Epoch 6, batch 9100, loss[loss=0.2341, simple_loss=0.32, pruned_loss=0.07413, over 21309.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3255, pruned_loss=0.09046, over 4257433.79 frames. ], batch size: 176, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 12:57:54,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=969438.0, ans=0.07 2023-06-21 12:58:29,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=969558.0, ans=0.125 2023-06-21 12:58:31,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=969558.0, ans=0.2 2023-06-21 12:59:27,372 INFO [train.py:996] (0/4) Epoch 6, batch 9150, loss[loss=0.2216, simple_loss=0.3098, pruned_loss=0.06667, over 21450.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3258, pruned_loss=0.0864, over 4269365.16 frames. 
], batch size: 211, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 12:59:57,820 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.788e+02 3.229e+02 4.446e+02 7.555e+02, threshold=6.457e+02, percent-clipped=3.0 2023-06-21 13:01:00,375 INFO [train.py:996] (0/4) Epoch 6, batch 9200, loss[loss=0.2891, simple_loss=0.3602, pruned_loss=0.1091, over 21814.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3288, pruned_loss=0.08613, over 4277999.04 frames. ], batch size: 124, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:01:04,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=970038.0, ans=0.125 2023-06-21 13:01:13,590 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:01:19,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=970098.0, ans=0.125 2023-06-21 13:01:19,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=970098.0, ans=0.125 2023-06-21 13:01:25,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-21 13:01:25,227 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-21 13:01:27,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=970098.0, ans=0.125 2023-06-21 13:01:45,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=970158.0, ans=0.125 2023-06-21 13:01:53,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-21 13:02:36,685 INFO [train.py:996] (0/4) Epoch 6, batch 9250, loss[loss=0.289, simple_loss=0.3374, pruned_loss=0.1203, over 21451.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3325, pruned_loss=0.09012, over 4280502.27 frames. ], batch size: 389, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:02:43,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=970338.0, ans=0.1 2023-06-21 13:03:07,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.070e+02 3.502e+02 4.094e+02 6.605e+02, threshold=7.004e+02, percent-clipped=1.0 2023-06-21 13:04:13,726 INFO [train.py:996] (0/4) Epoch 6, batch 9300, loss[loss=0.2262, simple_loss=0.2883, pruned_loss=0.08206, over 21767.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3258, pruned_loss=0.08914, over 4275262.23 frames. 
], batch size: 124, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:04:52,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=970758.0, ans=0.09899494936611666 2023-06-21 13:05:26,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=970818.0, ans=0.2 2023-06-21 13:05:36,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=970878.0, ans=0.125 2023-06-21 13:05:37,533 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-21 13:05:50,443 INFO [train.py:996] (0/4) Epoch 6, batch 9350, loss[loss=0.2974, simple_loss=0.3683, pruned_loss=0.1132, over 21805.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3314, pruned_loss=0.09022, over 4276291.57 frames. ], batch size: 118, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:05:52,617 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:05:59,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=12.0 2023-06-21 13:06:31,420 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.002e+02 3.519e+02 4.065e+02 7.578e+02, threshold=7.038e+02, percent-clipped=1.0 2023-06-21 13:07:25,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=971238.0, ans=0.0 2023-06-21 13:07:26,220 INFO [train.py:996] (0/4) Epoch 6, batch 9400, loss[loss=0.3017, simple_loss=0.343, pruned_loss=0.1302, over 21532.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3341, pruned_loss=0.09065, over 4272052.40 frames. ], batch size: 441, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:07:33,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=971238.0, ans=0.125 2023-06-21 13:08:19,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=971358.0, ans=0.125 2023-06-21 13:08:29,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=971418.0, ans=0.0 2023-06-21 13:08:33,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=22.5 2023-06-21 13:08:56,690 INFO [train.py:996] (0/4) Epoch 6, batch 9450, loss[loss=0.2073, simple_loss=0.2668, pruned_loss=0.07383, over 21550.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3256, pruned_loss=0.0893, over 4267258.06 frames. ], batch size: 195, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:09:41,567 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.152e+02 3.709e+02 4.839e+02 7.749e+02, threshold=7.417e+02, percent-clipped=1.0 2023-06-21 13:09:56,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=971658.0, ans=0.125 2023-06-21 13:10:32,104 INFO [train.py:996] (0/4) Epoch 6, batch 9500, loss[loss=0.1963, simple_loss=0.283, pruned_loss=0.0548, over 21707.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3169, pruned_loss=0.08652, over 4264967.69 frames. 
], batch size: 332, lr: 5.19e-03, grad_scale: 8.0 2023-06-21 13:10:54,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=971898.0, ans=0.0 2023-06-21 13:11:31,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=971958.0, ans=0.2 2023-06-21 13:12:00,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=972078.0, ans=0.0 2023-06-21 13:12:05,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=972078.0, ans=0.04949747468305833 2023-06-21 13:12:07,954 INFO [train.py:996] (0/4) Epoch 6, batch 9550, loss[loss=0.244, simple_loss=0.327, pruned_loss=0.08053, over 16367.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3218, pruned_loss=0.08916, over 4261171.29 frames. ], batch size: 60, lr: 5.19e-03, grad_scale: 8.0 2023-06-21 13:13:00,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.847e+02 3.325e+02 4.189e+02 8.114e+02, threshold=6.651e+02, percent-clipped=1.0 2023-06-21 13:13:12,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=972318.0, ans=0.125 2023-06-21 13:13:43,376 INFO [train.py:996] (0/4) Epoch 6, batch 9600, loss[loss=0.2377, simple_loss=0.315, pruned_loss=0.08017, over 21855.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3247, pruned_loss=0.08967, over 4268799.82 frames. ], batch size: 351, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 13:13:47,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=972438.0, ans=0.0 2023-06-21 13:13:47,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=972438.0, ans=0.2 2023-06-21 13:14:20,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=972498.0, ans=0.125 2023-06-21 13:15:25,387 INFO [train.py:996] (0/4) Epoch 6, batch 9650, loss[loss=0.2578, simple_loss=0.3322, pruned_loss=0.09171, over 21631.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3263, pruned_loss=0.09077, over 4267976.86 frames. ], batch size: 389, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 13:15:27,610 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-21 13:15:42,781 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.43 vs. limit=12.0 2023-06-21 13:16:12,616 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.288e+02 2.974e+02 3.476e+02 4.202e+02 8.291e+02, threshold=6.952e+02, percent-clipped=2.0 2023-06-21 13:16:13,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=972858.0, ans=0.1 2023-06-21 13:16:39,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-21 13:16:44,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.59 vs. 
limit=22.5 2023-06-21 13:17:06,083 INFO [train.py:996] (0/4) Epoch 6, batch 9700, loss[loss=0.2516, simple_loss=0.3107, pruned_loss=0.09627, over 21831.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3275, pruned_loss=0.09068, over 4253848.00 frames. ], batch size: 112, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:17:16,174 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.86 vs. limit=6.0 2023-06-21 13:17:30,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=973098.0, ans=0.125 2023-06-21 13:17:38,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=973098.0, ans=0.125 2023-06-21 13:17:51,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=973158.0, ans=0.125 2023-06-21 13:17:53,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=973158.0, ans=0.1 2023-06-21 13:17:54,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=973158.0, ans=0.125 2023-06-21 13:17:56,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=973158.0, ans=0.2 2023-06-21 13:18:35,249 INFO [train.py:996] (0/4) Epoch 6, batch 9750, loss[loss=0.2117, simple_loss=0.2745, pruned_loss=0.07441, over 21635.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3202, pruned_loss=0.08915, over 4256498.65 frames. ], batch size: 298, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:18:35,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=973338.0, ans=0.125 2023-06-21 13:18:36,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-06-21 13:18:40,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=973338.0, ans=0.125 2023-06-21 13:19:14,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973398.0, ans=0.1 2023-06-21 13:19:23,214 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.880e+02 3.330e+02 4.100e+02 8.108e+02, threshold=6.660e+02, percent-clipped=1.0 2023-06-21 13:19:31,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=973458.0, ans=0.2 2023-06-21 13:19:35,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=973518.0, ans=0.125 2023-06-21 13:19:42,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=22.5 2023-06-21 13:19:56,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=973578.0, ans=0.1 2023-06-21 13:20:03,198 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. 
limit=6.0 2023-06-21 13:20:08,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=973638.0, ans=0.125 2023-06-21 13:20:08,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=973638.0, ans=0.125 2023-06-21 13:20:09,985 INFO [train.py:996] (0/4) Epoch 6, batch 9800, loss[loss=0.3043, simple_loss=0.3524, pruned_loss=0.1281, over 21594.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3212, pruned_loss=0.08965, over 4253648.28 frames. ], batch size: 471, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:20:29,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=973698.0, ans=0.125 2023-06-21 13:21:04,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973758.0, ans=0.1 2023-06-21 13:21:05,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=973758.0, ans=0.1 2023-06-21 13:21:07,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=973758.0, ans=0.2 2023-06-21 13:21:27,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=973878.0, ans=0.125 2023-06-21 13:21:30,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-21 13:21:40,054 INFO [train.py:996] (0/4) Epoch 6, batch 9850, loss[loss=0.2068, simple_loss=0.2646, pruned_loss=0.07456, over 21224.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3181, pruned_loss=0.09038, over 4262022.41 frames. ], batch size: 176, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:22:02,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=973938.0, ans=0.0 2023-06-21 13:22:32,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.858e+02 3.118e+02 3.826e+02 5.863e+02, threshold=6.237e+02, percent-clipped=0.0 2023-06-21 13:22:37,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=974058.0, ans=0.125 2023-06-21 13:23:15,717 INFO [train.py:996] (0/4) Epoch 6, batch 9900, loss[loss=0.2639, simple_loss=0.332, pruned_loss=0.09796, over 21576.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3144, pruned_loss=0.08991, over 4259411.64 frames. ], batch size: 414, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:24:00,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.80 vs. 
limit=22.5 2023-06-21 13:24:07,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=974358.0, ans=0.125 2023-06-21 13:24:08,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=974358.0, ans=0.0 2023-06-21 13:24:10,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=974358.0, ans=0.125 2023-06-21 13:24:56,249 INFO [train.py:996] (0/4) Epoch 6, batch 9950, loss[loss=0.2881, simple_loss=0.3228, pruned_loss=0.1267, over 21404.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3171, pruned_loss=0.09248, over 4266490.41 frames. ], batch size: 510, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:25:00,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=974538.0, ans=0.125 2023-06-21 13:25:31,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=974598.0, ans=0.02 2023-06-21 13:25:39,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 2.967e+02 3.408e+02 4.209e+02 6.972e+02, threshold=6.817e+02, percent-clipped=1.0 2023-06-21 13:25:51,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=974718.0, ans=0.0 2023-06-21 13:26:09,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-21 13:26:28,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=974778.0, ans=0.1 2023-06-21 13:26:32,420 INFO [train.py:996] (0/4) Epoch 6, batch 10000, loss[loss=0.257, simple_loss=0.313, pruned_loss=0.1005, over 21454.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3141, pruned_loss=0.09092, over 4253249.86 frames. ], batch size: 509, lr: 5.18e-03, grad_scale: 32.0 2023-06-21 13:27:19,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=974958.0, ans=0.125 2023-06-21 13:27:38,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=975018.0, ans=0.0 2023-06-21 13:27:58,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=975078.0, ans=0.125 2023-06-21 13:28:07,405 INFO [train.py:996] (0/4) Epoch 6, batch 10050, loss[loss=0.2356, simple_loss=0.304, pruned_loss=0.08358, over 21420.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.317, pruned_loss=0.09177, over 4258735.89 frames. ], batch size: 131, lr: 5.18e-03, grad_scale: 32.0 2023-06-21 13:28:07,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=975138.0, ans=0.95 2023-06-21 13:28:12,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=975138.0, ans=0.125 2023-06-21 13:28:26,287 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.02 vs. 
limit=15.0 2023-06-21 13:28:31,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=975198.0, ans=0.05 2023-06-21 13:28:51,177 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.710e+02 3.231e+02 4.212e+02 7.416e+02, threshold=6.463e+02, percent-clipped=2.0 2023-06-21 13:29:07,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=975318.0, ans=0.04949747468305833 2023-06-21 13:29:17,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=975318.0, ans=0.125 2023-06-21 13:29:34,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=975378.0, ans=0.125 2023-06-21 13:29:35,394 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0 2023-06-21 13:29:53,557 INFO [train.py:996] (0/4) Epoch 6, batch 10100, loss[loss=0.2737, simple_loss=0.3886, pruned_loss=0.07941, over 19853.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3141, pruned_loss=0.08858, over 4265397.10 frames. ], batch size: 702, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:30:26,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=975558.0, ans=0.125 2023-06-21 13:30:42,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.19 vs. limit=6.0 2023-06-21 13:30:59,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-06-21 13:31:25,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=975678.0, ans=0.2 2023-06-21 13:31:29,499 INFO [train.py:996] (0/4) Epoch 6, batch 10150, loss[loss=0.2669, simple_loss=0.34, pruned_loss=0.09695, over 21885.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3194, pruned_loss=0.09062, over 4265006.73 frames. ], batch size: 371, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:32:04,267 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.136e+02 3.616e+02 4.302e+02 7.230e+02, threshold=7.231e+02, percent-clipped=1.0 2023-06-21 13:32:22,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=975858.0, ans=10.0 2023-06-21 13:33:04,742 INFO [train.py:996] (0/4) Epoch 6, batch 10200, loss[loss=0.1989, simple_loss=0.2828, pruned_loss=0.05752, over 21605.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3189, pruned_loss=0.08827, over 4269545.90 frames. ], batch size: 263, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:33:22,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=976098.0, ans=0.125 2023-06-21 13:33:25,767 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.92 vs. 
limit=12.0 2023-06-21 13:33:34,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=976158.0, ans=0.2 2023-06-21 13:34:12,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=976218.0, ans=0.125 2023-06-21 13:34:40,829 INFO [train.py:996] (0/4) Epoch 6, batch 10250, loss[loss=0.197, simple_loss=0.2914, pruned_loss=0.05125, over 21791.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3113, pruned_loss=0.08155, over 4269912.13 frames. ], batch size: 333, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:34:51,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=976338.0, ans=0.2 2023-06-21 13:34:56,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=976398.0, ans=0.125 2023-06-21 13:35:01,239 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-21 13:35:21,115 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 2.417e+02 2.778e+02 3.535e+02 6.658e+02, threshold=5.557e+02, percent-clipped=0.0 2023-06-21 13:35:39,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=976518.0, ans=10.0 2023-06-21 13:36:18,501 INFO [train.py:996] (0/4) Epoch 6, batch 10300, loss[loss=0.1862, simple_loss=0.2668, pruned_loss=0.05278, over 21864.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3134, pruned_loss=0.08289, over 4275502.68 frames. ], batch size: 107, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:36:32,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-21 13:37:04,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. limit=6.0 2023-06-21 13:37:05,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=976758.0, ans=0.125 2023-06-21 13:37:37,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=976878.0, ans=0.125 2023-06-21 13:37:51,342 INFO [train.py:996] (0/4) Epoch 6, batch 10350, loss[loss=0.2067, simple_loss=0.2739, pruned_loss=0.06974, over 21664.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3161, pruned_loss=0.08395, over 4276944.30 frames. ], batch size: 247, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:37:53,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=976938.0, ans=0.5 2023-06-21 13:38:26,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=976998.0, ans=0.0 2023-06-21 13:38:27,143 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.91 vs. 
limit=12.0 2023-06-21 13:38:41,221 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.013e+02 3.424e+02 4.062e+02 6.181e+02, threshold=6.848e+02, percent-clipped=5.0 2023-06-21 13:39:02,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=977118.0, ans=0.1 2023-06-21 13:39:03,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=977118.0, ans=0.1 2023-06-21 13:39:09,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=977118.0, ans=0.125 2023-06-21 13:39:27,356 INFO [train.py:996] (0/4) Epoch 6, batch 10400, loss[loss=0.3023, simple_loss=0.361, pruned_loss=0.1218, over 21529.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3099, pruned_loss=0.08199, over 4261347.89 frames. ], batch size: 509, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:40:43,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=977418.0, ans=0.125 2023-06-21 13:40:51,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=977478.0, ans=0.125 2023-06-21 13:41:09,994 INFO [train.py:996] (0/4) Epoch 6, batch 10450, loss[loss=0.2474, simple_loss=0.3225, pruned_loss=0.08617, over 20660.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3149, pruned_loss=0.08586, over 4271452.83 frames. ], batch size: 607, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:41:13,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=977538.0, ans=0.0 2023-06-21 13:41:29,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=977598.0, ans=0.2 2023-06-21 13:42:01,150 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.285e+02 3.743e+02 4.607e+02 9.328e+02, threshold=7.486e+02, percent-clipped=7.0 2023-06-21 13:42:25,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=22.5 2023-06-21 13:42:52,013 INFO [train.py:996] (0/4) Epoch 6, batch 10500, loss[loss=0.2475, simple_loss=0.3107, pruned_loss=0.09218, over 15760.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3145, pruned_loss=0.08503, over 4270063.16 frames. ], batch size: 60, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:43:28,649 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-21 13:43:36,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.90 vs. limit=15.0 2023-06-21 13:43:46,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-21 13:44:03,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=978078.0, ans=0.125 2023-06-21 13:44:13,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.37 vs. 
limit=15.0 2023-06-21 13:44:27,345 INFO [train.py:996] (0/4) Epoch 6, batch 10550, loss[loss=0.2167, simple_loss=0.2715, pruned_loss=0.08093, over 21652.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3079, pruned_loss=0.08447, over 4266983.03 frames. ], batch size: 264, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:44:45,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=15.0 2023-06-21 13:45:12,163 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.729e+02 3.054e+02 3.524e+02 6.998e+02, threshold=6.108e+02, percent-clipped=0.0 2023-06-21 13:45:14,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=978258.0, ans=0.0 2023-06-21 13:45:17,930 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=12.0 2023-06-21 13:45:26,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=978318.0, ans=0.125 2023-06-21 13:46:03,808 INFO [train.py:996] (0/4) Epoch 6, batch 10600, loss[loss=0.2026, simple_loss=0.2742, pruned_loss=0.06556, over 21256.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3045, pruned_loss=0.08277, over 4270809.99 frames. ], batch size: 176, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:46:23,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=978498.0, ans=0.125 2023-06-21 13:46:51,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.35 vs. limit=10.0 2023-06-21 13:47:05,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=978618.0, ans=0.125 2023-06-21 13:47:05,661 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-21 13:47:18,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=978678.0, ans=0.125 2023-06-21 13:47:44,898 INFO [train.py:996] (0/4) Epoch 6, batch 10650, loss[loss=0.168, simple_loss=0.2503, pruned_loss=0.04283, over 21612.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3056, pruned_loss=0.08049, over 4263993.47 frames. ], batch size: 247, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:47:51,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=978738.0, ans=0.1 2023-06-21 13:48:22,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=978858.0, ans=0.0 2023-06-21 13:48:26,912 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 2.894e+02 3.773e+02 4.928e+02 8.046e+02, threshold=7.546e+02, percent-clipped=12.0 2023-06-21 13:48:35,359 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.84 vs. 
limit=15.0 2023-06-21 13:48:47,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=978918.0, ans=0.125 2023-06-21 13:48:49,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=978918.0, ans=0.125 2023-06-21 13:49:22,177 INFO [train.py:996] (0/4) Epoch 6, batch 10700, loss[loss=0.2936, simple_loss=0.352, pruned_loss=0.1176, over 21197.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3074, pruned_loss=0.0816, over 4252569.28 frames. ], batch size: 143, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:49:50,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=979098.0, ans=0.125 2023-06-21 13:50:54,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-21 13:50:56,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0 2023-06-21 13:50:57,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=979278.0, ans=0.125 2023-06-21 13:51:04,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=979338.0, ans=0.125 2023-06-21 13:51:05,749 INFO [train.py:996] (0/4) Epoch 6, batch 10750, loss[loss=0.2385, simple_loss=0.3152, pruned_loss=0.08093, over 21373.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3182, pruned_loss=0.08598, over 4258969.86 frames. ], batch size: 176, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:51:38,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=979458.0, ans=0.125 2023-06-21 13:51:42,219 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 3.198e+02 3.607e+02 4.478e+02 7.932e+02, threshold=7.214e+02, percent-clipped=1.0 2023-06-21 13:51:50,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=979458.0, ans=0.125 2023-06-21 13:51:50,716 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:51:59,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-21 13:52:17,067 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-21 13:52:43,763 INFO [train.py:996] (0/4) Epoch 6, batch 10800, loss[loss=0.2807, simple_loss=0.3471, pruned_loss=0.1072, over 21353.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3225, pruned_loss=0.08679, over 4262237.87 frames. 
], batch size: 176, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:53:19,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=979758.0, ans=0.0 2023-06-21 13:53:23,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=979758.0, ans=0.1 2023-06-21 13:53:45,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2023-06-21 13:53:47,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=979818.0, ans=0.04949747468305833 2023-06-21 13:54:06,942 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=22.5 2023-06-21 13:54:13,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=979938.0, ans=0.0 2023-06-21 13:54:15,006 INFO [train.py:996] (0/4) Epoch 6, batch 10850, loss[loss=0.2336, simple_loss=0.3033, pruned_loss=0.08198, over 21788.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.323, pruned_loss=0.08693, over 4260086.58 frames. ], batch size: 102, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:54:46,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=979998.0, ans=0.1 2023-06-21 13:54:49,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=979998.0, ans=0.1 2023-06-21 13:55:06,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 2.788e+02 3.255e+02 3.917e+02 5.822e+02, threshold=6.509e+02, percent-clipped=0.0 2023-06-21 13:55:29,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=980118.0, ans=0.0 2023-06-21 13:55:33,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=980118.0, ans=0.125 2023-06-21 13:55:51,275 INFO [train.py:996] (0/4) Epoch 6, batch 10900, loss[loss=0.2193, simple_loss=0.3039, pruned_loss=0.06733, over 21445.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3169, pruned_loss=0.08518, over 4261494.41 frames. ], batch size: 194, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:56:17,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=980298.0, ans=0.0 2023-06-21 13:56:27,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=980298.0, ans=0.125 2023-06-21 13:56:43,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=980358.0, ans=0.125 2023-06-21 13:56:43,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=980358.0, ans=0.125 2023-06-21 13:57:25,451 INFO [train.py:996] (0/4) Epoch 6, batch 10950, loss[loss=0.2274, simple_loss=0.288, pruned_loss=0.08341, over 21142.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3101, pruned_loss=0.08305, over 4258203.35 frames. 
], batch size: 143, lr: 5.16e-03, grad_scale: 32.0 2023-06-21 13:58:15,828 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.774e+02 3.261e+02 3.678e+02 5.101e+02, threshold=6.522e+02, percent-clipped=0.0 2023-06-21 13:58:22,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=980658.0, ans=0.1 2023-06-21 13:58:55,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=980778.0, ans=0.0 2023-06-21 13:58:59,449 INFO [train.py:996] (0/4) Epoch 6, batch 11000, loss[loss=0.2301, simple_loss=0.3005, pruned_loss=0.07982, over 21593.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3092, pruned_loss=0.08411, over 4258589.97 frames. ], batch size: 212, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 13:59:40,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=980958.0, ans=0.125 2023-06-21 13:59:46,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=980958.0, ans=0.125 2023-06-21 13:59:51,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=980958.0, ans=0.0 2023-06-21 14:00:21,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=981078.0, ans=0.125 2023-06-21 14:00:36,197 INFO [train.py:996] (0/4) Epoch 6, batch 11050, loss[loss=0.2415, simple_loss=0.302, pruned_loss=0.09053, over 21793.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3077, pruned_loss=0.0853, over 4271423.98 frames. ], batch size: 112, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:01:28,784 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.910e+02 3.183e+02 3.721e+02 5.949e+02, threshold=6.365e+02, percent-clipped=0.0 2023-06-21 14:01:31,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.71 vs. limit=15.0 2023-06-21 14:01:48,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=981318.0, ans=0.125 2023-06-21 14:02:10,606 INFO [train.py:996] (0/4) Epoch 6, batch 11100, loss[loss=0.2268, simple_loss=0.2973, pruned_loss=0.07816, over 21412.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3064, pruned_loss=0.08553, over 4276438.21 frames. ], batch size: 211, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:02:58,597 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-06-21 14:03:48,528 INFO [train.py:996] (0/4) Epoch 6, batch 11150, loss[loss=0.1774, simple_loss=0.2398, pruned_loss=0.05745, over 16037.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3039, pruned_loss=0.08477, over 4244703.29 frames. 
], batch size: 61, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:04:41,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.660e+02 3.131e+02 3.688e+02 5.663e+02, threshold=6.262e+02, percent-clipped=0.0 2023-06-21 14:04:59,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=981918.0, ans=0.1 2023-06-21 14:05:24,971 INFO [train.py:996] (0/4) Epoch 6, batch 11200, loss[loss=0.2268, simple_loss=0.2968, pruned_loss=0.07846, over 21746.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3035, pruned_loss=0.08417, over 4254826.75 frames. ], batch size: 351, lr: 5.16e-03, grad_scale: 32.0 2023-06-21 14:05:32,449 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.64 vs. limit=15.0 2023-06-21 14:05:36,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=982038.0, ans=0.125 2023-06-21 14:05:48,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=982098.0, ans=0.1 2023-06-21 14:06:22,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=982218.0, ans=0.0 2023-06-21 14:06:40,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=982278.0, ans=0.0 2023-06-21 14:06:45,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=982278.0, ans=0.125 2023-06-21 14:06:56,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=982338.0, ans=15.0 2023-06-21 14:06:57,272 INFO [train.py:996] (0/4) Epoch 6, batch 11250, loss[loss=0.2629, simple_loss=0.3239, pruned_loss=0.101, over 21571.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3037, pruned_loss=0.08439, over 4252695.37 frames. ], batch size: 508, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:07:46,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.620e+02 2.906e+02 3.338e+02 5.205e+02, threshold=5.813e+02, percent-clipped=0.0 2023-06-21 14:08:02,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=982518.0, ans=0.125 2023-06-21 14:08:04,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=982518.0, ans=0.125 2023-06-21 14:08:07,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=982578.0, ans=0.09899494936611666 2023-06-21 14:08:28,298 INFO [train.py:996] (0/4) Epoch 6, batch 11300, loss[loss=0.1946, simple_loss=0.2636, pruned_loss=0.06283, over 17002.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3052, pruned_loss=0.08452, over 4262260.66 frames. 
], batch size: 63, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:09:25,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=982758.0, ans=0.125 2023-06-21 14:09:31,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=982818.0, ans=0.125 2023-06-21 14:10:03,384 INFO [train.py:996] (0/4) Epoch 6, batch 11350, loss[loss=0.2755, simple_loss=0.335, pruned_loss=0.108, over 21246.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3055, pruned_loss=0.0836, over 4267557.26 frames. ], batch size: 143, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:10:21,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=982938.0, ans=0.0 2023-06-21 14:10:23,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=982998.0, ans=10.0 2023-06-21 14:10:43,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=983058.0, ans=0.0 2023-06-21 14:10:49,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=983058.0, ans=0.1 2023-06-21 14:10:53,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 2.788e+02 3.178e+02 3.739e+02 7.652e+02, threshold=6.355e+02, percent-clipped=2.0 2023-06-21 14:11:03,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=983118.0, ans=0.1 2023-06-21 14:11:18,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=983118.0, ans=0.0 2023-06-21 14:11:27,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=983178.0, ans=0.0 2023-06-21 14:11:35,966 INFO [train.py:996] (0/4) Epoch 6, batch 11400, loss[loss=0.1966, simple_loss=0.2615, pruned_loss=0.06585, over 16128.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3127, pruned_loss=0.0872, over 4266966.23 frames. ], batch size: 60, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:12:02,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=983238.0, ans=0.1 2023-06-21 14:12:10,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=983298.0, ans=0.07 2023-06-21 14:12:28,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=983358.0, ans=0.0 2023-06-21 14:12:36,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=983418.0, ans=0.125 2023-06-21 14:12:53,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=983418.0, ans=0.0 2023-06-21 14:13:18,821 INFO [train.py:996] (0/4) Epoch 6, batch 11450, loss[loss=0.2693, simple_loss=0.36, pruned_loss=0.08926, over 21731.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3153, pruned_loss=0.08642, over 4272534.14 frames. 
], batch size: 415, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:13:36,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=983538.0, ans=0.2 2023-06-21 14:13:40,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=983598.0, ans=0.0 2023-06-21 14:13:43,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=983598.0, ans=0.1 2023-06-21 14:14:03,045 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.362e+02 2.838e+02 3.448e+02 4.254e+02 7.137e+02, threshold=6.896e+02, percent-clipped=4.0 2023-06-21 14:14:13,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=983718.0, ans=0.0 2023-06-21 14:14:33,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=983778.0, ans=0.125 2023-06-21 14:14:55,039 INFO [train.py:996] (0/4) Epoch 6, batch 11500, loss[loss=0.2497, simple_loss=0.3326, pruned_loss=0.08343, over 21615.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3182, pruned_loss=0.088, over 4275682.62 frames. ], batch size: 414, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:15:32,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=983958.0, ans=0.125 2023-06-21 14:15:44,745 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-164000.pt 2023-06-21 14:15:53,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=984018.0, ans=0.125 2023-06-21 14:15:59,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=984018.0, ans=0.0 2023-06-21 14:16:37,032 INFO [train.py:996] (0/4) Epoch 6, batch 11550, loss[loss=0.2674, simple_loss=0.3406, pruned_loss=0.09708, over 21286.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3227, pruned_loss=0.0871, over 4280035.30 frames. ], batch size: 176, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:17:23,350 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.955e+02 3.350e+02 4.139e+02 7.597e+02, threshold=6.701e+02, percent-clipped=2.0 2023-06-21 14:17:37,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=984318.0, ans=0.125 2023-06-21 14:17:49,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=984318.0, ans=0.2 2023-06-21 14:17:58,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=984378.0, ans=0.125 2023-06-21 14:18:09,015 INFO [train.py:996] (0/4) Epoch 6, batch 11600, loss[loss=0.2348, simple_loss=0.3238, pruned_loss=0.07292, over 21530.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3384, pruned_loss=0.09036, over 4276392.18 frames. 
], batch size: 131, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:18:23,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=984498.0, ans=0.125 2023-06-21 14:18:30,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=984498.0, ans=0.0 2023-06-21 14:18:31,922 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:18:38,680 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-21 14:18:44,765 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.95 vs. limit=22.5 2023-06-21 14:19:27,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-21 14:19:45,205 INFO [train.py:996] (0/4) Epoch 6, batch 11650, loss[loss=0.2839, simple_loss=0.3868, pruned_loss=0.09049, over 21886.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3424, pruned_loss=0.09015, over 4274363.62 frames. ], batch size: 317, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:19:53,603 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.51 vs. limit=12.0 2023-06-21 14:20:01,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=984738.0, ans=0.2 2023-06-21 14:20:16,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=984798.0, ans=0.125 2023-06-21 14:20:21,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=15.0 2023-06-21 14:20:29,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 3.000e+02 3.616e+02 4.303e+02 7.688e+02, threshold=7.232e+02, percent-clipped=3.0 2023-06-21 14:20:56,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=984918.0, ans=8.0 2023-06-21 14:21:09,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-06-21 14:21:21,189 INFO [train.py:996] (0/4) Epoch 6, batch 11700, loss[loss=0.2123, simple_loss=0.2737, pruned_loss=0.0754, over 21589.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3345, pruned_loss=0.09064, over 4270859.45 frames. ], batch size: 263, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:21:24,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.22 vs. 
limit=15.0 2023-06-21 14:21:38,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=985038.0, ans=0.125 2023-06-21 14:21:43,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=985098.0, ans=0.125 2023-06-21 14:21:45,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=985098.0, ans=0.125 2023-06-21 14:21:57,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=985158.0, ans=0.025 2023-06-21 14:22:56,824 INFO [train.py:996] (0/4) Epoch 6, batch 11750, loss[loss=0.2339, simple_loss=0.2884, pruned_loss=0.08973, over 21297.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3252, pruned_loss=0.08978, over 4272548.04 frames. ], batch size: 144, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:23:17,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=985398.0, ans=0.2 2023-06-21 14:23:57,410 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.950e+02 3.559e+02 4.361e+02 6.685e+02, threshold=7.118e+02, percent-clipped=0.0 2023-06-21 14:24:02,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=985518.0, ans=0.125 2023-06-21 14:24:33,574 INFO [train.py:996] (0/4) Epoch 6, batch 11800, loss[loss=0.226, simple_loss=0.3264, pruned_loss=0.06279, over 21722.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3262, pruned_loss=0.09169, over 4275578.01 frames. ], batch size: 298, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:24:35,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=985638.0, ans=0.125 2023-06-21 14:25:29,498 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.48 vs. limit=22.5 2023-06-21 14:26:01,277 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:26:14,996 INFO [train.py:996] (0/4) Epoch 6, batch 11850, loss[loss=0.228, simple_loss=0.3387, pruned_loss=0.05863, over 20773.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3285, pruned_loss=0.09122, over 4280906.08 frames. ], batch size: 608, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:26:56,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=986058.0, ans=0.02 2023-06-21 14:27:10,026 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.768e+02 3.132e+02 3.956e+02 6.532e+02, threshold=6.263e+02, percent-clipped=0.0 2023-06-21 14:27:11,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=986058.0, ans=0.95 2023-06-21 14:27:50,736 INFO [train.py:996] (0/4) Epoch 6, batch 11900, loss[loss=0.2629, simple_loss=0.3358, pruned_loss=0.09503, over 21662.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3279, pruned_loss=0.08845, over 4276893.95 frames. 
], batch size: 332, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:28:10,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-21 14:28:26,597 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-21 14:28:37,423 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:28:51,568 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-21 14:29:10,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=986478.0, ans=15.0 2023-06-21 14:29:27,045 INFO [train.py:996] (0/4) Epoch 6, batch 11950, loss[loss=0.2309, simple_loss=0.3289, pruned_loss=0.06642, over 21692.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3302, pruned_loss=0.08523, over 4271155.21 frames. ], batch size: 414, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:29:34,418 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=22.5 2023-06-21 14:29:56,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=986598.0, ans=0.0 2023-06-21 14:30:23,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.615e+02 3.254e+02 4.193e+02 8.163e+02, threshold=6.508e+02, percent-clipped=5.0 2023-06-21 14:30:45,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=986778.0, ans=0.125 2023-06-21 14:30:56,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=986778.0, ans=0.1 2023-06-21 14:31:03,752 INFO [train.py:996] (0/4) Epoch 6, batch 12000, loss[loss=0.2574, simple_loss=0.3027, pruned_loss=0.1061, over 21225.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3241, pruned_loss=0.08368, over 4273967.80 frames. ], batch size: 144, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:31:03,753 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 14:31:23,356 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2642, simple_loss=0.3586, pruned_loss=0.08492, over 1796401.00 frames. 2023-06-21 14:31:23,357 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 14:31:34,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=986838.0, ans=0.125 2023-06-21 14:31:47,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=986898.0, ans=0.2 2023-06-21 14:32:15,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.13 vs. 
limit=10.0 2023-06-21 14:32:37,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=987078.0, ans=0.1 2023-06-21 14:32:46,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=987078.0, ans=0.125 2023-06-21 14:33:01,557 INFO [train.py:996] (0/4) Epoch 6, batch 12050, loss[loss=0.241, simple_loss=0.3057, pruned_loss=0.08815, over 21863.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3208, pruned_loss=0.08568, over 4279713.25 frames. ], batch size: 351, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:33:19,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=987138.0, ans=0.1 2023-06-21 14:33:29,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.99 vs. limit=22.5 2023-06-21 14:33:53,161 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.343e+02 3.010e+02 3.446e+02 4.017e+02 8.146e+02, threshold=6.892e+02, percent-clipped=4.0 2023-06-21 14:34:39,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=987378.0, ans=0.125 2023-06-21 14:34:43,751 INFO [train.py:996] (0/4) Epoch 6, batch 12100, loss[loss=0.2778, simple_loss=0.345, pruned_loss=0.1053, over 21642.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.325, pruned_loss=0.09009, over 4285452.23 frames. ], batch size: 230, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:34:50,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=987438.0, ans=0.0 2023-06-21 14:35:00,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=987438.0, ans=0.1 2023-06-21 14:35:54,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=987618.0, ans=0.125 2023-06-21 14:36:27,297 INFO [train.py:996] (0/4) Epoch 6, batch 12150, loss[loss=0.2605, simple_loss=0.3618, pruned_loss=0.07959, over 21295.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3279, pruned_loss=0.08885, over 4273460.78 frames. ], batch size: 548, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:36:34,015 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:36:48,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2023-06-21 14:37:09,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=987858.0, ans=0.025 2023-06-21 14:37:19,845 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 3.020e+02 3.614e+02 4.015e+02 8.551e+02, threshold=7.228e+02, percent-clipped=5.0 2023-06-21 14:37:44,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=987918.0, ans=0.05 2023-06-21 14:37:45,046 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.59 vs. 
limit=15.0 2023-06-21 14:37:50,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=987978.0, ans=0.125 2023-06-21 14:38:01,624 INFO [train.py:996] (0/4) Epoch 6, batch 12200, loss[loss=0.2261, simple_loss=0.287, pruned_loss=0.08257, over 21491.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3222, pruned_loss=0.08689, over 4271982.98 frames. ], batch size: 391, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:38:30,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=988098.0, ans=0.125 2023-06-21 14:38:44,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=988158.0, ans=0.125 2023-06-21 14:38:56,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=988218.0, ans=0.2 2023-06-21 14:39:36,221 INFO [train.py:996] (0/4) Epoch 6, batch 12250, loss[loss=0.2267, simple_loss=0.3054, pruned_loss=0.07402, over 21489.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.315, pruned_loss=0.08327, over 4261399.93 frames. ], batch size: 471, lr: 5.14e-03, grad_scale: 16.0 2023-06-21 14:39:43,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-21 14:40:22,487 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 2.454e+02 2.966e+02 3.954e+02 7.953e+02, threshold=5.931e+02, percent-clipped=3.0 2023-06-21 14:40:27,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=988518.0, ans=0.05 2023-06-21 14:40:35,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=988518.0, ans=0.2 2023-06-21 14:40:58,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=988578.0, ans=0.125 2023-06-21 14:41:10,041 INFO [train.py:996] (0/4) Epoch 6, batch 12300, loss[loss=0.2086, simple_loss=0.2822, pruned_loss=0.06754, over 21216.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3072, pruned_loss=0.07725, over 4257666.53 frames. ], batch size: 176, lr: 5.14e-03, grad_scale: 16.0 2023-06-21 14:41:19,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=988638.0, ans=0.0 2023-06-21 14:42:14,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=988818.0, ans=0.0 2023-06-21 14:42:15,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=988818.0, ans=0.2 2023-06-21 14:42:27,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=988818.0, ans=0.125 2023-06-21 14:42:41,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-21 14:42:44,777 INFO [train.py:996] (0/4) Epoch 6, batch 12350, loss[loss=0.2357, simple_loss=0.3134, pruned_loss=0.079, over 21834.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3127, pruned_loss=0.07803, over 4261191.91 frames. 
], batch size: 118, lr: 5.14e-03, grad_scale: 16.0 2023-06-21 14:42:54,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=988938.0, ans=0.2 2023-06-21 14:43:03,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=988998.0, ans=0.125 2023-06-21 14:43:36,233 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.794e+02 2.741e+02 3.281e+02 4.325e+02 6.278e+02, threshold=6.562e+02, percent-clipped=1.0 2023-06-21 14:44:18,051 INFO [train.py:996] (0/4) Epoch 6, batch 12400, loss[loss=0.2638, simple_loss=0.3275, pruned_loss=0.1001, over 21895.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3138, pruned_loss=0.08134, over 4271618.86 frames. ], batch size: 124, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:44:18,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=989238.0, ans=0.0 2023-06-21 14:44:51,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=989358.0, ans=0.0 2023-06-21 14:45:18,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=989418.0, ans=0.1 2023-06-21 14:45:47,479 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5 2023-06-21 14:45:52,848 INFO [train.py:996] (0/4) Epoch 6, batch 12450, loss[loss=0.2735, simple_loss=0.3389, pruned_loss=0.104, over 21378.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3172, pruned_loss=0.08461, over 4272426.92 frames. ], batch size: 548, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:46:33,815 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.32 vs. limit=15.0 2023-06-21 14:46:54,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=989658.0, ans=0.125 2023-06-21 14:46:55,553 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.846e+02 3.218e+02 3.959e+02 6.466e+02, threshold=6.436e+02, percent-clipped=0.0 2023-06-21 14:46:57,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=989718.0, ans=0.125 2023-06-21 14:47:08,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=989718.0, ans=0.125 2023-06-21 14:47:12,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=989718.0, ans=0.125 2023-06-21 14:47:35,193 INFO [train.py:996] (0/4) Epoch 6, batch 12500, loss[loss=0.2899, simple_loss=0.3647, pruned_loss=0.1076, over 21471.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3313, pruned_loss=0.08982, over 4279722.90 frames. 
], batch size: 131, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:48:25,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=989958.0, ans=0.125 2023-06-21 14:49:00,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=990078.0, ans=0.125 2023-06-21 14:49:19,719 INFO [train.py:996] (0/4) Epoch 6, batch 12550, loss[loss=0.345, simple_loss=0.3996, pruned_loss=0.1452, over 21323.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3367, pruned_loss=0.09269, over 4276266.93 frames. ], batch size: 507, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:49:48,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=990198.0, ans=0.125 2023-06-21 14:50:00,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=990258.0, ans=0.125 2023-06-21 14:50:12,194 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 2.946e+02 3.555e+02 3.995e+02 6.725e+02, threshold=7.110e+02, percent-clipped=1.0 2023-06-21 14:50:55,511 INFO [train.py:996] (0/4) Epoch 6, batch 12600, loss[loss=0.2212, simple_loss=0.3059, pruned_loss=0.06823, over 21671.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3338, pruned_loss=0.0901, over 4279349.72 frames. ], batch size: 247, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:51:16,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=990498.0, ans=0.2 2023-06-21 14:51:45,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=990618.0, ans=0.1 2023-06-21 14:51:58,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=990618.0, ans=0.05 2023-06-21 14:52:01,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=990618.0, ans=0.125 2023-06-21 14:52:25,071 INFO [train.py:996] (0/4) Epoch 6, batch 12650, loss[loss=0.2645, simple_loss=0.3246, pruned_loss=0.1022, over 21879.00 frames. ], tot_loss[loss=0.249, simple_loss=0.326, pruned_loss=0.08595, over 4282251.55 frames. ], batch size: 124, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:52:49,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. 
limit=15.0 2023-06-21 14:53:08,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=990858.0, ans=0.0 2023-06-21 14:53:09,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=990858.0, ans=0.07 2023-06-21 14:53:16,412 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.529e+02 3.007e+02 3.447e+02 6.549e+02, threshold=6.013e+02, percent-clipped=0.0 2023-06-21 14:53:20,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=990918.0, ans=0.125 2023-06-21 14:53:46,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=990978.0, ans=0.125 2023-06-21 14:54:11,993 INFO [train.py:996] (0/4) Epoch 6, batch 12700, loss[loss=0.3152, simple_loss=0.362, pruned_loss=0.1342, over 21538.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3269, pruned_loss=0.08907, over 4287937.45 frames. ], batch size: 471, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:54:32,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=991098.0, ans=0.0 2023-06-21 14:54:51,243 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=12.0 2023-06-21 14:55:22,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=991218.0, ans=0.0 2023-06-21 14:55:36,473 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-21 14:55:48,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=22.5 2023-06-21 14:55:48,684 INFO [train.py:996] (0/4) Epoch 6, batch 12750, loss[loss=0.2506, simple_loss=0.3271, pruned_loss=0.08706, over 19883.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3291, pruned_loss=0.09011, over 4277710.16 frames. ], batch size: 702, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:55:59,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=991338.0, ans=0.125 2023-06-21 14:56:19,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=991458.0, ans=0.04949747468305833 2023-06-21 14:56:36,157 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.913e+02 3.338e+02 4.032e+02 7.736e+02, threshold=6.676e+02, percent-clipped=3.0 2023-06-21 14:57:21,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=991578.0, ans=0.2 2023-06-21 14:57:23,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=991638.0, ans=0.0 2023-06-21 14:57:24,158 INFO [train.py:996] (0/4) Epoch 6, batch 12800, loss[loss=0.2691, simple_loss=0.3423, pruned_loss=0.09797, over 21444.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3293, pruned_loss=0.09125, over 4286384.14 frames. 
], batch size: 176, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:57:26,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=991638.0, ans=0.05 2023-06-21 14:57:43,365 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.44 vs. limit=15.0 2023-06-21 14:57:49,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.34 vs. limit=15.0 2023-06-21 14:58:19,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=991818.0, ans=0.125 2023-06-21 14:58:21,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=991818.0, ans=0.125 2023-06-21 14:58:36,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=991818.0, ans=0.125 2023-06-21 14:58:59,824 INFO [train.py:996] (0/4) Epoch 6, batch 12850, loss[loss=0.2281, simple_loss=0.3175, pruned_loss=0.06934, over 21899.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.331, pruned_loss=0.09288, over 4288941.36 frames. ], batch size: 316, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:59:04,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=991938.0, ans=0.04949747468305833 2023-06-21 14:59:22,414 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-21 14:59:53,149 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.801e+02 3.143e+02 3.622e+02 6.427e+02, threshold=6.286e+02, percent-clipped=0.0 2023-06-21 15:00:19,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=15.0 2023-06-21 15:00:36,441 INFO [train.py:996] (0/4) Epoch 6, batch 12900, loss[loss=0.2142, simple_loss=0.2943, pruned_loss=0.06706, over 21597.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3279, pruned_loss=0.0882, over 4282116.28 frames. ], batch size: 263, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:01:17,963 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-21 15:01:47,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=992418.0, ans=0.07 2023-06-21 15:01:52,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992418.0, ans=0.1 2023-06-21 15:02:09,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=992478.0, ans=0.125 2023-06-21 15:02:12,332 INFO [train.py:996] (0/4) Epoch 6, batch 12950, loss[loss=0.2195, simple_loss=0.2993, pruned_loss=0.06991, over 21744.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3252, pruned_loss=0.08629, over 4276376.41 frames. 
], batch size: 332, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:02:45,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992598.0, ans=0.1 2023-06-21 15:02:47,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=992598.0, ans=0.0 2023-06-21 15:03:15,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.940e+02 3.602e+02 4.409e+02 7.106e+02, threshold=7.204e+02, percent-clipped=2.0 2023-06-21 15:03:37,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=992778.0, ans=0.125 2023-06-21 15:03:46,812 INFO [train.py:996] (0/4) Epoch 6, batch 13000, loss[loss=0.2252, simple_loss=0.3003, pruned_loss=0.07501, over 21684.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3244, pruned_loss=0.08658, over 4275587.77 frames. ], batch size: 263, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:03:47,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=992838.0, ans=0.0 2023-06-21 15:05:08,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=993078.0, ans=0.2 2023-06-21 15:05:16,223 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.70 vs. limit=6.0 2023-06-21 15:05:21,381 INFO [train.py:996] (0/4) Epoch 6, batch 13050, loss[loss=0.2613, simple_loss=0.3232, pruned_loss=0.09966, over 21639.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3194, pruned_loss=0.08384, over 4268342.37 frames. ], batch size: 471, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:06:06,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=993258.0, ans=0.0 2023-06-21 15:06:23,890 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.762e+02 3.169e+02 4.003e+02 6.766e+02, threshold=6.339e+02, percent-clipped=0.0 2023-06-21 15:06:35,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=993318.0, ans=0.2 2023-06-21 15:06:36,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=993318.0, ans=0.0 2023-06-21 15:07:00,673 INFO [train.py:996] (0/4) Epoch 6, batch 13100, loss[loss=0.301, simple_loss=0.3665, pruned_loss=0.1178, over 21175.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.322, pruned_loss=0.08353, over 4277825.29 frames. ], batch size: 143, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:08:12,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=993618.0, ans=0.0 2023-06-21 15:08:40,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=993678.0, ans=0.0 2023-06-21 15:08:43,000 INFO [train.py:996] (0/4) Epoch 6, batch 13150, loss[loss=0.3367, simple_loss=0.4674, pruned_loss=0.103, over 19788.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3267, pruned_loss=0.087, over 4275199.22 frames. 
], batch size: 702, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:08:45,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=993738.0, ans=0.125 2023-06-21 15:09:05,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=993738.0, ans=0.125 2023-06-21 15:09:19,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=993798.0, ans=0.125 2023-06-21 15:09:21,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=993798.0, ans=0.125 2023-06-21 15:09:37,810 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 3.111e+02 3.957e+02 5.293e+02 1.278e+03, threshold=7.913e+02, percent-clipped=9.0 2023-06-21 15:10:27,533 INFO [train.py:996] (0/4) Epoch 6, batch 13200, loss[loss=0.2558, simple_loss=0.3196, pruned_loss=0.09601, over 21289.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3246, pruned_loss=0.08761, over 4275370.08 frames. ], batch size: 176, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:10:54,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=22.5 2023-06-21 15:11:15,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=994158.0, ans=0.1 2023-06-21 15:12:03,182 INFO [train.py:996] (0/4) Epoch 6, batch 13250, loss[loss=0.2161, simple_loss=0.301, pruned_loss=0.06564, over 21525.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.324, pruned_loss=0.08928, over 4276115.30 frames. ], batch size: 131, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:12:33,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=994458.0, ans=0.125 2023-06-21 15:12:51,219 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.898e+02 3.245e+02 3.819e+02 5.517e+02, threshold=6.489e+02, percent-clipped=0.0 2023-06-21 15:13:08,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-21 15:13:09,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=994518.0, ans=0.1 2023-06-21 15:13:33,149 INFO [train.py:996] (0/4) Epoch 6, batch 13300, loss[loss=0.2698, simple_loss=0.3505, pruned_loss=0.09455, over 21637.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3272, pruned_loss=0.08939, over 4271555.53 frames. ], batch size: 389, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:13:36,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=994638.0, ans=0.125 2023-06-21 15:13:49,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=994698.0, ans=0.0 2023-06-21 15:13:55,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=994698.0, ans=0.2 2023-06-21 15:15:05,604 INFO [train.py:996] (0/4) Epoch 6, batch 13350, loss[loss=0.2734, simple_loss=0.348, pruned_loss=0.09935, over 21665.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3315, pruned_loss=0.09228, over 4276606.72 frames. 
], batch size: 351, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:16:01,507 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.076e+02 3.846e+02 4.574e+02 8.350e+02, threshold=7.691e+02, percent-clipped=3.0 2023-06-21 15:16:20,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=995118.0, ans=0.0 2023-06-21 15:16:26,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=995178.0, ans=0.0 2023-06-21 15:16:38,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=995178.0, ans=0.2 2023-06-21 15:16:44,136 INFO [train.py:996] (0/4) Epoch 6, batch 13400, loss[loss=0.2377, simple_loss=0.3713, pruned_loss=0.05201, over 19788.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3331, pruned_loss=0.09255, over 4271426.41 frames. ], batch size: 702, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:16:46,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=995238.0, ans=0.1 2023-06-21 15:16:47,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=995238.0, ans=0.5 2023-06-21 15:17:58,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=995418.0, ans=0.125 2023-06-21 15:18:12,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=995478.0, ans=0.0 2023-06-21 15:18:25,032 INFO [train.py:996] (0/4) Epoch 6, batch 13450, loss[loss=0.2736, simple_loss=0.3445, pruned_loss=0.1013, over 21531.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3352, pruned_loss=0.09514, over 4273600.07 frames. ], batch size: 131, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:18:31,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=995538.0, ans=0.125 2023-06-21 15:18:31,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=995538.0, ans=0.125 2023-06-21 15:19:28,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=995718.0, ans=0.0 2023-06-21 15:19:29,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 3.161e+02 3.421e+02 3.980e+02 7.603e+02, threshold=6.841e+02, percent-clipped=0.0 2023-06-21 15:19:46,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=995778.0, ans=0.1 2023-06-21 15:20:06,006 INFO [train.py:996] (0/4) Epoch 6, batch 13500, loss[loss=0.3113, simple_loss=0.3717, pruned_loss=0.1255, over 21698.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3224, pruned_loss=0.09171, over 4261374.93 frames. 
], batch size: 441, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:20:09,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=995838.0, ans=0.0 2023-06-21 15:21:11,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=996018.0, ans=0.1 2023-06-21 15:21:22,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=996078.0, ans=0.0 2023-06-21 15:21:23,323 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=22.5 2023-06-21 15:21:43,576 INFO [train.py:996] (0/4) Epoch 6, batch 13550, loss[loss=0.2511, simple_loss=0.342, pruned_loss=0.08009, over 20770.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3252, pruned_loss=0.09039, over 4261179.56 frames. ], batch size: 607, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:22:02,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=996198.0, ans=0.125 2023-06-21 15:22:20,981 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.45 vs. limit=10.0 2023-06-21 15:22:40,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=996258.0, ans=0.125 2023-06-21 15:22:44,915 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.395e+02 3.030e+02 3.606e+02 4.387e+02 7.560e+02, threshold=7.212e+02, percent-clipped=4.0 2023-06-21 15:22:57,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=996378.0, ans=0.05 2023-06-21 15:23:11,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=996378.0, ans=0.125 2023-06-21 15:23:18,633 INFO [train.py:996] (0/4) Epoch 6, batch 13600, loss[loss=0.2942, simple_loss=0.348, pruned_loss=0.1202, over 21685.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3261, pruned_loss=0.09064, over 4268075.66 frames. ], batch size: 507, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:23:30,429 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=12.0 2023-06-21 15:24:18,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=996618.0, ans=0.1 2023-06-21 15:24:26,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=996618.0, ans=0.125 2023-06-21 15:24:46,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=996678.0, ans=0.2 2023-06-21 15:24:58,352 INFO [train.py:996] (0/4) Epoch 6, batch 13650, loss[loss=0.2514, simple_loss=0.3023, pruned_loss=0.1002, over 21540.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3231, pruned_loss=0.08843, over 4271494.32 frames. 
], batch size: 414, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:25:33,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=996798.0, ans=0.125 2023-06-21 15:25:48,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=996858.0, ans=0.125 2023-06-21 15:25:53,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=996918.0, ans=0.125 2023-06-21 15:25:54,155 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.919e+02 3.475e+02 4.506e+02 7.169e+02, threshold=6.950e+02, percent-clipped=0.0 2023-06-21 15:26:20,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=996978.0, ans=0.125 2023-06-21 15:26:32,665 INFO [train.py:996] (0/4) Epoch 6, batch 13700, loss[loss=0.2809, simple_loss=0.3566, pruned_loss=0.1026, over 21645.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3176, pruned_loss=0.08786, over 4270700.36 frames. ], batch size: 441, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:26:37,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=997038.0, ans=0.125 2023-06-21 15:27:17,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=997158.0, ans=0.125 2023-06-21 15:27:30,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=997218.0, ans=0.1 2023-06-21 15:27:46,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=997218.0, ans=0.125 2023-06-21 15:27:48,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=15.0 2023-06-21 15:28:06,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=997278.0, ans=0.2 2023-06-21 15:28:15,477 INFO [train.py:996] (0/4) Epoch 6, batch 13750, loss[loss=0.1997, simple_loss=0.255, pruned_loss=0.07215, over 21140.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3153, pruned_loss=0.08586, over 4258537.19 frames. ], batch size: 176, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:28:51,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=997458.0, ans=0.0 2023-06-21 15:29:15,007 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.229e+02 4.012e+02 5.672e+02 9.491e+02, threshold=8.024e+02, percent-clipped=9.0 2023-06-21 15:29:23,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=997518.0, ans=0.04949747468305833 2023-06-21 15:29:58,607 INFO [train.py:996] (0/4) Epoch 6, batch 13800, loss[loss=0.2274, simple_loss=0.3333, pruned_loss=0.06076, over 21614.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3211, pruned_loss=0.08529, over 4255549.16 frames. 
], batch size: 263, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:30:17,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=997698.0, ans=0.125 2023-06-21 15:30:24,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=997698.0, ans=0.0 2023-06-21 15:30:44,726 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=22.5 2023-06-21 15:31:09,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=997818.0, ans=0.125 2023-06-21 15:31:25,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=997878.0, ans=0.0 2023-06-21 15:31:35,476 INFO [train.py:996] (0/4) Epoch 6, batch 13850, loss[loss=0.2852, simple_loss=0.3702, pruned_loss=0.1001, over 21615.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3246, pruned_loss=0.08502, over 4254688.24 frames. ], batch size: 414, lr: 5.12e-03, grad_scale: 8.0 2023-06-21 15:31:40,222 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:32:43,628 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.953e+02 3.447e+02 4.211e+02 7.666e+02, threshold=6.893e+02, percent-clipped=0.0 2023-06-21 15:32:44,590 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-21 15:32:53,746 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-21 15:33:10,837 INFO [train.py:996] (0/4) Epoch 6, batch 13900, loss[loss=0.2835, simple_loss=0.3435, pruned_loss=0.1117, over 21806.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3297, pruned_loss=0.0901, over 4259466.31 frames. ], batch size: 414, lr: 5.12e-03, grad_scale: 8.0 2023-06-21 15:33:20,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=10.0 2023-06-21 15:33:31,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=998298.0, ans=0.0 2023-06-21 15:33:56,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=998358.0, ans=0.0 2023-06-21 15:34:02,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=998358.0, ans=0.125 2023-06-21 15:34:25,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=998478.0, ans=0.125 2023-06-21 15:34:37,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=12.0 2023-06-21 15:34:41,893 INFO [train.py:996] (0/4) Epoch 6, batch 13950, loss[loss=0.2474, simple_loss=0.3207, pruned_loss=0.08704, over 21367.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3307, pruned_loss=0.09139, over 4263419.62 frames. 
], batch size: 144, lr: 5.12e-03, grad_scale: 8.0 2023-06-21 15:35:31,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-21 15:35:43,666 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.434e+02 3.088e+02 3.493e+02 4.359e+02 6.535e+02, threshold=6.987e+02, percent-clipped=0.0 2023-06-21 15:36:07,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=998778.0, ans=0.125 2023-06-21 15:36:10,600 INFO [train.py:996] (0/4) Epoch 6, batch 14000, loss[loss=0.2223, simple_loss=0.3104, pruned_loss=0.0671, over 21627.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3278, pruned_loss=0.08943, over 4272393.80 frames. ], batch size: 263, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:36:12,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=998838.0, ans=0.125 2023-06-21 15:36:15,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=998838.0, ans=0.125 2023-06-21 15:36:49,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=998958.0, ans=0.05 2023-06-21 15:36:59,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=998958.0, ans=0.5 2023-06-21 15:37:41,080 INFO [train.py:996] (0/4) Epoch 6, batch 14050, loss[loss=0.2144, simple_loss=0.2868, pruned_loss=0.07104, over 21422.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3223, pruned_loss=0.08527, over 4269829.57 frames. ], batch size: 194, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:37:41,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=999138.0, ans=0.125 2023-06-21 15:38:47,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=999318.0, ans=0.125 2023-06-21 15:38:48,435 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.785e+02 3.183e+02 4.255e+02 6.746e+02, threshold=6.366e+02, percent-clipped=0.0 2023-06-21 15:38:50,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=999318.0, ans=0.125 2023-06-21 15:39:15,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=999438.0, ans=0.1 2023-06-21 15:39:16,526 INFO [train.py:996] (0/4) Epoch 6, batch 14100, loss[loss=0.2605, simple_loss=0.3259, pruned_loss=0.09754, over 21734.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3174, pruned_loss=0.08519, over 4267081.57 frames. 
], batch size: 351, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:39:40,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=999498.0, ans=0.125 2023-06-21 15:39:54,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=999498.0, ans=0.125 2023-06-21 15:40:25,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=999618.0, ans=0.0 2023-06-21 15:40:47,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=999678.0, ans=0.0 2023-06-21 15:40:49,624 INFO [train.py:996] (0/4) Epoch 6, batch 14150, loss[loss=0.2343, simple_loss=0.3185, pruned_loss=0.075, over 21842.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3214, pruned_loss=0.08617, over 4256817.96 frames. ], batch size: 107, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:41:47,168 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.804e+02 3.332e+02 4.334e+02 8.014e+02, threshold=6.664e+02, percent-clipped=2.0 2023-06-21 15:41:49,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=15.0 2023-06-21 15:42:06,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=999918.0, ans=0.0 2023-06-21 15:42:16,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=999978.0, ans=0.125 2023-06-21 15:42:23,678 INFO [train.py:996] (0/4) Epoch 6, batch 14200, loss[loss=0.2436, simple_loss=0.2991, pruned_loss=0.09408, over 21602.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3185, pruned_loss=0.08416, over 4261533.48 frames. ], batch size: 230, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:42:43,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1000098.0, ans=0.0 2023-06-21 15:42:43,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1000098.0, ans=0.125 2023-06-21 15:43:29,944 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=22.5 2023-06-21 15:43:58,947 INFO [train.py:996] (0/4) Epoch 6, batch 14250, loss[loss=0.1946, simple_loss=0.2792, pruned_loss=0.05495, over 21753.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3132, pruned_loss=0.0846, over 4249048.39 frames. ], batch size: 371, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:44:16,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1000398.0, ans=0.125 2023-06-21 15:44:17,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1000398.0, ans=0.125 2023-06-21 15:44:19,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1000398.0, ans=0.125 2023-06-21 15:44:19,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.84 vs. 
limit=15.0 2023-06-21 15:44:59,213 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.649e+02 3.041e+02 3.616e+02 7.648e+02, threshold=6.082e+02, percent-clipped=1.0 2023-06-21 15:45:27,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1000578.0, ans=0.125 2023-06-21 15:45:35,971 INFO [train.py:996] (0/4) Epoch 6, batch 14300, loss[loss=0.3335, simple_loss=0.4174, pruned_loss=0.1247, over 21750.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3121, pruned_loss=0.08374, over 4244090.84 frames. ], batch size: 351, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:45:48,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1000638.0, ans=0.95 2023-06-21 15:46:12,064 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:46:21,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.21 vs. limit=5.0 2023-06-21 15:46:32,523 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-21 15:46:47,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1000818.0, ans=0.2 2023-06-21 15:47:10,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1000938.0, ans=0.0 2023-06-21 15:47:11,430 INFO [train.py:996] (0/4) Epoch 6, batch 14350, loss[loss=0.2559, simple_loss=0.3118, pruned_loss=0.1, over 20007.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3176, pruned_loss=0.08446, over 4247428.66 frames. ], batch size: 703, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:47:58,884 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-21 15:48:05,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1001058.0, ans=0.125 2023-06-21 15:48:18,793 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 3.231e+02 3.841e+02 4.769e+02 8.361e+02, threshold=7.683e+02, percent-clipped=10.0 2023-06-21 15:48:23,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1001118.0, ans=0.05 2023-06-21 15:48:38,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1001178.0, ans=0.0 2023-06-21 15:48:46,000 INFO [train.py:996] (0/4) Epoch 6, batch 14400, loss[loss=0.2098, simple_loss=0.2724, pruned_loss=0.07366, over 21581.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3171, pruned_loss=0.08594, over 4258928.34 frames. 
], batch size: 230, lr: 5.11e-03, grad_scale: 32.0 2023-06-21 15:49:05,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1001298.0, ans=0.1 2023-06-21 15:49:07,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1001298.0, ans=0.1 2023-06-21 15:50:10,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1001478.0, ans=0.125 2023-06-21 15:50:12,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1001478.0, ans=0.0 2023-06-21 15:50:20,553 INFO [train.py:996] (0/4) Epoch 6, batch 14450, loss[loss=0.2481, simple_loss=0.297, pruned_loss=0.09961, over 21279.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3126, pruned_loss=0.08674, over 4264069.64 frames. ], batch size: 176, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:51:07,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1001658.0, ans=0.125 2023-06-21 15:51:07,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1001658.0, ans=0.125 2023-06-21 15:51:29,730 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 2.916e+02 3.251e+02 4.168e+02 6.765e+02, threshold=6.503e+02, percent-clipped=0.0 2023-06-21 15:51:55,520 INFO [train.py:996] (0/4) Epoch 6, batch 14500, loss[loss=0.2268, simple_loss=0.2898, pruned_loss=0.08197, over 21163.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3114, pruned_loss=0.08705, over 4258808.22 frames. ], batch size: 143, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:51:59,724 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-21 15:52:31,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1001898.0, ans=0.2 2023-06-21 15:52:33,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1001898.0, ans=0.1 2023-06-21 15:52:44,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1001958.0, ans=0.2 2023-06-21 15:53:31,827 INFO [train.py:996] (0/4) Epoch 6, batch 14550, loss[loss=0.2058, simple_loss=0.3107, pruned_loss=0.05045, over 19773.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3163, pruned_loss=0.08834, over 4257014.48 frames. ], batch size: 702, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:54:11,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1002258.0, ans=0.125 2023-06-21 15:54:40,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1002318.0, ans=0.125 2023-06-21 15:54:41,349 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 3.127e+02 3.885e+02 5.234e+02 1.064e+03, threshold=7.771e+02, percent-clipped=10.0 2023-06-21 15:54:46,299 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:55:07,182 INFO [train.py:996] (0/4) Epoch 6, batch 14600, loss[loss=0.2711, simple_loss=0.3491, pruned_loss=0.09659, over 21284.00 frames. 
], tot_loss[loss=0.2543, simple_loss=0.3242, pruned_loss=0.09223, over 4262004.78 frames. ], batch size: 159, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:55:24,920 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-21 15:55:39,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1002498.0, ans=0.125 2023-06-21 15:55:45,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1002558.0, ans=0.09899494936611666 2023-06-21 15:56:20,374 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.31 vs. limit=22.5 2023-06-21 15:56:22,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1002618.0, ans=0.125 2023-06-21 15:56:24,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1002618.0, ans=0.125 2023-06-21 15:56:37,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1002678.0, ans=0.125 2023-06-21 15:56:41,651 INFO [train.py:996] (0/4) Epoch 6, batch 14650, loss[loss=0.2148, simple_loss=0.3036, pruned_loss=0.06303, over 21762.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3255, pruned_loss=0.09058, over 4257373.49 frames. ], batch size: 332, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:56:42,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-21 15:57:07,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1002798.0, ans=0.125 2023-06-21 15:57:08,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1002798.0, ans=0.125 2023-06-21 15:57:11,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1002798.0, ans=0.125 2023-06-21 15:57:26,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1002858.0, ans=0.0 2023-06-21 15:57:48,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-21 15:57:50,600 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.901e+02 3.727e+02 5.131e+02 9.036e+02, threshold=7.453e+02, percent-clipped=4.0 2023-06-21 15:58:21,846 INFO [train.py:996] (0/4) Epoch 6, batch 14700, loss[loss=0.2459, simple_loss=0.3139, pruned_loss=0.08897, over 20076.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3185, pruned_loss=0.08512, over 4241853.45 frames. 
], batch size: 702, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:58:56,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1003158.0, ans=0.125 2023-06-21 15:59:14,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1003158.0, ans=0.0 2023-06-21 15:59:16,838 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:59:21,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1003218.0, ans=15.0 2023-06-21 15:59:51,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1003278.0, ans=0.0 2023-06-21 15:59:53,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1003278.0, ans=0.0 2023-06-21 15:59:59,097 INFO [train.py:996] (0/4) Epoch 6, batch 14750, loss[loss=0.3523, simple_loss=0.4288, pruned_loss=0.1379, over 21216.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3245, pruned_loss=0.08823, over 4253031.97 frames. ], batch size: 548, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 16:00:54,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1003458.0, ans=0.125 2023-06-21 16:01:04,196 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 3.041e+02 3.594e+02 4.539e+02 8.460e+02, threshold=7.189e+02, percent-clipped=3.0 2023-06-21 16:01:07,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1003518.0, ans=0.125 2023-06-21 16:01:35,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1003578.0, ans=0.2 2023-06-21 16:01:39,220 INFO [train.py:996] (0/4) Epoch 6, batch 14800, loss[loss=0.3018, simple_loss=0.3459, pruned_loss=0.1288, over 21334.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3371, pruned_loss=0.09486, over 4264626.07 frames. ], batch size: 507, lr: 5.11e-03, grad_scale: 32.0 2023-06-21 16:02:08,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1003698.0, ans=0.125 2023-06-21 16:02:10,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1003698.0, ans=0.125 2023-06-21 16:02:11,180 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-21 16:02:13,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1003698.0, ans=0.0 2023-06-21 16:03:20,698 INFO [train.py:996] (0/4) Epoch 6, batch 14850, loss[loss=0.2273, simple_loss=0.2945, pruned_loss=0.0801, over 21570.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3304, pruned_loss=0.09368, over 4253255.66 frames. 
], batch size: 263, lr: 5.10e-03, grad_scale: 32.0 2023-06-21 16:03:47,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1003998.0, ans=0.125 2023-06-21 16:04:28,218 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.518e+02 3.136e+02 3.769e+02 4.672e+02 7.258e+02, threshold=7.538e+02, percent-clipped=1.0 2023-06-21 16:04:48,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1004178.0, ans=0.04949747468305833 2023-06-21 16:05:03,051 INFO [train.py:996] (0/4) Epoch 6, batch 14900, loss[loss=0.2617, simple_loss=0.3327, pruned_loss=0.09539, over 21380.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3325, pruned_loss=0.09378, over 4256459.19 frames. ], batch size: 549, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:05:22,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1004298.0, ans=0.0 2023-06-21 16:05:46,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1004358.0, ans=15.0 2023-06-21 16:06:40,454 INFO [train.py:996] (0/4) Epoch 6, batch 14950, loss[loss=0.2539, simple_loss=0.342, pruned_loss=0.08289, over 21632.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3361, pruned_loss=0.09508, over 4260555.67 frames. ], batch size: 441, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:06:44,487 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=22.5 2023-06-21 16:07:12,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1004598.0, ans=0.0 2023-06-21 16:07:14,874 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-06-21 16:07:26,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1004658.0, ans=0.125 2023-06-21 16:07:27,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1004658.0, ans=0.125 2023-06-21 16:07:41,314 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 2.931e+02 3.491e+02 4.464e+02 7.538e+02, threshold=6.982e+02, percent-clipped=0.0 2023-06-21 16:08:00,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1004778.0, ans=0.1 2023-06-21 16:08:10,590 INFO [train.py:996] (0/4) Epoch 6, batch 15000, loss[loss=0.2471, simple_loss=0.3158, pruned_loss=0.08916, over 21394.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3381, pruned_loss=0.09671, over 4268205.43 frames. ], batch size: 176, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:08:10,591 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 16:08:27,118 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.26, simple_loss=0.3558, pruned_loss=0.08209, over 1796401.00 frames. 2023-06-21 16:08:27,119 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 16:08:57,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. 
limit=22.5 2023-06-21 16:09:10,631 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-21 16:09:13,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1004958.0, ans=0.2 2023-06-21 16:10:03,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1005138.0, ans=0.125 2023-06-21 16:10:04,171 INFO [train.py:996] (0/4) Epoch 6, batch 15050, loss[loss=0.3029, simple_loss=0.4003, pruned_loss=0.1027, over 21170.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3378, pruned_loss=0.09726, over 4268513.24 frames. ], batch size: 548, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:10:10,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.68 vs. limit=22.5 2023-06-21 16:10:35,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1005198.0, ans=0.2 2023-06-21 16:10:49,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1005258.0, ans=0.0 2023-06-21 16:10:49,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1005258.0, ans=0.125 2023-06-21 16:11:16,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 2.976e+02 3.378e+02 4.052e+02 9.524e+02, threshold=6.756e+02, percent-clipped=3.0 2023-06-21 16:11:17,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1005318.0, ans=0.0 2023-06-21 16:11:44,053 INFO [train.py:996] (0/4) Epoch 6, batch 15100, loss[loss=0.3054, simple_loss=0.3703, pruned_loss=0.1202, over 21316.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3403, pruned_loss=0.09689, over 4276507.14 frames. ], batch size: 548, lr: 5.10e-03, grad_scale: 8.0 2023-06-21 16:12:09,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1005498.0, ans=0.2 2023-06-21 16:12:21,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1005498.0, ans=0.0 2023-06-21 16:12:42,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1005558.0, ans=22.5 2023-06-21 16:12:47,835 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:13:02,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1005678.0, ans=0.125 2023-06-21 16:13:02,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1005678.0, ans=0.125 2023-06-21 16:13:02,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0 2023-06-21 16:13:23,677 INFO [train.py:996] (0/4) Epoch 6, batch 15150, loss[loss=0.2304, simple_loss=0.291, pruned_loss=0.08493, over 21742.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3371, pruned_loss=0.09769, over 4275778.67 frames. 
], batch size: 112, lr: 5.10e-03, grad_scale: 8.0 2023-06-21 16:13:35,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1005738.0, ans=0.125 2023-06-21 16:13:57,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1005798.0, ans=0.125 2023-06-21 16:14:09,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=22.5 2023-06-21 16:14:25,207 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 2.946e+02 3.329e+02 3.848e+02 7.712e+02, threshold=6.658e+02, percent-clipped=2.0 2023-06-21 16:14:27,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1005918.0, ans=0.0 2023-06-21 16:14:53,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1005978.0, ans=0.2 2023-06-21 16:14:57,073 INFO [train.py:996] (0/4) Epoch 6, batch 15200, loss[loss=0.2211, simple_loss=0.2999, pruned_loss=0.07117, over 21646.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3274, pruned_loss=0.09287, over 4275497.66 frames. ], batch size: 247, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:15:27,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1006098.0, ans=0.125 2023-06-21 16:15:28,293 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0 2023-06-21 16:15:36,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1006158.0, ans=0.025 2023-06-21 16:15:44,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.05 vs. limit=15.0 2023-06-21 16:15:54,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1006218.0, ans=0.125 2023-06-21 16:16:03,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1006218.0, ans=0.0 2023-06-21 16:16:14,512 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.69 vs. limit=10.0 2023-06-21 16:16:30,315 INFO [train.py:996] (0/4) Epoch 6, batch 15250, loss[loss=0.2209, simple_loss=0.2834, pruned_loss=0.07917, over 21683.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3211, pruned_loss=0.09087, over 4274421.24 frames. ], batch size: 333, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:17:03,652 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. 
limit=15.0 2023-06-21 16:17:04,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1006398.0, ans=0.0 2023-06-21 16:17:32,963 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.045e+02 3.599e+02 4.449e+02 6.735e+02, threshold=7.197e+02, percent-clipped=2.0 2023-06-21 16:18:15,103 INFO [train.py:996] (0/4) Epoch 6, batch 15300, loss[loss=0.2559, simple_loss=0.3266, pruned_loss=0.09258, over 21987.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3236, pruned_loss=0.09277, over 4266733.75 frames. ], batch size: 317, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:18:31,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1006698.0, ans=0.025 2023-06-21 16:19:04,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1006818.0, ans=0.125 2023-06-21 16:19:16,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1006818.0, ans=0.2 2023-06-21 16:19:30,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1006878.0, ans=0.125 2023-06-21 16:19:31,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1006878.0, ans=0.2 2023-06-21 16:19:38,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1006878.0, ans=0.125 2023-06-21 16:19:44,633 INFO [train.py:996] (0/4) Epoch 6, batch 15350, loss[loss=0.2631, simple_loss=0.3563, pruned_loss=0.08498, over 21779.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3277, pruned_loss=0.09497, over 4274983.30 frames. ], batch size: 247, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:19:44,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1006938.0, ans=0.0 2023-06-21 16:20:10,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1006998.0, ans=0.0 2023-06-21 16:20:14,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1006998.0, ans=0.0 2023-06-21 16:20:32,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1007118.0, ans=0.0 2023-06-21 16:20:40,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.234e+02 2.864e+02 3.312e+02 3.850e+02 5.534e+02, threshold=6.625e+02, percent-clipped=0.0 2023-06-21 16:20:56,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=1007178.0, ans=0.2 2023-06-21 16:21:12,499 INFO [train.py:996] (0/4) Epoch 6, batch 15400, loss[loss=0.2307, simple_loss=0.3109, pruned_loss=0.07523, over 21877.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3282, pruned_loss=0.09371, over 4278864.67 frames. 
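The optim.py lines above report quartile-like statistics of recent gradient norms together with a clipping threshold and the fraction of steps that were clipped. The sketch below shows one way such bookkeeping can be done: keep a window of recent norms, report their quantiles, derive the threshold from the recent median, and count clipping events. The actual rule in the optimizer may differ; in particular, "threshold = 2.0 x recent median" is an assumption made here to mirror the logged Clipping_scale=2.0.

    # Illustrative gradient-norm tracker; not the optimizer's exact logic.
    import torch

    class GradNormTracker:
        def __init__(self, window: int = 200, scale: float = 2.0):
            self.window = window
            self.scale = scale          # analogous to "Clipping_scale=2.0"
            self.norms = []             # recent global gradient norms
            self.num_clipped = 0
            self.num_steps = 0

        def update(self, model: torch.nn.Module) -> float:
            grads = [p.grad.norm() for p in model.parameters() if p.grad is not None]
            norm = torch.norm(torch.stack(grads)).item()
            self.norms = (self.norms + [norm])[-self.window:]
            q = torch.quantile(torch.tensor(self.norms),
                               torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.scale * q[2].item()   # assumed: 2x recent median
            self.num_steps += 1
            if norm > threshold:
                self.num_clipped += 1
            print(f"grad-norm quartiles {q.tolist()}, threshold={threshold:.3e}, "
                  f"percent-clipped={100.0 * self.num_clipped / self.num_steps:.1f}")
            return threshold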
], batch size: 351, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:21:37,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1007298.0, ans=0.125 2023-06-21 16:21:43,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1007298.0, ans=0.0 2023-06-21 16:22:25,392 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=15.0 2023-06-21 16:22:51,012 INFO [train.py:996] (0/4) Epoch 6, batch 15450, loss[loss=0.2079, simple_loss=0.2779, pruned_loss=0.06893, over 21678.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3258, pruned_loss=0.09246, over 4275791.96 frames. ], batch size: 263, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:23:29,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1007658.0, ans=0.125 2023-06-21 16:23:48,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-21 16:23:53,560 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.820e+02 3.206e+02 3.882e+02 5.798e+02, threshold=6.411e+02, percent-clipped=0.0 2023-06-21 16:23:56,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1007718.0, ans=0.125 2023-06-21 16:24:11,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1007778.0, ans=0.125 2023-06-21 16:24:24,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-21 16:24:26,157 INFO [train.py:996] (0/4) Epoch 6, batch 15500, loss[loss=0.2604, simple_loss=0.3282, pruned_loss=0.09625, over 21591.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3293, pruned_loss=0.09159, over 4267008.50 frames. ], batch size: 263, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:24:57,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1007898.0, ans=0.1 2023-06-21 16:25:10,807 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-168000.pt 2023-06-21 16:25:25,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1008018.0, ans=0.07 2023-06-21 16:25:48,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1008078.0, ans=0.1 2023-06-21 16:26:04,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1008138.0, ans=0.125 2023-06-21 16:26:05,928 INFO [train.py:996] (0/4) Epoch 6, batch 15550, loss[loss=0.344, simple_loss=0.4508, pruned_loss=0.1186, over 19755.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3299, pruned_loss=0.09005, over 4260691.79 frames. 
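The checkpoint.py line above ("Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-168000.pt") corresponds to batch-indexed checkpointing. A hedged sketch of that pattern is below: save every save_every_n steps and prune older batch checkpoints; the pruning policy and parameter names here are assumptions chosen to match common icefall-style options, not a transcription of the recipe.

    # Sketch of periodic, batch-indexed checkpointing with pruning.
    from pathlib import Path
    import torch

    def maybe_save_checkpoint(model, optimizer, batch_idx: int, exp_dir: Path,
                              save_every_n: int = 4000, keep_last_k: int = 30) -> None:
        if batch_idx == 0 or batch_idx % save_every_n != 0:
            return
        exp_dir.mkdir(parents=True, exist_ok=True)
        path = exp_dir / f"checkpoint-{batch_idx}.pt"
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "batch_idx_train": batch_idx}, path)
        # keep only the most recent keep_last_k batch checkpoints
        ckpts = sorted(exp_dir.glob("checkpoint-*.pt"),
                       key=lambda p: int(p.stem.split("-")[1]))
        for old in ckpts[:-keep_last_k]:
            old.unlink()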
], batch size: 702, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:26:13,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1008138.0, ans=0.125 2023-06-21 16:26:22,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1008198.0, ans=0.0 2023-06-21 16:26:35,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-21 16:27:08,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.789e+02 3.164e+02 3.648e+02 6.720e+02, threshold=6.328e+02, percent-clipped=2.0 2023-06-21 16:27:28,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1008378.0, ans=0.04949747468305833 2023-06-21 16:27:37,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1008378.0, ans=0.125 2023-06-21 16:27:39,801 INFO [train.py:996] (0/4) Epoch 6, batch 15600, loss[loss=0.2188, simple_loss=0.2851, pruned_loss=0.07622, over 21920.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3228, pruned_loss=0.08815, over 4265139.81 frames. ], batch size: 125, lr: 5.09e-03, grad_scale: 32.0 2023-06-21 16:28:20,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1008558.0, ans=0.0 2023-06-21 16:28:25,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-21 16:29:13,707 INFO [train.py:996] (0/4) Epoch 6, batch 15650, loss[loss=0.2514, simple_loss=0.3123, pruned_loss=0.09526, over 21625.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3212, pruned_loss=0.0881, over 4261177.46 frames. ], batch size: 332, lr: 5.09e-03, grad_scale: 32.0 2023-06-21 16:29:37,920 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2023-06-21 16:29:49,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1008858.0, ans=0.1 2023-06-21 16:30:15,988 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.365e+02 3.062e+02 3.560e+02 4.421e+02 6.753e+02, threshold=7.119e+02, percent-clipped=3.0 2023-06-21 16:30:47,538 INFO [train.py:996] (0/4) Epoch 6, batch 15700, loss[loss=0.2347, simple_loss=0.3042, pruned_loss=0.08265, over 21839.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3164, pruned_loss=0.0871, over 4269001.68 frames. ], batch size: 372, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:30:55,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1009038.0, ans=0.125 2023-06-21 16:30:59,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1009038.0, ans=0.0 2023-06-21 16:31:01,645 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.47 vs. 
limit=22.5 2023-06-21 16:31:04,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1009098.0, ans=0.125 2023-06-21 16:31:10,906 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=22.5 2023-06-21 16:31:33,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1009158.0, ans=0.125 2023-06-21 16:31:48,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1009218.0, ans=0.0 2023-06-21 16:32:21,338 INFO [train.py:996] (0/4) Epoch 6, batch 15750, loss[loss=0.2208, simple_loss=0.2883, pruned_loss=0.07658, over 21480.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3131, pruned_loss=0.08716, over 4263639.14 frames. ], batch size: 212, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:32:30,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1009338.0, ans=0.125 2023-06-21 16:32:35,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1009398.0, ans=0.2 2023-06-21 16:32:45,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1009398.0, ans=0.2 2023-06-21 16:33:09,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1009518.0, ans=0.1 2023-06-21 16:33:20,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-06-21 16:33:24,867 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.861e+02 3.404e+02 4.012e+02 5.531e+02, threshold=6.808e+02, percent-clipped=0.0 2023-06-21 16:33:25,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1009518.0, ans=0.125 2023-06-21 16:33:55,165 INFO [train.py:996] (0/4) Epoch 6, batch 15800, loss[loss=0.2671, simple_loss=0.368, pruned_loss=0.08317, over 20790.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3084, pruned_loss=0.0866, over 4270783.56 frames. ], batch size: 608, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:34:00,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1009638.0, ans=0.05 2023-06-21 16:34:02,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1009638.0, ans=0.0 2023-06-21 16:34:47,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1009758.0, ans=0.125 2023-06-21 16:34:50,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1009818.0, ans=0.1 2023-06-21 16:35:29,301 INFO [train.py:996] (0/4) Epoch 6, batch 15850, loss[loss=0.2826, simple_loss=0.3431, pruned_loss=0.111, over 21261.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3123, pruned_loss=0.09016, over 4275645.37 frames. 
], batch size: 143, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:35:37,659 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5 2023-06-21 16:35:59,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1010058.0, ans=0.5 2023-06-21 16:36:28,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1010118.0, ans=0.2 2023-06-21 16:36:31,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=22.5 2023-06-21 16:36:32,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 2.973e+02 3.336e+02 4.018e+02 6.867e+02, threshold=6.671e+02, percent-clipped=1.0 2023-06-21 16:36:34,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1010118.0, ans=0.0 2023-06-21 16:36:47,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1010178.0, ans=0.1 2023-06-21 16:37:02,618 INFO [train.py:996] (0/4) Epoch 6, batch 15900, loss[loss=0.2244, simple_loss=0.2978, pruned_loss=0.07546, over 21468.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.308, pruned_loss=0.08962, over 4263945.99 frames. ], batch size: 389, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:37:05,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=12.0 2023-06-21 16:37:07,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1010238.0, ans=0.2 2023-06-21 16:37:38,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1010358.0, ans=0.125 2023-06-21 16:38:36,325 INFO [train.py:996] (0/4) Epoch 6, batch 15950, loss[loss=0.226, simple_loss=0.3149, pruned_loss=0.0685, over 21500.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3071, pruned_loss=0.08655, over 4260537.28 frames. ], batch size: 471, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:38:48,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1010538.0, ans=0.2 2023-06-21 16:38:53,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1010598.0, ans=0.125 2023-06-21 16:39:03,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1010598.0, ans=0.125 2023-06-21 16:39:09,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1010658.0, ans=0.1 2023-06-21 16:39:25,865 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.84 vs. 
limit=15.0 2023-06-21 16:39:40,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.630e+02 3.032e+02 3.635e+02 5.664e+02, threshold=6.064e+02, percent-clipped=0.0 2023-06-21 16:39:46,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1010778.0, ans=0.125 2023-06-21 16:40:06,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-06-21 16:40:10,506 INFO [train.py:996] (0/4) Epoch 6, batch 16000, loss[loss=0.2147, simple_loss=0.2943, pruned_loss=0.06755, over 21676.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3078, pruned_loss=0.08407, over 4266646.16 frames. ], batch size: 263, lr: 5.09e-03, grad_scale: 32.0 2023-06-21 16:40:16,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=15.0 2023-06-21 16:40:19,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1010838.0, ans=0.2 2023-06-21 16:40:25,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1010898.0, ans=0.1 2023-06-21 16:41:40,573 INFO [train.py:996] (0/4) Epoch 6, batch 16050, loss[loss=0.2994, simple_loss=0.4016, pruned_loss=0.09861, over 21259.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3111, pruned_loss=0.08217, over 4267648.72 frames. ], batch size: 548, lr: 5.09e-03, grad_scale: 32.0 2023-06-21 16:42:04,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1011198.0, ans=0.2 2023-06-21 16:42:06,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1011198.0, ans=0.125 2023-06-21 16:42:44,676 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.881e+02 3.788e+02 4.822e+02 9.882e+02, threshold=7.576e+02, percent-clipped=9.0 2023-06-21 16:43:12,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1011438.0, ans=10.0 2023-06-21 16:43:13,229 INFO [train.py:996] (0/4) Epoch 6, batch 16100, loss[loss=0.2169, simple_loss=0.287, pruned_loss=0.07346, over 21907.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3171, pruned_loss=0.08326, over 4270909.52 frames. ], batch size: 316, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:43:15,159 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:43:45,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1011558.0, ans=0.125 2023-06-21 16:44:04,165 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.94 vs. limit=15.0 2023-06-21 16:44:23,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-21 16:44:42,433 INFO [train.py:996] (0/4) Epoch 6, batch 16150, loss[loss=0.2424, simple_loss=0.3567, pruned_loss=0.06406, over 19868.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3189, pruned_loss=0.08561, over 4276689.42 frames. 
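The Whitening lines compare a per-module "metric" against a "limit" (e.g. metric=9.83 vs. limit=15.0) and only act when the metric exceeds it. One common statistic with this behaviour is the ratio of the mean squared eigenvalue of the feature covariance to the squared mean eigenvalue, which is 1.0 for perfectly whitened (isotropic) features and grows as the covariance becomes more anisotropic. Whether the training code uses exactly this ratio is an assumption; the sketch below only illustrates the idea.

    # Illustrative whitening metric: d * tr(C @ C) / tr(C)**2 per channel group.
    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels); channels split into num_groups groups
        n, c = x.shape
        x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
        x = x - x.mean(dim=1, keepdim=True)
        cov = torch.matmul(x.transpose(1, 2), x) / n          # (groups, d, d)
        d = cov.shape[-1]
        num = d * torch.einsum("gij,gji->g", cov, cov)        # d * tr(C @ C)
        den = torch.einsum("gii->g", cov) ** 2                # tr(C)**2
        return (num / den.clamp(min=1e-20)).mean().item()

    x = torch.randn(1000, 256)
    m = whitening_metric(x, num_groups=1)
    # roughly isotropic random features give a metric close to 1; a module
    # would only apply its whitening penalty when m exceeds its limit.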
], batch size: 702, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:45:05,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1011798.0, ans=0.125 2023-06-21 16:45:47,597 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.047e+02 3.537e+02 4.143e+02 9.363e+02, threshold=7.074e+02, percent-clipped=2.0 2023-06-21 16:46:09,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1011978.0, ans=0.2 2023-06-21 16:46:16,808 INFO [train.py:996] (0/4) Epoch 6, batch 16200, loss[loss=0.2576, simple_loss=0.323, pruned_loss=0.09614, over 21289.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3216, pruned_loss=0.08685, over 4287861.98 frames. ], batch size: 159, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:46:47,608 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:46:56,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1012158.0, ans=0.125 2023-06-21 16:47:27,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1012218.0, ans=0.0 2023-06-21 16:47:51,852 INFO [train.py:996] (0/4) Epoch 6, batch 16250, loss[loss=0.196, simple_loss=0.2664, pruned_loss=0.06281, over 21251.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3229, pruned_loss=0.08828, over 4292574.61 frames. ], batch size: 176, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:47:58,763 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-21 16:48:31,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1012458.0, ans=0.0 2023-06-21 16:48:39,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1012458.0, ans=0.1 2023-06-21 16:49:01,827 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.762e+02 3.334e+02 4.108e+02 7.386e+02, threshold=6.668e+02, percent-clipped=1.0 2023-06-21 16:49:02,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1012518.0, ans=0.0 2023-06-21 16:49:26,015 INFO [train.py:996] (0/4) Epoch 6, batch 16300, loss[loss=0.2947, simple_loss=0.3697, pruned_loss=0.1099, over 20705.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3168, pruned_loss=0.08442, over 4282902.80 frames. ], batch size: 607, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:49:42,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1012698.0, ans=0.0 2023-06-21 16:49:43,468 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.86 vs. 
limit=6.0 2023-06-21 16:49:44,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1012698.0, ans=0.125 2023-06-21 16:49:47,534 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:49:50,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1012698.0, ans=0.0 2023-06-21 16:50:08,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1012758.0, ans=0.2 2023-06-21 16:50:47,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1012878.0, ans=0.1 2023-06-21 16:50:56,875 INFO [train.py:996] (0/4) Epoch 6, batch 16350, loss[loss=0.363, simple_loss=0.4114, pruned_loss=0.1573, over 21402.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3171, pruned_loss=0.08529, over 4280045.51 frames. ], batch size: 471, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:51:09,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1012938.0, ans=0.2 2023-06-21 16:51:41,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1013058.0, ans=0.125 2023-06-21 16:52:11,283 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.615e+02 3.247e+02 3.873e+02 7.213e+02, threshold=6.493e+02, percent-clipped=3.0 2023-06-21 16:52:12,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.81 vs. limit=10.0 2023-06-21 16:52:19,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-21 16:52:22,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1013178.0, ans=0.125 2023-06-21 16:52:26,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1013178.0, ans=0.0 2023-06-21 16:52:26,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1013178.0, ans=0.1 2023-06-21 16:52:30,810 INFO [train.py:996] (0/4) Epoch 6, batch 16400, loss[loss=0.2278, simple_loss=0.2956, pruned_loss=0.07998, over 21913.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3201, pruned_loss=0.08701, over 4279413.27 frames. ], batch size: 107, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:52:58,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1013298.0, ans=0.05 2023-06-21 16:53:07,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1013298.0, ans=0.125 2023-06-21 16:53:46,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. 
limit=15.0 2023-06-21 16:53:51,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1013478.0, ans=0.0 2023-06-21 16:54:04,557 INFO [train.py:996] (0/4) Epoch 6, batch 16450, loss[loss=0.2427, simple_loss=0.3137, pruned_loss=0.08583, over 21876.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3196, pruned_loss=0.0878, over 4288973.99 frames. ], batch size: 371, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:54:27,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1013598.0, ans=0.125 2023-06-21 16:54:44,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1013658.0, ans=0.0 2023-06-21 16:55:01,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1013658.0, ans=0.1 2023-06-21 16:55:13,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=15.0 2023-06-21 16:55:14,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1013718.0, ans=0.0 2023-06-21 16:55:19,955 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.335e+02 2.857e+02 3.262e+02 3.717e+02 6.839e+02, threshold=6.523e+02, percent-clipped=2.0 2023-06-21 16:55:29,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1013778.0, ans=0.125 2023-06-21 16:55:36,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1013778.0, ans=0.1 2023-06-21 16:55:39,118 INFO [train.py:996] (0/4) Epoch 6, batch 16500, loss[loss=0.2008, simple_loss=0.2664, pruned_loss=0.06761, over 21447.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.318, pruned_loss=0.08713, over 4285137.80 frames. ], batch size: 211, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:55:40,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1013838.0, ans=0.125 2023-06-21 16:56:37,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1013958.0, ans=10.0 2023-06-21 16:57:17,910 INFO [train.py:996] (0/4) Epoch 6, batch 16550, loss[loss=0.2687, simple_loss=0.3436, pruned_loss=0.09688, over 21700.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3129, pruned_loss=0.08417, over 4282986.82 frames. ], batch size: 351, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:57:26,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1014138.0, ans=0.125 2023-06-21 16:57:59,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1014198.0, ans=0.125 2023-06-21 16:58:29,283 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 2.954e+02 3.435e+02 4.498e+02 9.143e+02, threshold=6.870e+02, percent-clipped=8.0 2023-06-21 16:58:48,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1014378.0, ans=0.125 2023-06-21 16:58:54,075 INFO [train.py:996] (0/4) Epoch 6, batch 16600, loss[loss=0.2705, simple_loss=0.3714, pruned_loss=0.08473, over 21557.00 frames. 
], tot_loss[loss=0.2497, simple_loss=0.3235, pruned_loss=0.08801, over 4285760.21 frames. ], batch size: 230, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:59:02,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1014438.0, ans=0.0 2023-06-21 16:59:03,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1014438.0, ans=0.1 2023-06-21 16:59:59,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1014618.0, ans=0.125 2023-06-21 17:00:09,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.55 vs. limit=10.0 2023-06-21 17:00:12,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1014678.0, ans=0.04949747468305833 2023-06-21 17:00:14,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=1014678.0, ans=12.0 2023-06-21 17:00:28,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1014678.0, ans=0.04949747468305833 2023-06-21 17:00:30,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1014678.0, ans=0.2 2023-06-21 17:00:33,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1014738.0, ans=0.125 2023-06-21 17:00:34,769 INFO [train.py:996] (0/4) Epoch 6, batch 16650, loss[loss=0.285, simple_loss=0.354, pruned_loss=0.108, over 21526.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3333, pruned_loss=0.09094, over 4281949.87 frames. ], batch size: 211, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 17:00:43,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1014738.0, ans=0.125 2023-06-21 17:01:01,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1014798.0, ans=0.125 2023-06-21 17:01:32,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.52 vs. limit=10.0 2023-06-21 17:01:49,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1014918.0, ans=0.125 2023-06-21 17:01:52,125 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.123e+02 3.593e+02 4.644e+02 7.930e+02, threshold=7.186e+02, percent-clipped=2.0 2023-06-21 17:02:21,312 INFO [train.py:996] (0/4) Epoch 6, batch 16700, loss[loss=0.209, simple_loss=0.2639, pruned_loss=0.07706, over 21902.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3353, pruned_loss=0.09214, over 4278064.90 frames. 
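Each train.py batch record reports both the loss of the current batch (over its own frames) and a tot_loss over a few million frames, i.e. a smoothed, frame-weighted statistic of recent training data. The sketch below shows one way such a smoothed value could be maintained; the fold-in rule and the decay constant are assumptions, and the actual averaging used by the training code may differ.

    # Illustrative frame-weighted running loss with exponential decay.
    class RunningLoss:
        def __init__(self, decay: float = 0.999):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, loss: float, num_frames: float) -> float:
            self.loss_sum = self.decay * self.loss_sum + loss * num_frames
            self.frames = self.decay * self.frames + num_frames
            return self.loss_sum / self.frames   # current smoothed loss

    tot = RunningLoss()
    tot.update(0.2497, 21557.0)   # per-batch numbers in the log are of this kind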
], batch size: 98, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 17:02:23,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1015038.0, ans=0.1 2023-06-21 17:02:35,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1015098.0, ans=0.0 2023-06-21 17:02:56,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0 2023-06-21 17:03:07,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1015158.0, ans=0.1 2023-06-21 17:03:23,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1015218.0, ans=0.0 2023-06-21 17:03:35,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-06-21 17:03:39,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-21 17:03:58,842 INFO [train.py:996] (0/4) Epoch 6, batch 16750, loss[loss=0.2585, simple_loss=0.338, pruned_loss=0.08954, over 21639.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3377, pruned_loss=0.09435, over 4273310.75 frames. ], batch size: 263, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 17:03:59,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1015338.0, ans=0.125 2023-06-21 17:04:01,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1015338.0, ans=0.0 2023-06-21 17:04:27,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1015398.0, ans=0.0 2023-06-21 17:04:37,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1015398.0, ans=0.125 2023-06-21 17:05:12,288 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.369e+02 3.506e+02 4.253e+02 6.038e+02 1.079e+03, threshold=8.506e+02, percent-clipped=10.0 2023-06-21 17:05:21,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1015578.0, ans=0.2 2023-06-21 17:05:33,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1015638.0, ans=0.125 2023-06-21 17:05:34,894 INFO [train.py:996] (0/4) Epoch 6, batch 16800, loss[loss=0.2485, simple_loss=0.3232, pruned_loss=0.0869, over 21821.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.343, pruned_loss=0.09524, over 4263255.56 frames. ], batch size: 298, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 17:06:44,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1015818.0, ans=0.2 2023-06-21 17:06:45,668 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.83 vs. 
limit=15.0 2023-06-21 17:06:55,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1015878.0, ans=0.0 2023-06-21 17:07:05,080 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-21 17:07:08,802 INFO [train.py:996] (0/4) Epoch 6, batch 16850, loss[loss=0.248, simple_loss=0.3065, pruned_loss=0.09478, over 21486.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3383, pruned_loss=0.09492, over 4275603.85 frames. ], batch size: 194, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:07:13,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1015938.0, ans=0.0 2023-06-21 17:07:20,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1015938.0, ans=0.0 2023-06-21 17:08:05,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1016058.0, ans=0.5 2023-06-21 17:08:07,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1016058.0, ans=0.2 2023-06-21 17:08:21,511 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 2.931e+02 3.423e+02 4.482e+02 7.655e+02, threshold=6.845e+02, percent-clipped=0.0 2023-06-21 17:08:43,845 INFO [train.py:996] (0/4) Epoch 6, batch 16900, loss[loss=0.2482, simple_loss=0.3103, pruned_loss=0.09301, over 21559.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3321, pruned_loss=0.09264, over 4271572.51 frames. ], batch size: 441, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:09:06,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1016298.0, ans=0.0 2023-06-21 17:09:32,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1016358.0, ans=0.125 2023-06-21 17:10:16,684 INFO [train.py:996] (0/4) Epoch 6, batch 16950, loss[loss=0.2419, simple_loss=0.3107, pruned_loss=0.08652, over 21865.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3255, pruned_loss=0.0914, over 4278841.15 frames. ], batch size: 124, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:10:20,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1016538.0, ans=0.125 2023-06-21 17:10:27,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1016538.0, ans=0.125 2023-06-21 17:10:42,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1016598.0, ans=0.0 2023-06-21 17:10:58,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-21 17:11:04,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-21 17:11:10,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. 
limit=10.0 2023-06-21 17:11:17,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.38 vs. limit=15.0 2023-06-21 17:11:25,146 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=22.5 2023-06-21 17:11:26,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.733e+02 3.000e+02 3.564e+02 5.984e+02, threshold=6.000e+02, percent-clipped=0.0 2023-06-21 17:11:29,746 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=15.0 2023-06-21 17:11:46,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1016778.0, ans=0.0 2023-06-21 17:11:47,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1016778.0, ans=0.125 2023-06-21 17:11:50,086 INFO [train.py:996] (0/4) Epoch 6, batch 17000, loss[loss=0.2534, simple_loss=0.3149, pruned_loss=0.09591, over 21939.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3213, pruned_loss=0.09121, over 4286834.84 frames. ], batch size: 316, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:12:30,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1016958.0, ans=0.125 2023-06-21 17:12:45,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1017018.0, ans=0.0 2023-06-21 17:13:20,261 INFO [train.py:996] (0/4) Epoch 6, batch 17050, loss[loss=0.2905, simple_loss=0.3736, pruned_loss=0.1037, over 21388.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3286, pruned_loss=0.0942, over 4289615.90 frames. ], batch size: 548, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:13:40,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1017138.0, ans=0.025 2023-06-21 17:13:47,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1017198.0, ans=0.125 2023-06-21 17:14:08,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1017258.0, ans=0.0 2023-06-21 17:14:30,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.382e+02 3.084e+02 3.616e+02 4.433e+02 7.180e+02, threshold=7.232e+02, percent-clipped=5.0 2023-06-21 17:14:51,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1017438.0, ans=0.1 2023-06-21 17:14:52,400 INFO [train.py:996] (0/4) Epoch 6, batch 17100, loss[loss=0.2718, simple_loss=0.3355, pruned_loss=0.1041, over 21733.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3278, pruned_loss=0.09475, over 4284391.50 frames. 
], batch size: 112, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:14:55,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1017438.0, ans=0.1 2023-06-21 17:16:13,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1017678.0, ans=0.125 2023-06-21 17:16:26,694 INFO [train.py:996] (0/4) Epoch 6, batch 17150, loss[loss=0.2316, simple_loss=0.3054, pruned_loss=0.0789, over 21793.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3229, pruned_loss=0.09394, over 4285787.34 frames. ], batch size: 332, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:16:40,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=22.5 2023-06-21 17:17:04,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-21 17:17:39,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.347e+02 2.784e+02 3.036e+02 3.616e+02 5.334e+02, threshold=6.072e+02, percent-clipped=0.0 2023-06-21 17:17:47,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1017978.0, ans=0.04949747468305833 2023-06-21 17:18:05,513 INFO [train.py:996] (0/4) Epoch 6, batch 17200, loss[loss=0.2616, simple_loss=0.3276, pruned_loss=0.09787, over 21530.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3213, pruned_loss=0.09297, over 4291413.23 frames. ], batch size: 211, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:18:12,973 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.12 vs. limit=10.0 2023-06-21 17:19:40,277 INFO [train.py:996] (0/4) Epoch 6, batch 17250, loss[loss=0.2576, simple_loss=0.3237, pruned_loss=0.09572, over 21707.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.326, pruned_loss=0.09539, over 4284736.94 frames. ], batch size: 298, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:19:40,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1018338.0, ans=0.0 2023-06-21 17:19:55,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1018398.0, ans=0.125 2023-06-21 17:20:07,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.20 vs. limit=10.0 2023-06-21 17:20:54,741 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.506e+02 3.363e+02 4.058e+02 5.457e+02 1.011e+03, threshold=8.116e+02, percent-clipped=16.0 2023-06-21 17:20:58,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1018578.0, ans=0.125 2023-06-21 17:21:00,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1018578.0, ans=0.0 2023-06-21 17:21:09,829 INFO [train.py:996] (0/4) Epoch 6, batch 17300, loss[loss=0.2783, simple_loss=0.3458, pruned_loss=0.1054, over 21771.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.335, pruned_loss=0.0984, over 4282005.37 frames. 
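The lr field in the batch records decays slowly as training progresses (5.10e-03 down to 5.06e-03 across this stretch of epoch 6). A learning-rate rule with that general shape combines a batch-count decay term with an epoch decay term; icefall's Eden scheduler has this form, but the exact formula and the constants below should be treated as assumptions rather than a transcription of this run's configuration.

    # Sketch of an Eden-like learning-rate rule; constants are placeholders.
    def eden_like_lr(base_lr: float, batch: int, epoch: float,
                     lr_batches: float = 5000.0, lr_epochs: float = 4.0) -> float:
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # e.g. lr after ~1.0M batches in epoch 6 (base_lr 0.05 is a made-up value)
    print(eden_like_lr(0.05, batch=1_000_000, epoch=6.0))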
], batch size: 332, lr: 5.07e-03, grad_scale: 16.0 2023-06-21 17:21:25,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1018698.0, ans=0.0 2023-06-21 17:21:34,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1018698.0, ans=10.0 2023-06-21 17:22:11,374 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.37 vs. limit=22.5 2023-06-21 17:22:37,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1018878.0, ans=0.125 2023-06-21 17:22:40,279 INFO [train.py:996] (0/4) Epoch 6, batch 17350, loss[loss=0.201, simple_loss=0.2432, pruned_loss=0.07941, over 19972.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3351, pruned_loss=0.09735, over 4274108.53 frames. ], batch size: 703, lr: 5.07e-03, grad_scale: 16.0 2023-06-21 17:22:51,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-21 17:22:57,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1018938.0, ans=0.0 2023-06-21 17:23:02,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1018998.0, ans=0.2 2023-06-21 17:23:56,314 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.958e+02 3.315e+02 3.844e+02 7.686e+02, threshold=6.630e+02, percent-clipped=0.0 2023-06-21 17:23:58,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1019178.0, ans=0.1 2023-06-21 17:24:06,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019178.0, ans=0.1 2023-06-21 17:24:11,648 INFO [train.py:996] (0/4) Epoch 6, batch 17400, loss[loss=0.2185, simple_loss=0.2996, pruned_loss=0.06873, over 21765.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3301, pruned_loss=0.09282, over 4272033.06 frames. ], batch size: 282, lr: 5.07e-03, grad_scale: 16.0 2023-06-21 17:24:12,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1019238.0, ans=0.1 2023-06-21 17:24:24,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1019238.0, ans=0.125 2023-06-21 17:24:28,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-21 17:24:52,690 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=15.0 2023-06-21 17:25:08,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1019358.0, ans=0.125 2023-06-21 17:25:22,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1019418.0, ans=0.0 2023-06-21 17:25:46,437 INFO [train.py:996] (0/4) Epoch 6, batch 17450, loss[loss=0.2039, simple_loss=0.3087, pruned_loss=0.04955, over 21568.00 frames. 
], tot_loss[loss=0.2531, simple_loss=0.326, pruned_loss=0.09015, over 4273437.95 frames. ], batch size: 389, lr: 5.07e-03, grad_scale: 16.0 2023-06-21 17:26:03,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1019538.0, ans=0.125 2023-06-21 17:26:12,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1019598.0, ans=0.2 2023-06-21 17:26:30,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1019658.0, ans=0.2 2023-06-21 17:26:32,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=1019658.0, ans=15.0 2023-06-21 17:26:33,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1019658.0, ans=0.1 2023-06-21 17:26:50,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1019718.0, ans=0.0 2023-06-21 17:26:53,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1019718.0, ans=0.2 2023-06-21 17:27:05,009 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.736e+02 3.255e+02 3.942e+02 6.172e+02, threshold=6.510e+02, percent-clipped=0.0 2023-06-21 17:27:13,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019778.0, ans=0.1 2023-06-21 17:27:18,936 INFO [train.py:996] (0/4) Epoch 6, batch 17500, loss[loss=0.2148, simple_loss=0.2898, pruned_loss=0.0699, over 21649.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3218, pruned_loss=0.08734, over 4281846.59 frames. ], batch size: 230, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:28:32,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1020078.0, ans=0.1 2023-06-21 17:28:42,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.62 vs. limit=10.0 2023-06-21 17:28:50,808 INFO [train.py:996] (0/4) Epoch 6, batch 17550, loss[loss=0.3388, simple_loss=0.3964, pruned_loss=0.1406, over 21450.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3222, pruned_loss=0.08597, over 4279666.60 frames. ], batch size: 507, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:29:59,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1020318.0, ans=0.125 2023-06-21 17:30:11,113 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.780e+02 3.218e+02 4.154e+02 6.196e+02, threshold=6.435e+02, percent-clipped=0.0 2023-06-21 17:30:13,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.53 vs. 
limit=15.0 2023-06-21 17:30:20,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1020378.0, ans=0.0 2023-06-21 17:30:23,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1020438.0, ans=0.2 2023-06-21 17:30:24,495 INFO [train.py:996] (0/4) Epoch 6, batch 17600, loss[loss=0.2884, simple_loss=0.355, pruned_loss=0.1109, over 21595.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3249, pruned_loss=0.0862, over 4266612.58 frames. ], batch size: 389, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:30:26,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1020438.0, ans=0.125 2023-06-21 17:30:26,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1020438.0, ans=0.05 2023-06-21 17:30:38,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1020438.0, ans=0.125 2023-06-21 17:30:44,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1020438.0, ans=0.125 2023-06-21 17:30:44,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1020438.0, ans=0.0 2023-06-21 17:30:48,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1020498.0, ans=0.125 2023-06-21 17:31:08,514 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. limit=10.0 2023-06-21 17:31:47,243 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-06-21 17:31:56,195 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2023-06-21 17:31:59,658 INFO [train.py:996] (0/4) Epoch 6, batch 17650, loss[loss=0.2494, simple_loss=0.3223, pruned_loss=0.08825, over 21344.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3224, pruned_loss=0.08651, over 4267546.64 frames. ], batch size: 549, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:32:35,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1020798.0, ans=0.1 2023-06-21 17:32:47,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1020858.0, ans=0.2 2023-06-21 17:32:53,956 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. 
limit=22.5 2023-06-21 17:33:19,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1020978.0, ans=0.125 2023-06-21 17:33:20,341 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.977e+02 3.448e+02 4.059e+02 7.958e+02, threshold=6.896e+02, percent-clipped=7.0 2023-06-21 17:33:20,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1020978.0, ans=0.125 2023-06-21 17:33:39,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1020978.0, ans=0.125 2023-06-21 17:33:42,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1021038.0, ans=0.2 2023-06-21 17:33:47,964 INFO [train.py:996] (0/4) Epoch 6, batch 17700, loss[loss=0.2361, simple_loss=0.3126, pruned_loss=0.07985, over 20043.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3165, pruned_loss=0.08367, over 4268071.12 frames. ], batch size: 702, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:33:51,942 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=12.0 2023-06-21 17:34:00,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1021038.0, ans=0.1 2023-06-21 17:34:19,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.48 vs. limit=10.0 2023-06-21 17:35:18,363 INFO [train.py:996] (0/4) Epoch 6, batch 17750, loss[loss=0.3379, simple_loss=0.3908, pruned_loss=0.1425, over 21433.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3242, pruned_loss=0.08738, over 4271753.90 frames. ], batch size: 471, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:35:34,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.00 vs. limit=22.5 2023-06-21 17:35:38,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1021398.0, ans=0.125 2023-06-21 17:36:23,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1021518.0, ans=0.0 2023-06-21 17:36:32,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1021578.0, ans=0.1 2023-06-21 17:36:36,827 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.843e+02 3.336e+02 3.898e+02 5.169e+02, threshold=6.672e+02, percent-clipped=0.0 2023-06-21 17:36:37,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1021578.0, ans=0.0 2023-06-21 17:36:39,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1021578.0, ans=0.125 2023-06-21 17:36:49,095 INFO [train.py:996] (0/4) Epoch 6, batch 17800, loss[loss=0.2258, simple_loss=0.3041, pruned_loss=0.07375, over 21470.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3243, pruned_loss=0.08717, over 4265822.20 frames. 
], batch size: 211, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:37:20,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1021698.0, ans=0.0 2023-06-21 17:37:40,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1021818.0, ans=0.2 2023-06-21 17:37:48,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1021818.0, ans=0.1 2023-06-21 17:38:03,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1021878.0, ans=0.2 2023-06-21 17:38:04,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-21 17:38:19,585 INFO [train.py:996] (0/4) Epoch 6, batch 17850, loss[loss=0.2494, simple_loss=0.312, pruned_loss=0.09337, over 20023.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3248, pruned_loss=0.08796, over 4262080.68 frames. ], batch size: 702, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:38:32,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1021938.0, ans=0.0 2023-06-21 17:38:54,146 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-21 17:39:37,563 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 3.033e+02 3.417e+02 4.325e+02 8.227e+02, threshold=6.834e+02, percent-clipped=5.0 2023-06-21 17:39:54,790 INFO [train.py:996] (0/4) Epoch 6, batch 17900, loss[loss=0.2473, simple_loss=0.3414, pruned_loss=0.07657, over 21850.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3311, pruned_loss=0.09114, over 4265713.20 frames. ], batch size: 282, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:40:30,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1022358.0, ans=0.0 2023-06-21 17:40:49,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1022358.0, ans=0.125 2023-06-21 17:41:29,376 INFO [train.py:996] (0/4) Epoch 6, batch 17950, loss[loss=0.1908, simple_loss=0.2849, pruned_loss=0.04837, over 21623.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3292, pruned_loss=0.08697, over 4259842.70 frames. ], batch size: 263, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:41:46,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=22.5 2023-06-21 17:42:34,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1022718.0, ans=0.125 2023-06-21 17:42:45,463 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.768e+02 3.134e+02 4.074e+02 6.684e+02, threshold=6.268e+02, percent-clipped=0.0 2023-06-21 17:42:54,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1022778.0, ans=10.0 2023-06-21 17:42:57,289 INFO [train.py:996] (0/4) Epoch 6, batch 18000, loss[loss=0.2018, simple_loss=0.2585, pruned_loss=0.0725, over 20689.00 frames. 
], tot_loss[loss=0.2469, simple_loss=0.3223, pruned_loss=0.0858, over 4257722.63 frames. ], batch size: 607, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:42:57,290 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 17:43:06,536 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.4215, 0.9774, 1.3495, 1.5732, 1.2615, 1.4420, 1.4669, 1.2851], device='cuda:0') 2023-06-21 17:43:13,461 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2661, simple_loss=0.365, pruned_loss=0.08355, over 1796401.00 frames. 2023-06-21 17:43:13,462 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 17:43:58,202 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.08 vs. limit=22.5 2023-06-21 17:44:01,174 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.78 vs. limit=15.0 2023-06-21 17:44:42,390 INFO [train.py:996] (0/4) Epoch 6, batch 18050, loss[loss=0.2253, simple_loss=0.2995, pruned_loss=0.07555, over 21743.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3155, pruned_loss=0.0842, over 4259141.92 frames. ], batch size: 124, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:46:01,046 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.145e+02 3.852e+02 4.625e+02 8.498e+02, threshold=7.705e+02, percent-clipped=3.0 2023-06-21 17:46:17,972 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-21 17:46:18,329 INFO [train.py:996] (0/4) Epoch 6, batch 18100, loss[loss=0.2498, simple_loss=0.3433, pruned_loss=0.07815, over 21667.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3221, pruned_loss=0.08714, over 4267918.89 frames. ], batch size: 414, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:46:18,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1023438.0, ans=0.125 2023-06-21 17:47:01,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1023558.0, ans=0.125 2023-06-21 17:47:48,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1023678.0, ans=0.125 2023-06-21 17:47:52,821 INFO [train.py:996] (0/4) Epoch 6, batch 18150, loss[loss=0.2215, simple_loss=0.2942, pruned_loss=0.07441, over 21286.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3219, pruned_loss=0.08641, over 4251668.27 frames. ], batch size: 144, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:48:10,295 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.39 vs. 
limit=10.0 2023-06-21 17:48:14,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1023798.0, ans=0.125 2023-06-21 17:48:29,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1023798.0, ans=0.125 2023-06-21 17:48:48,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1023918.0, ans=0.125 2023-06-21 17:49:04,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.051e+02 2.867e+02 3.393e+02 4.008e+02 7.433e+02, threshold=6.785e+02, percent-clipped=0.0 2023-06-21 17:49:16,103 INFO [train.py:996] (0/4) Epoch 6, batch 18200, loss[loss=0.2006, simple_loss=0.2771, pruned_loss=0.06199, over 21737.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3163, pruned_loss=0.08651, over 4260989.60 frames. ], batch size: 333, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:49:36,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1024038.0, ans=0.0 2023-06-21 17:49:51,074 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:49:52,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1024098.0, ans=0.0 2023-06-21 17:50:38,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1024278.0, ans=0.04949747468305833 2023-06-21 17:50:43,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-21 17:50:47,204 INFO [train.py:996] (0/4) Epoch 6, batch 18250, loss[loss=0.2676, simple_loss=0.3255, pruned_loss=0.1048, over 21857.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3083, pruned_loss=0.08371, over 4267410.71 frames. ], batch size: 371, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:51:23,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=12.0 2023-06-21 17:51:30,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1024458.0, ans=0.1 2023-06-21 17:51:35,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1024458.0, ans=0.5 2023-06-21 17:51:52,854 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:52:04,325 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.691e+02 3.171e+02 4.057e+02 6.181e+02, threshold=6.342e+02, percent-clipped=0.0 2023-06-21 17:52:16,397 INFO [train.py:996] (0/4) Epoch 6, batch 18300, loss[loss=0.3104, simple_loss=0.3871, pruned_loss=0.1168, over 21783.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.308, pruned_loss=0.08289, over 4263474.19 frames. 
], batch size: 414, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:52:45,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1024698.0, ans=0.125 2023-06-21 17:52:54,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1024698.0, ans=0.125 2023-06-21 17:53:21,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1024818.0, ans=0.125 2023-06-21 17:53:26,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1024818.0, ans=0.125 2023-06-21 17:53:39,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.44 vs. limit=15.0 2023-06-21 17:53:49,944 INFO [train.py:996] (0/4) Epoch 6, batch 18350, loss[loss=0.2523, simple_loss=0.3177, pruned_loss=0.09346, over 21752.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3126, pruned_loss=0.08232, over 4238874.82 frames. ], batch size: 371, lr: 5.05e-03, grad_scale: 8.0 2023-06-21 17:53:51,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1024938.0, ans=0.2 2023-06-21 17:54:19,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1024998.0, ans=0.0 2023-06-21 17:54:39,259 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-06-21 17:55:01,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1025118.0, ans=0.0 2023-06-21 17:55:13,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.969e+02 3.614e+02 4.589e+02 8.811e+02, threshold=7.228e+02, percent-clipped=4.0 2023-06-21 17:55:24,037 INFO [train.py:996] (0/4) Epoch 6, batch 18400, loss[loss=0.2244, simple_loss=0.2859, pruned_loss=0.08145, over 21713.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3083, pruned_loss=0.08113, over 4233753.62 frames. ], batch size: 112, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:55:59,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1025298.0, ans=0.125 2023-06-21 17:56:10,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1025358.0, ans=10.0 2023-06-21 17:56:54,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1025478.0, ans=0.125 2023-06-21 17:56:57,271 INFO [train.py:996] (0/4) Epoch 6, batch 18450, loss[loss=0.1913, simple_loss=0.2588, pruned_loss=0.06188, over 15964.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3043, pruned_loss=0.07744, over 4234627.39 frames. ], batch size: 60, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:57:14,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. 
limit=15.0 2023-06-21 17:57:17,513 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:57:31,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1025598.0, ans=0.0 2023-06-21 17:57:58,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1025658.0, ans=0.125 2023-06-21 17:58:10,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1025718.0, ans=0.035 2023-06-21 17:58:19,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1025778.0, ans=10.0 2023-06-21 17:58:20,292 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.633e+02 3.038e+02 3.698e+02 5.788e+02, threshold=6.076e+02, percent-clipped=0.0 2023-06-21 17:58:22,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=12.0 2023-06-21 17:58:30,592 INFO [train.py:996] (0/4) Epoch 6, batch 18500, loss[loss=0.2432, simple_loss=0.3011, pruned_loss=0.09268, over 21985.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2996, pruned_loss=0.07663, over 4245261.29 frames. ], batch size: 103, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:58:39,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1025838.0, ans=0.09899494936611666 2023-06-21 17:58:44,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1025898.0, ans=0.025 2023-06-21 17:59:37,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1026018.0, ans=0.0 2023-06-21 18:00:02,407 INFO [train.py:996] (0/4) Epoch 6, batch 18550, loss[loss=0.2194, simple_loss=0.2839, pruned_loss=0.07741, over 21736.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2973, pruned_loss=0.076, over 4243551.75 frames. ], batch size: 316, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:00:08,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1026138.0, ans=0.0 2023-06-21 18:00:21,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1026198.0, ans=0.2 2023-06-21 18:00:56,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1026258.0, ans=0.0 2023-06-21 18:01:17,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026318.0, ans=0.1 2023-06-21 18:01:26,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.898e+02 3.436e+02 4.284e+02 7.618e+02, threshold=6.872e+02, percent-clipped=4.0 2023-06-21 18:01:27,136 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-21 18:01:36,733 INFO [train.py:996] (0/4) Epoch 6, batch 18600, loss[loss=0.2399, simple_loss=0.2938, pruned_loss=0.09299, over 15811.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2975, pruned_loss=0.07702, over 4241494.53 frames. 
], batch size: 63, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:01:44,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026438.0, ans=0.1 2023-06-21 18:01:50,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1026498.0, ans=0.125 2023-06-21 18:02:46,208 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=15.0 2023-06-21 18:03:05,864 INFO [train.py:996] (0/4) Epoch 6, batch 18650, loss[loss=0.2092, simple_loss=0.2777, pruned_loss=0.07036, over 21733.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2977, pruned_loss=0.07729, over 4251696.93 frames. ], batch size: 124, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:03:06,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2023-06-21 18:04:18,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026918.0, ans=0.1 2023-06-21 18:04:28,177 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.661e+02 3.021e+02 3.518e+02 6.281e+02, threshold=6.043e+02, percent-clipped=0.0 2023-06-21 18:04:38,440 INFO [train.py:996] (0/4) Epoch 6, batch 18700, loss[loss=0.2091, simple_loss=0.2674, pruned_loss=0.07542, over 21597.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.297, pruned_loss=0.07967, over 4254154.37 frames. ], batch size: 230, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:04:43,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1027038.0, ans=0.1 2023-06-21 18:05:29,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1027158.0, ans=0.04949747468305833 2023-06-21 18:05:55,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1027278.0, ans=0.0 2023-06-21 18:06:11,381 INFO [train.py:996] (0/4) Epoch 6, batch 18750, loss[loss=0.2316, simple_loss=0.2955, pruned_loss=0.0839, over 21210.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.2998, pruned_loss=0.08276, over 4264252.59 frames. ], batch size: 608, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:07:34,242 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.279e+02 2.900e+02 3.294e+02 3.931e+02 7.649e+02, threshold=6.589e+02, percent-clipped=4.0 2023-06-21 18:07:45,377 INFO [train.py:996] (0/4) Epoch 6, batch 18800, loss[loss=0.2144, simple_loss=0.2954, pruned_loss=0.06664, over 21360.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3059, pruned_loss=0.08368, over 4271344.35 frames. 
], batch size: 211, lr: 5.05e-03, grad_scale: 32.0 2023-06-21 18:07:51,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1027638.0, ans=0.0 2023-06-21 18:07:52,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1027638.0, ans=0.1 2023-06-21 18:07:59,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1027638.0, ans=0.2 2023-06-21 18:08:06,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1027698.0, ans=0.0 2023-06-21 18:08:26,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1027758.0, ans=0.2 2023-06-21 18:09:06,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-21 18:09:18,555 INFO [train.py:996] (0/4) Epoch 6, batch 18850, loss[loss=0.2105, simple_loss=0.2869, pruned_loss=0.06704, over 21691.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3023, pruned_loss=0.07888, over 4269422.75 frames. ], batch size: 298, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:09:53,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-21 18:10:27,016 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:10:41,517 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.769e+02 2.569e+02 2.914e+02 3.331e+02 4.866e+02, threshold=5.828e+02, percent-clipped=0.0 2023-06-21 18:10:51,729 INFO [train.py:996] (0/4) Epoch 6, batch 18900, loss[loss=0.2271, simple_loss=0.2919, pruned_loss=0.08115, over 21710.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2995, pruned_loss=0.07898, over 4268146.42 frames. ], batch size: 391, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:12:23,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1028538.0, ans=0.0 2023-06-21 18:12:24,802 INFO [train.py:996] (0/4) Epoch 6, batch 18950, loss[loss=0.2615, simple_loss=0.3221, pruned_loss=0.1005, over 21747.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.301, pruned_loss=0.08158, over 4279980.94 frames. ], batch size: 441, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:12:25,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=22.5 2023-06-21 18:13:28,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1028718.0, ans=0.0 2023-06-21 18:13:33,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-21 18:13:34,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1028718.0, ans=0.125 2023-06-21 18:13:37,898 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.48 vs. 
limit=12.0 2023-06-21 18:13:48,639 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 3.005e+02 3.685e+02 4.757e+02 8.623e+02, threshold=7.371e+02, percent-clipped=7.0 2023-06-21 18:14:02,025 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.02 vs. limit=10.0 2023-06-21 18:14:04,059 INFO [train.py:996] (0/4) Epoch 6, batch 19000, loss[loss=0.2454, simple_loss=0.2936, pruned_loss=0.09863, over 21848.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3119, pruned_loss=0.0842, over 4283207.79 frames. ], batch size: 98, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:14:06,686 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=22.5 2023-06-21 18:14:09,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1028838.0, ans=0.2 2023-06-21 18:14:45,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1028898.0, ans=0.5 2023-06-21 18:15:36,831 INFO [train.py:996] (0/4) Epoch 6, batch 19050, loss[loss=0.2657, simple_loss=0.3224, pruned_loss=0.1044, over 21205.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.317, pruned_loss=0.0889, over 4285907.97 frames. ], batch size: 143, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:16:41,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1029318.0, ans=0.125 2023-06-21 18:16:51,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-21 18:16:54,976 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 2.879e+02 3.312e+02 3.801e+02 5.598e+02, threshold=6.624e+02, percent-clipped=0.0 2023-06-21 18:17:10,690 INFO [train.py:996] (0/4) Epoch 6, batch 19100, loss[loss=0.2082, simple_loss=0.2726, pruned_loss=0.07188, over 21251.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3149, pruned_loss=0.08995, over 4279863.92 frames. ], batch size: 548, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:17:50,161 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.76 vs. 
limit=15.0 2023-06-21 18:17:51,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1029498.0, ans=0.0 2023-06-21 18:17:57,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1029558.0, ans=0.035 2023-06-21 18:18:03,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1029558.0, ans=0.125 2023-06-21 18:18:04,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1029558.0, ans=0.125 2023-06-21 18:18:07,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1029558.0, ans=0.125 2023-06-21 18:18:37,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1029678.0, ans=0.2 2023-06-21 18:18:51,227 INFO [train.py:996] (0/4) Epoch 6, batch 19150, loss[loss=0.2883, simple_loss=0.3827, pruned_loss=0.09697, over 21657.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3168, pruned_loss=0.09009, over 4282118.21 frames. ], batch size: 441, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:19:13,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1029738.0, ans=0.0 2023-06-21 18:19:16,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1029798.0, ans=0.1 2023-06-21 18:19:21,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1029798.0, ans=0.1 2023-06-21 18:20:17,880 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 3.009e+02 3.594e+02 4.563e+02 7.134e+02, threshold=7.188e+02, percent-clipped=1.0 2023-06-21 18:20:30,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1030038.0, ans=0.0 2023-06-21 18:20:31,689 INFO [train.py:996] (0/4) Epoch 6, batch 19200, loss[loss=0.2843, simple_loss=0.4055, pruned_loss=0.08159, over 20690.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3284, pruned_loss=0.09121, over 4285944.04 frames. ], batch size: 607, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:20:44,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1030038.0, ans=0.2 2023-06-21 18:21:45,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1030278.0, ans=0.125 2023-06-21 18:22:00,352 INFO [train.py:996] (0/4) Epoch 6, batch 19250, loss[loss=0.2471, simple_loss=0.3298, pruned_loss=0.08215, over 21568.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3265, pruned_loss=0.08533, over 4289815.65 frames. 
], batch size: 471, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:22:17,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1030338.0, ans=0.125 2023-06-21 18:22:23,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1030398.0, ans=0.125 2023-06-21 18:22:46,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1030458.0, ans=0.1 2023-06-21 18:22:55,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-21 18:23:12,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1030578.0, ans=0.025 2023-06-21 18:23:25,542 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.761e+02 2.595e+02 3.039e+02 3.485e+02 4.997e+02, threshold=6.078e+02, percent-clipped=0.0 2023-06-21 18:23:37,855 INFO [train.py:996] (0/4) Epoch 6, batch 19300, loss[loss=0.2511, simple_loss=0.3279, pruned_loss=0.08715, over 21567.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3231, pruned_loss=0.0846, over 4289339.45 frames. ], batch size: 471, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:23:46,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=22.5 2023-06-21 18:24:07,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1030698.0, ans=0.125 2023-06-21 18:24:19,037 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.15 vs. limit=12.0 2023-06-21 18:24:34,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1030818.0, ans=0.0 2023-06-21 18:24:41,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1030818.0, ans=10.0 2023-06-21 18:25:17,125 INFO [train.py:996] (0/4) Epoch 6, batch 19350, loss[loss=0.2246, simple_loss=0.3166, pruned_loss=0.06625, over 21704.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3178, pruned_loss=0.0808, over 4290852.67 frames. ], batch size: 391, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:25:39,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5 2023-06-21 18:26:38,757 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.597e+02 3.186e+02 4.013e+02 6.947e+02, threshold=6.372e+02, percent-clipped=2.0 2023-06-21 18:26:39,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1031178.0, ans=0.125 2023-06-21 18:26:50,820 INFO [train.py:996] (0/4) Epoch 6, batch 19400, loss[loss=0.2146, simple_loss=0.2789, pruned_loss=0.07514, over 21674.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3162, pruned_loss=0.08053, over 4291923.27 frames. 
], batch size: 230, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:26:57,253 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:27:08,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-21 18:27:46,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1031418.0, ans=0.125 2023-06-21 18:28:23,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1031538.0, ans=0.125 2023-06-21 18:28:24,303 INFO [train.py:996] (0/4) Epoch 6, batch 19450, loss[loss=0.25, simple_loss=0.311, pruned_loss=0.09449, over 21962.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3143, pruned_loss=0.08272, over 4295608.91 frames. ], batch size: 119, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:28:38,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-21 18:28:40,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1031598.0, ans=0.125 2023-06-21 18:28:45,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1031598.0, ans=0.04949747468305833 2023-06-21 18:29:40,099 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.17 vs. limit=15.0 2023-06-21 18:29:51,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.868e+02 3.154e+02 3.556e+02 5.974e+02, threshold=6.308e+02, percent-clipped=0.0 2023-06-21 18:29:55,210 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-21 18:29:58,621 INFO [train.py:996] (0/4) Epoch 6, batch 19500, loss[loss=0.2923, simple_loss=0.3479, pruned_loss=0.1183, over 21384.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3102, pruned_loss=0.08409, over 4295953.87 frames. ], batch size: 507, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:30:10,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1031838.0, ans=0.2 2023-06-21 18:30:15,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1031898.0, ans=0.125 2023-06-21 18:30:16,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1031898.0, ans=0.0 2023-06-21 18:30:43,590 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-172000.pt 2023-06-21 18:31:19,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1032078.0, ans=0.125 2023-06-21 18:31:34,626 INFO [train.py:996] (0/4) Epoch 6, batch 19550, loss[loss=0.2294, simple_loss=0.3194, pruned_loss=0.06969, over 21508.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3066, pruned_loss=0.08237, over 4293227.16 frames. 
], batch size: 471, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:31:37,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.94 vs. limit=10.0 2023-06-21 18:32:26,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1032318.0, ans=0.1 2023-06-21 18:32:59,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1032378.0, ans=0.0 2023-06-21 18:33:00,137 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.941e+02 3.432e+02 4.314e+02 8.392e+02, threshold=6.865e+02, percent-clipped=2.0 2023-06-21 18:33:07,589 INFO [train.py:996] (0/4) Epoch 6, batch 19600, loss[loss=0.2359, simple_loss=0.2955, pruned_loss=0.08812, over 21226.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3075, pruned_loss=0.08297, over 4298927.53 frames. ], batch size: 159, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:33:24,824 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=12.0 2023-06-21 18:33:47,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1032558.0, ans=0.0 2023-06-21 18:33:55,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1032558.0, ans=0.0 2023-06-21 18:34:11,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-06-21 18:34:12,707 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-21 18:34:42,364 INFO [train.py:996] (0/4) Epoch 6, batch 19650, loss[loss=0.2743, simple_loss=0.3389, pruned_loss=0.1049, over 21699.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3137, pruned_loss=0.08786, over 4304590.94 frames. ], batch size: 298, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:34:45,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1032738.0, ans=0.125 2023-06-21 18:34:54,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1032738.0, ans=0.125 2023-06-21 18:35:13,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1032798.0, ans=0.1 2023-06-21 18:35:13,074 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:36:10,811 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.518e+02 3.307e+02 3.868e+02 4.631e+02 9.125e+02, threshold=7.736e+02, percent-clipped=5.0 2023-06-21 18:36:23,460 INFO [train.py:996] (0/4) Epoch 6, batch 19700, loss[loss=0.2211, simple_loss=0.2998, pruned_loss=0.0712, over 21584.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3166, pruned_loss=0.08823, over 4297515.34 frames. 
], batch size: 230, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:36:26,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1033038.0, ans=0.1 2023-06-21 18:36:34,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1033038.0, ans=0.2 2023-06-21 18:36:37,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1033098.0, ans=0.125 2023-06-21 18:37:21,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1033158.0, ans=0.125 2023-06-21 18:37:28,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1033218.0, ans=0.0 2023-06-21 18:37:45,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1033278.0, ans=0.125 2023-06-21 18:37:58,200 INFO [train.py:996] (0/4) Epoch 6, batch 19750, loss[loss=0.2374, simple_loss=0.3348, pruned_loss=0.07001, over 21617.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3261, pruned_loss=0.08936, over 4293594.35 frames. ], batch size: 230, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:38:25,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1033398.0, ans=0.0 2023-06-21 18:38:45,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1033458.0, ans=0.2 2023-06-21 18:39:02,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1033518.0, ans=0.125 2023-06-21 18:39:24,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 3.146e+02 3.443e+02 4.036e+02 6.788e+02, threshold=6.886e+02, percent-clipped=0.0 2023-06-21 18:39:31,811 INFO [train.py:996] (0/4) Epoch 6, batch 19800, loss[loss=0.2127, simple_loss=0.2949, pruned_loss=0.06524, over 21776.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3243, pruned_loss=0.08945, over 4293906.68 frames. ], batch size: 332, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:40:07,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1033698.0, ans=0.0 2023-06-21 18:40:44,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1033818.0, ans=0.125 2023-06-21 18:40:53,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1033878.0, ans=0.1 2023-06-21 18:40:55,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1033878.0, ans=0.1 2023-06-21 18:40:55,679 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-21 18:40:57,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-21 18:41:11,660 INFO [train.py:996] (0/4) Epoch 6, batch 19850, loss[loss=0.1459, simple_loss=0.1995, pruned_loss=0.04618, over 16321.00 frames. 
], tot_loss[loss=0.2428, simple_loss=0.3169, pruned_loss=0.0844, over 4276549.74 frames. ], batch size: 60, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:41:18,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.15 vs. limit=22.5 2023-06-21 18:41:33,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1033998.0, ans=0.125 2023-06-21 18:41:57,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1034058.0, ans=0.2 2023-06-21 18:42:03,920 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.05 vs. limit=10.0 2023-06-21 18:42:33,695 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.771e+02 3.274e+02 3.883e+02 8.276e+02, threshold=6.549e+02, percent-clipped=3.0 2023-06-21 18:42:42,575 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-21 18:42:44,466 INFO [train.py:996] (0/4) Epoch 6, batch 19900, loss[loss=0.206, simple_loss=0.2985, pruned_loss=0.05679, over 21456.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3193, pruned_loss=0.08238, over 4277430.69 frames. ], batch size: 211, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:43:04,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1034298.0, ans=0.2 2023-06-21 18:44:18,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1034538.0, ans=0.0 2023-06-21 18:44:19,352 INFO [train.py:996] (0/4) Epoch 6, batch 19950, loss[loss=0.2097, simple_loss=0.2913, pruned_loss=0.06402, over 21836.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3136, pruned_loss=0.08211, over 4271689.96 frames. ], batch size: 372, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:44:22,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1034538.0, ans=0.09899494936611666 2023-06-21 18:44:35,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1034538.0, ans=0.125 2023-06-21 18:44:42,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1034538.0, ans=0.0 2023-06-21 18:45:11,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5 2023-06-21 18:45:47,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.877e+02 3.269e+02 3.817e+02 6.552e+02, threshold=6.538e+02, percent-clipped=1.0 2023-06-21 18:45:53,429 INFO [train.py:996] (0/4) Epoch 6, batch 20000, loss[loss=0.2172, simple_loss=0.2906, pruned_loss=0.07191, over 21356.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3162, pruned_loss=0.08345, over 4265113.13 frames. 
], batch size: 144, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:45:55,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1034838.0, ans=0.0 2023-06-21 18:46:52,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1035018.0, ans=0.125 2023-06-21 18:47:04,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1035018.0, ans=0.0 2023-06-21 18:47:26,473 INFO [train.py:996] (0/4) Epoch 6, batch 20050, loss[loss=0.2588, simple_loss=0.3225, pruned_loss=0.09761, over 21882.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3177, pruned_loss=0.08586, over 4275149.97 frames. ], batch size: 118, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:47:50,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1035198.0, ans=0.1 2023-06-21 18:47:52,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1035198.0, ans=0.125 2023-06-21 18:48:01,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1035198.0, ans=0.125 2023-06-21 18:48:02,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1035198.0, ans=0.2 2023-06-21 18:48:13,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1035258.0, ans=0.2 2023-06-21 18:48:50,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1035378.0, ans=0.0 2023-06-21 18:48:54,580 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.295e+02 2.772e+02 3.158e+02 3.739e+02 6.638e+02, threshold=6.316e+02, percent-clipped=1.0 2023-06-21 18:48:56,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1035378.0, ans=0.125 2023-06-21 18:49:00,951 INFO [train.py:996] (0/4) Epoch 6, batch 20100, loss[loss=0.2189, simple_loss=0.2811, pruned_loss=0.07836, over 21256.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3199, pruned_loss=0.08857, over 4274574.67 frames. ], batch size: 608, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:49:31,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1035498.0, ans=0.0 2023-06-21 18:49:34,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1035498.0, ans=0.125 2023-06-21 18:50:39,809 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:50:44,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1035738.0, ans=0.125 2023-06-21 18:50:45,519 INFO [train.py:996] (0/4) Epoch 6, batch 20150, loss[loss=0.2824, simple_loss=0.3511, pruned_loss=0.1069, over 21355.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3285, pruned_loss=0.0918, over 4275506.14 frames. 
], batch size: 159, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:51:58,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1035918.0, ans=0.125 2023-06-21 18:52:10,080 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2023-06-21 18:52:10,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1035978.0, ans=0.1 2023-06-21 18:52:17,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1035978.0, ans=0.2 2023-06-21 18:52:17,966 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.643e+02 3.447e+02 4.089e+02 4.717e+02 8.481e+02, threshold=8.179e+02, percent-clipped=7.0 2023-06-21 18:52:22,528 INFO [train.py:996] (0/4) Epoch 6, batch 20200, loss[loss=0.2497, simple_loss=0.3305, pruned_loss=0.08445, over 21765.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3343, pruned_loss=0.09438, over 4271189.68 frames. ], batch size: 282, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 18:52:44,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1036098.0, ans=0.125 2023-06-21 18:54:02,159 INFO [train.py:996] (0/4) Epoch 6, batch 20250, loss[loss=0.2412, simple_loss=0.317, pruned_loss=0.08272, over 21385.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3349, pruned_loss=0.09351, over 4279358.73 frames. ], batch size: 176, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 18:54:18,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-21 18:54:32,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1036398.0, ans=0.125 2023-06-21 18:54:59,712 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.57 vs. limit=15.0 2023-06-21 18:55:22,005 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.562e+02 2.955e+02 3.343e+02 6.024e+02, threshold=5.910e+02, percent-clipped=0.0 2023-06-21 18:55:31,048 INFO [train.py:996] (0/4) Epoch 6, batch 20300, loss[loss=0.209, simple_loss=0.2803, pruned_loss=0.06881, over 21909.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3323, pruned_loss=0.09018, over 4282187.53 frames. ], batch size: 98, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 18:55:50,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1036698.0, ans=0.0 2023-06-21 18:56:01,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1036698.0, ans=0.05 2023-06-21 18:56:02,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1036698.0, ans=0.1 2023-06-21 18:56:04,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1036758.0, ans=0.035 2023-06-21 18:56:10,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.21 vs. 
limit=12.0 2023-06-21 18:56:36,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1036818.0, ans=0.2 2023-06-21 18:56:59,750 INFO [train.py:996] (0/4) Epoch 6, batch 20350, loss[loss=0.256, simple_loss=0.327, pruned_loss=0.09245, over 21463.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3326, pruned_loss=0.09067, over 4279908.85 frames. ], batch size: 548, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 18:57:19,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1036998.0, ans=0.1 2023-06-21 18:57:19,780 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.54 vs. limit=10.0 2023-06-21 18:57:21,402 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2023-06-21 18:57:22,362 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:57:43,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1037058.0, ans=0.0 2023-06-21 18:57:50,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1037058.0, ans=15.0 2023-06-21 18:58:09,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1037118.0, ans=0.125 2023-06-21 18:58:12,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1037118.0, ans=0.1 2023-06-21 18:58:16,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1037178.0, ans=0.0 2023-06-21 18:58:34,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 2.934e+02 3.391e+02 4.276e+02 6.956e+02, threshold=6.781e+02, percent-clipped=4.0 2023-06-21 18:58:37,440 INFO [train.py:996] (0/4) Epoch 6, batch 20400, loss[loss=0.2682, simple_loss=0.3348, pruned_loss=0.1007, over 21852.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3344, pruned_loss=0.09326, over 4271400.61 frames. ], batch size: 118, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 18:58:37,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1037238.0, ans=0.1 2023-06-21 18:58:42,899 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-21 18:59:21,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1037358.0, ans=0.125 2023-06-21 18:59:24,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1037358.0, ans=0.1 2023-06-21 18:59:38,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-21 19:00:05,791 INFO [train.py:996] (0/4) Epoch 6, batch 20450, loss[loss=0.278, simple_loss=0.3384, pruned_loss=0.1088, over 21575.00 frames. 
], tot_loss[loss=0.2629, simple_loss=0.335, pruned_loss=0.09543, over 4259690.70 frames. ], batch size: 548, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:00:09,033 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:00:46,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-21 19:01:36,786 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.469e+02 3.567e+02 4.308e+02 5.221e+02 9.242e+02, threshold=8.616e+02, percent-clipped=7.0 2023-06-21 19:01:39,725 INFO [train.py:996] (0/4) Epoch 6, batch 20500, loss[loss=0.2003, simple_loss=0.2722, pruned_loss=0.06422, over 16687.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3298, pruned_loss=0.09497, over 4265679.84 frames. ], batch size: 62, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:01:42,350 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-21 19:01:43,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1037838.0, ans=0.0 2023-06-21 19:02:10,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1037898.0, ans=0.125 2023-06-21 19:02:28,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1038018.0, ans=0.2 2023-06-21 19:02:48,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1038018.0, ans=0.1 2023-06-21 19:03:10,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1038078.0, ans=0.125 2023-06-21 19:03:14,147 INFO [train.py:996] (0/4) Epoch 6, batch 20550, loss[loss=0.3324, simple_loss=0.3902, pruned_loss=0.1374, over 21409.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3237, pruned_loss=0.09412, over 4259369.84 frames. ], batch size: 508, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:03:18,116 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=22.5 2023-06-21 19:03:25,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1038138.0, ans=0.125 2023-06-21 19:03:55,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1038258.0, ans=0.09899494936611666 2023-06-21 19:04:07,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1038318.0, ans=0.0 2023-06-21 19:04:45,838 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.022e+02 3.769e+02 4.529e+02 7.328e+02, threshold=7.538e+02, percent-clipped=0.0 2023-06-21 19:04:46,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1038378.0, ans=0.125 2023-06-21 19:04:48,966 INFO [train.py:996] (0/4) Epoch 6, batch 20600, loss[loss=0.313, simple_loss=0.3694, pruned_loss=0.1283, over 21537.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3286, pruned_loss=0.09251, over 4249521.28 frames. 
], batch size: 471, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:04:54,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1038438.0, ans=0.2 2023-06-21 19:05:33,259 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:06:20,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1038738.0, ans=0.04949747468305833 2023-06-21 19:06:21,570 INFO [train.py:996] (0/4) Epoch 6, batch 20650, loss[loss=0.2239, simple_loss=0.2889, pruned_loss=0.07948, over 21714.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3219, pruned_loss=0.09138, over 4241116.49 frames. ], batch size: 415, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 19:06:53,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1038798.0, ans=0.5 2023-06-21 19:07:24,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1038918.0, ans=0.125 2023-06-21 19:07:36,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1038978.0, ans=0.2 2023-06-21 19:07:55,778 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.764e+02 3.114e+02 3.548e+02 5.059e+02, threshold=6.228e+02, percent-clipped=0.0 2023-06-21 19:07:57,289 INFO [train.py:996] (0/4) Epoch 6, batch 20700, loss[loss=0.1765, simple_loss=0.2517, pruned_loss=0.05062, over 20766.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3132, pruned_loss=0.08767, over 4248474.14 frames. ], batch size: 608, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 19:07:59,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1039038.0, ans=0.0 2023-06-21 19:08:20,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1039098.0, ans=0.0 2023-06-21 19:08:39,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1039158.0, ans=0.07 2023-06-21 19:09:36,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1039338.0, ans=0.05 2023-06-21 19:09:36,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1039338.0, ans=0.1 2023-06-21 19:09:37,897 INFO [train.py:996] (0/4) Epoch 6, batch 20750, loss[loss=0.2081, simple_loss=0.2775, pruned_loss=0.06935, over 21446.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3151, pruned_loss=0.08661, over 4257762.97 frames. 
], batch size: 211, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 19:09:56,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1039398.0, ans=0.2 2023-06-21 19:10:17,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1039458.0, ans=0.1 2023-06-21 19:10:23,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1039458.0, ans=0.125 2023-06-21 19:10:38,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1039518.0, ans=0.125 2023-06-21 19:10:53,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1039518.0, ans=0.0 2023-06-21 19:11:11,334 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.333e+02 3.291e+02 4.399e+02 5.919e+02 1.160e+03, threshold=8.798e+02, percent-clipped=22.0 2023-06-21 19:11:12,894 INFO [train.py:996] (0/4) Epoch 6, batch 20800, loss[loss=0.239, simple_loss=0.3047, pruned_loss=0.08664, over 21746.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3198, pruned_loss=0.08789, over 4263600.34 frames. ], batch size: 351, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:11:41,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1039698.0, ans=0.0 2023-06-21 19:12:29,535 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:12:41,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1039878.0, ans=0.125 2023-06-21 19:12:45,689 INFO [train.py:996] (0/4) Epoch 6, batch 20850, loss[loss=0.222, simple_loss=0.2834, pruned_loss=0.08032, over 21543.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.314, pruned_loss=0.0866, over 4265947.95 frames. ], batch size: 212, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:13:01,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.90 vs. limit=15.0 2023-06-21 19:14:13,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1040178.0, ans=0.125 2023-06-21 19:14:17,410 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.782e+02 3.461e+02 4.341e+02 9.177e+02, threshold=6.922e+02, percent-clipped=2.0 2023-06-21 19:14:18,813 INFO [train.py:996] (0/4) Epoch 6, batch 20900, loss[loss=0.2462, simple_loss=0.32, pruned_loss=0.08616, over 21548.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3154, pruned_loss=0.08799, over 4268807.02 frames. ], batch size: 195, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:14:29,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1040238.0, ans=10.0 2023-06-21 19:15:05,981 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2023-06-21 19:15:45,731 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.36 vs. 
limit=8.0 2023-06-21 19:15:51,590 INFO [train.py:996] (0/4) Epoch 6, batch 20950, loss[loss=0.201, simple_loss=0.2725, pruned_loss=0.06478, over 21589.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3102, pruned_loss=0.08421, over 4270313.87 frames. ], batch size: 230, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:16:03,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1040538.0, ans=0.125 2023-06-21 19:16:12,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1040598.0, ans=0.0 2023-06-21 19:16:13,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1040598.0, ans=0.2 2023-06-21 19:16:22,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1040658.0, ans=0.2 2023-06-21 19:16:35,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1040658.0, ans=0.125 2023-06-21 19:16:46,024 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-06-21 19:16:55,546 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:17:17,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.529e+02 2.877e+02 3.319e+02 6.338e+02, threshold=5.753e+02, percent-clipped=0.0 2023-06-21 19:17:19,465 INFO [train.py:996] (0/4) Epoch 6, batch 21000, loss[loss=0.2109, simple_loss=0.2757, pruned_loss=0.07304, over 21386.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3079, pruned_loss=0.08443, over 4275394.78 frames. ], batch size: 176, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:17:19,465 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 19:17:35,753 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2688, simple_loss=0.3681, pruned_loss=0.08473, over 1796401.00 frames. 2023-06-21 19:17:35,754 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 19:17:41,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.23 vs. limit=15.0 2023-06-21 19:17:45,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1040838.0, ans=0.1 2023-06-21 19:18:09,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1040898.0, ans=0.1 2023-06-21 19:18:12,400 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0 2023-06-21 19:18:57,394 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-21 19:19:06,350 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.95 vs. limit=5.0 2023-06-21 19:19:08,179 INFO [train.py:996] (0/4) Epoch 6, batch 21050, loss[loss=0.228, simple_loss=0.2868, pruned_loss=0.08455, over 22004.00 frames. 
], tot_loss[loss=0.2371, simple_loss=0.3057, pruned_loss=0.08425, over 4268773.19 frames. ], batch size: 103, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:19:51,352 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-21 19:19:55,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1041258.0, ans=0.0 2023-06-21 19:20:07,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1041318.0, ans=0.0 2023-06-21 19:20:34,870 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.795e+02 3.116e+02 3.832e+02 6.783e+02, threshold=6.232e+02, percent-clipped=3.0 2023-06-21 19:20:36,443 INFO [train.py:996] (0/4) Epoch 6, batch 21100, loss[loss=0.2178, simple_loss=0.2778, pruned_loss=0.07887, over 21245.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3025, pruned_loss=0.08427, over 4272183.43 frames. ], batch size: 176, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:21:19,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1041558.0, ans=0.125 2023-06-21 19:22:04,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0 2023-06-21 19:22:10,072 INFO [train.py:996] (0/4) Epoch 6, batch 21150, loss[loss=0.2446, simple_loss=0.3021, pruned_loss=0.09357, over 21844.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.2994, pruned_loss=0.08447, over 4269435.28 frames. ], batch size: 107, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:23:41,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.856e+02 3.274e+02 4.026e+02 6.885e+02, threshold=6.548e+02, percent-clipped=2.0 2023-06-21 19:23:43,139 INFO [train.py:996] (0/4) Epoch 6, batch 21200, loss[loss=0.1875, simple_loss=0.256, pruned_loss=0.05952, over 21463.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2964, pruned_loss=0.08337, over 4270371.84 frames. ], batch size: 131, lr: 5.01e-03, grad_scale: 32.0 2023-06-21 19:24:21,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1042158.0, ans=0.125 2023-06-21 19:25:08,285 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:25:12,314 INFO [train.py:996] (0/4) Epoch 6, batch 21250, loss[loss=0.2298, simple_loss=0.2943, pruned_loss=0.08262, over 21629.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2948, pruned_loss=0.08325, over 4258130.51 frames. ], batch size: 263, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:25:45,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1042458.0, ans=0.1 2023-06-21 19:26:41,155 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.008e+02 3.484e+02 4.132e+02 8.300e+02, threshold=6.969e+02, percent-clipped=3.0 2023-06-21 19:26:41,176 INFO [train.py:996] (0/4) Epoch 6, batch 21300, loss[loss=0.2322, simple_loss=0.308, pruned_loss=0.07818, over 21697.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3024, pruned_loss=0.08608, over 4258756.42 frames. 
], batch size: 247, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:27:31,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1042758.0, ans=0.1 2023-06-21 19:28:14,539 INFO [train.py:996] (0/4) Epoch 6, batch 21350, loss[loss=0.3084, simple_loss=0.3906, pruned_loss=0.1131, over 19822.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3051, pruned_loss=0.08553, over 4265188.61 frames. ], batch size: 704, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:28:19,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1042938.0, ans=0.125 2023-06-21 19:28:48,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1042998.0, ans=0.015 2023-06-21 19:29:14,730 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:29:38,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1043178.0, ans=0.0 2023-06-21 19:29:40,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1043178.0, ans=0.125 2023-06-21 19:29:49,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.778e+02 3.087e+02 3.779e+02 5.135e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-21 19:29:49,121 INFO [train.py:996] (0/4) Epoch 6, batch 21400, loss[loss=0.3154, simple_loss=0.374, pruned_loss=0.1284, over 21450.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3085, pruned_loss=0.08564, over 4270836.64 frames. ], batch size: 471, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:30:29,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1043358.0, ans=0.125 2023-06-21 19:31:22,541 INFO [train.py:996] (0/4) Epoch 6, batch 21450, loss[loss=0.2454, simple_loss=0.3126, pruned_loss=0.08915, over 21873.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3115, pruned_loss=0.08659, over 4276331.72 frames. ], batch size: 371, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:31:36,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-06-21 19:31:57,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1043598.0, ans=0.2 2023-06-21 19:32:55,671 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 2.827e+02 3.165e+02 3.718e+02 5.694e+02, threshold=6.329e+02, percent-clipped=0.0 2023-06-21 19:32:55,702 INFO [train.py:996] (0/4) Epoch 6, batch 21500, loss[loss=0.2211, simple_loss=0.2789, pruned_loss=0.08164, over 21241.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.31, pruned_loss=0.08766, over 4269101.46 frames. ], batch size: 143, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:33:15,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. 
limit=10.0 2023-06-21 19:33:24,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1043898.0, ans=0.07 2023-06-21 19:33:43,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1043958.0, ans=0.0 2023-06-21 19:34:12,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1044018.0, ans=0.1 2023-06-21 19:34:29,732 INFO [train.py:996] (0/4) Epoch 6, batch 21550, loss[loss=0.2259, simple_loss=0.2852, pruned_loss=0.08332, over 21597.00 frames. ], tot_loss[loss=0.236, simple_loss=0.303, pruned_loss=0.08452, over 4268018.86 frames. ], batch size: 391, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:34:46,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1044138.0, ans=0.0 2023-06-21 19:34:51,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1044198.0, ans=0.125 2023-06-21 19:34:59,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1044198.0, ans=0.125 2023-06-21 19:35:01,772 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=15.0 2023-06-21 19:35:02,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1044198.0, ans=0.0 2023-06-21 19:35:16,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1044258.0, ans=0.0 2023-06-21 19:35:41,006 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-21 19:36:04,913 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.859e+02 3.362e+02 4.302e+02 8.120e+02, threshold=6.725e+02, percent-clipped=2.0 2023-06-21 19:36:04,934 INFO [train.py:996] (0/4) Epoch 6, batch 21600, loss[loss=0.1912, simple_loss=0.2478, pruned_loss=0.06729, over 20749.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.2978, pruned_loss=0.08333, over 4272037.56 frames. ], batch size: 608, lr: 5.00e-03, grad_scale: 32.0 2023-06-21 19:36:27,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1044438.0, ans=0.1 2023-06-21 19:36:39,605 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:37:07,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1044558.0, ans=0.125 2023-06-21 19:37:26,936 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5 2023-06-21 19:37:29,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.67 vs. 
limit=6.0 2023-06-21 19:37:33,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1044678.0, ans=0.125 2023-06-21 19:37:39,261 INFO [train.py:996] (0/4) Epoch 6, batch 21650, loss[loss=0.2114, simple_loss=0.2939, pruned_loss=0.06448, over 21849.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3025, pruned_loss=0.08137, over 4270927.49 frames. ], batch size: 118, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:38:21,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1044798.0, ans=0.125 2023-06-21 19:39:06,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1044978.0, ans=0.1 2023-06-21 19:39:13,285 INFO [train.py:996] (0/4) Epoch 6, batch 21700, loss[loss=0.1941, simple_loss=0.2716, pruned_loss=0.05832, over 21497.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3027, pruned_loss=0.07917, over 4267593.22 frames. ], batch size: 211, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:39:14,697 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.795e+02 3.316e+02 4.085e+02 7.380e+02, threshold=6.631e+02, percent-clipped=1.0 2023-06-21 19:39:56,705 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-21 19:40:00,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1045158.0, ans=0.0 2023-06-21 19:40:34,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1045278.0, ans=0.2 2023-06-21 19:40:44,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1045338.0, ans=0.0 2023-06-21 19:40:45,996 INFO [train.py:996] (0/4) Epoch 6, batch 21750, loss[loss=0.2349, simple_loss=0.2858, pruned_loss=0.09198, over 21211.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2989, pruned_loss=0.07954, over 4256074.97 frames. ], batch size: 471, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:41:03,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1045338.0, ans=0.1 2023-06-21 19:41:18,510 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=12.0 2023-06-21 19:41:29,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.99 vs. limit=15.0 2023-06-21 19:41:36,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1045458.0, ans=0.125 2023-06-21 19:41:43,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1045458.0, ans=0.0 2023-06-21 19:41:52,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1045518.0, ans=0.125 2023-06-21 19:42:19,843 INFO [train.py:996] (0/4) Epoch 6, batch 21800, loss[loss=0.2367, simple_loss=0.3021, pruned_loss=0.08567, over 21483.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.2978, pruned_loss=0.08061, over 4255712.48 frames. 
], batch size: 212, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:42:21,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.706e+02 3.025e+02 3.711e+02 5.699e+02, threshold=6.051e+02, percent-clipped=0.0 2023-06-21 19:42:26,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1045638.0, ans=0.04949747468305833 2023-06-21 19:42:41,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-21 19:43:01,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1045698.0, ans=0.1 2023-06-21 19:43:53,881 INFO [train.py:996] (0/4) Epoch 6, batch 21850, loss[loss=0.2223, simple_loss=0.2996, pruned_loss=0.07248, over 21448.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.304, pruned_loss=0.08157, over 4253881.31 frames. ], batch size: 211, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:44:11,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1045938.0, ans=0.05 2023-06-21 19:44:13,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-21 19:44:14,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1045938.0, ans=0.07 2023-06-21 19:45:13,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1046178.0, ans=0.05 2023-06-21 19:45:21,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1046178.0, ans=0.0 2023-06-21 19:45:26,641 INFO [train.py:996] (0/4) Epoch 6, batch 21900, loss[loss=0.2408, simple_loss=0.2966, pruned_loss=0.09244, over 21659.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3068, pruned_loss=0.08286, over 4258047.67 frames. ], batch size: 282, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:45:28,146 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 2.966e+02 3.405e+02 4.071e+02 6.584e+02, threshold=6.811e+02, percent-clipped=2.0 2023-06-21 19:46:29,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1046418.0, ans=0.2 2023-06-21 19:46:30,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1046418.0, ans=0.1 2023-06-21 19:46:41,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1046478.0, ans=0.0 2023-06-21 19:47:00,490 INFO [train.py:996] (0/4) Epoch 6, batch 21950, loss[loss=0.1918, simple_loss=0.2612, pruned_loss=0.06122, over 21576.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3014, pruned_loss=0.08216, over 4261422.65 frames. 
], batch size: 298, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:47:56,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1046658.0, ans=0.125 2023-06-21 19:48:00,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1046718.0, ans=0.2 2023-06-21 19:48:04,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-21 19:48:10,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1046718.0, ans=0.0 2023-06-21 19:48:34,344 INFO [train.py:996] (0/4) Epoch 6, batch 22000, loss[loss=0.1867, simple_loss=0.2564, pruned_loss=0.05846, over 21485.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2938, pruned_loss=0.07834, over 4261361.51 frames. ], batch size: 230, lr: 5.00e-03, grad_scale: 32.0 2023-06-21 19:48:40,680 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.414e+02 2.927e+02 3.631e+02 6.492e+02, threshold=5.855e+02, percent-clipped=0.0 2023-06-21 19:49:22,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1046958.0, ans=0.1 2023-06-21 19:49:27,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1046958.0, ans=0.2 2023-06-21 19:49:36,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1047018.0, ans=0.125 2023-06-21 19:50:14,062 INFO [train.py:996] (0/4) Epoch 6, batch 22050, loss[loss=0.3999, simple_loss=0.4438, pruned_loss=0.178, over 21407.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2991, pruned_loss=0.08047, over 4257239.82 frames. ], batch size: 507, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:50:25,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.64 vs. limit=15.0 2023-06-21 19:50:26,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1047138.0, ans=0.07 2023-06-21 19:50:46,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1047198.0, ans=0.0 2023-06-21 19:50:52,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1047258.0, ans=0.0 2023-06-21 19:51:48,071 INFO [train.py:996] (0/4) Epoch 6, batch 22100, loss[loss=0.2651, simple_loss=0.3378, pruned_loss=0.09621, over 21764.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3101, pruned_loss=0.08506, over 4249752.29 frames. 
], batch size: 247, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:51:48,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1047438.0, ans=0.0 2023-06-21 19:51:51,162 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 3.410e+02 3.908e+02 4.704e+02 7.568e+02, threshold=7.817e+02, percent-clipped=7.0 2023-06-21 19:51:59,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1047438.0, ans=0.2 2023-06-21 19:52:27,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1047558.0, ans=0.0 2023-06-21 19:53:12,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-21 19:53:15,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1047678.0, ans=0.07 2023-06-21 19:53:22,052 INFO [train.py:996] (0/4) Epoch 6, batch 22150, loss[loss=0.2322, simple_loss=0.307, pruned_loss=0.07867, over 21862.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3133, pruned_loss=0.08701, over 4254674.33 frames. ], batch size: 371, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:53:51,667 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=22.5 2023-06-21 19:54:06,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1047858.0, ans=0.0 2023-06-21 19:54:21,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1047918.0, ans=0.125 2023-06-21 19:54:49,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1047978.0, ans=0.125 2023-06-21 19:54:50,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1047978.0, ans=0.125 2023-06-21 19:55:00,801 INFO [train.py:996] (0/4) Epoch 6, batch 22200, loss[loss=0.2249, simple_loss=0.289, pruned_loss=0.08043, over 21225.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3157, pruned_loss=0.0889, over 4267246.69 frames. ], batch size: 608, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:55:08,464 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.486e+02 3.160e+02 3.693e+02 4.495e+02 7.335e+02, threshold=7.385e+02, percent-clipped=0.0 2023-06-21 19:55:13,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1048038.0, ans=0.125 2023-06-21 19:55:17,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1048038.0, ans=0.2 2023-06-21 19:55:31,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1048098.0, ans=0.07 2023-06-21 19:55:44,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1048158.0, ans=0.2 2023-06-21 19:56:34,376 INFO [train.py:996] (0/4) Epoch 6, batch 22250, loss[loss=0.2758, simple_loss=0.3381, pruned_loss=0.1067, over 21499.00 frames. 
], tot_loss[loss=0.2507, simple_loss=0.322, pruned_loss=0.08976, over 4265233.26 frames. ], batch size: 211, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:56:34,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1048338.0, ans=0.1 2023-06-21 19:57:06,799 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.18 vs. limit=10.0 2023-06-21 19:57:18,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1048518.0, ans=0.125 2023-06-21 19:57:20,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1048518.0, ans=0.0 2023-06-21 19:57:47,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1048578.0, ans=0.2 2023-06-21 19:57:58,005 INFO [train.py:996] (0/4) Epoch 6, batch 22300, loss[loss=0.2879, simple_loss=0.3465, pruned_loss=0.1146, over 21830.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3255, pruned_loss=0.09291, over 4271522.60 frames. ], batch size: 414, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 19:58:05,486 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.426e+02 3.056e+02 3.498e+02 3.964e+02 6.113e+02, threshold=6.996e+02, percent-clipped=0.0 2023-06-21 19:58:25,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1048698.0, ans=0.125 2023-06-21 19:58:27,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.74 vs. limit=15.0 2023-06-21 19:58:34,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1048758.0, ans=10.0 2023-06-21 19:58:41,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1048758.0, ans=0.125 2023-06-21 19:58:51,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1048818.0, ans=0.0 2023-06-21 19:59:14,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1048878.0, ans=0.0 2023-06-21 19:59:25,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1048878.0, ans=0.07 2023-06-21 19:59:27,171 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=22.5 2023-06-21 19:59:27,788 INFO [train.py:996] (0/4) Epoch 6, batch 22350, loss[loss=0.2807, simple_loss=0.3359, pruned_loss=0.1128, over 21847.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3237, pruned_loss=0.09427, over 4282968.66 frames. 
], batch size: 371, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 19:59:34,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1048938.0, ans=0.0 2023-06-21 20:00:35,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1049178.0, ans=0.07 2023-06-21 20:01:01,242 INFO [train.py:996] (0/4) Epoch 6, batch 22400, loss[loss=0.2105, simple_loss=0.3041, pruned_loss=0.05843, over 20961.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3197, pruned_loss=0.09071, over 4273393.24 frames. ], batch size: 608, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:01:04,244 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.868e+02 3.552e+02 4.171e+02 5.869e+02, threshold=7.104e+02, percent-clipped=0.0 2023-06-21 20:01:22,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1049298.0, ans=0.2 2023-06-21 20:02:34,803 INFO [train.py:996] (0/4) Epoch 6, batch 22450, loss[loss=0.2028, simple_loss=0.2602, pruned_loss=0.07271, over 21585.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3136, pruned_loss=0.08874, over 4268189.93 frames. ], batch size: 247, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:02:55,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-21 20:03:00,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1049598.0, ans=0.0 2023-06-21 20:03:03,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1049658.0, ans=0.0 2023-06-21 20:03:37,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1049718.0, ans=0.07 2023-06-21 20:03:38,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1049718.0, ans=0.125 2023-06-21 20:03:41,384 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.58 vs. limit=8.0 2023-06-21 20:04:08,821 INFO [train.py:996] (0/4) Epoch 6, batch 22500, loss[loss=0.2136, simple_loss=0.2749, pruned_loss=0.07613, over 21554.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3096, pruned_loss=0.08851, over 4268938.43 frames. 
], batch size: 230, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:04:11,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 2.833e+02 3.380e+02 4.088e+02 7.765e+02, threshold=6.760e+02, percent-clipped=2.0 2023-06-21 20:04:50,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1049958.0, ans=0.125 2023-06-21 20:04:50,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1049958.0, ans=0.0 2023-06-21 20:04:51,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1049958.0, ans=0.09899494936611666 2023-06-21 20:05:05,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1050018.0, ans=0.2 2023-06-21 20:05:12,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1050018.0, ans=0.07 2023-06-21 20:05:29,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1050078.0, ans=0.125 2023-06-21 20:05:42,966 INFO [train.py:996] (0/4) Epoch 6, batch 22550, loss[loss=0.2031, simple_loss=0.2766, pruned_loss=0.06477, over 16526.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3137, pruned_loss=0.08811, over 4273445.54 frames. ], batch size: 60, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:06:06,091 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.91 vs. limit=15.0 2023-06-21 20:06:11,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1050198.0, ans=0.125 2023-06-21 20:06:40,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1050258.0, ans=0.04949747468305833 2023-06-21 20:06:56,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1050318.0, ans=0.125 2023-06-21 20:07:18,463 INFO [train.py:996] (0/4) Epoch 6, batch 22600, loss[loss=0.2419, simple_loss=0.326, pruned_loss=0.07894, over 21886.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3167, pruned_loss=0.08857, over 4283671.97 frames. ], batch size: 372, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:07:21,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.122e+02 3.804e+02 4.633e+02 7.875e+02, threshold=7.609e+02, percent-clipped=4.0 2023-06-21 20:07:21,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1050438.0, ans=0.0 2023-06-21 20:07:33,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1050498.0, ans=0.1 2023-06-21 20:07:47,737 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.15 vs. 
limit=15.0 2023-06-21 20:08:31,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1050678.0, ans=0.125 2023-06-21 20:08:41,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1050678.0, ans=0.125 2023-06-21 20:08:47,121 INFO [train.py:996] (0/4) Epoch 6, batch 22650, loss[loss=0.254, simple_loss=0.3228, pruned_loss=0.09263, over 15203.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3134, pruned_loss=0.08789, over 4265750.95 frames. ], batch size: 60, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 20:09:00,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1050798.0, ans=0.2 2023-06-21 20:09:11,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1050798.0, ans=0.125 2023-06-21 20:10:06,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1050978.0, ans=0.1 2023-06-21 20:10:19,383 INFO [train.py:996] (0/4) Epoch 6, batch 22700, loss[loss=0.205, simple_loss=0.2715, pruned_loss=0.06925, over 21808.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3067, pruned_loss=0.0868, over 4255369.67 frames. ], batch size: 102, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 20:10:23,805 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.096e+02 3.667e+02 4.331e+02 7.482e+02, threshold=7.334e+02, percent-clipped=0.0 2023-06-21 20:10:56,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1051098.0, ans=0.1 2023-06-21 20:11:01,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1051158.0, ans=0.125 2023-06-21 20:11:04,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1051158.0, ans=0.0 2023-06-21 20:11:31,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1051218.0, ans=0.0 2023-06-21 20:11:40,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1051278.0, ans=0.0 2023-06-21 20:11:52,978 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-06-21 20:11:53,622 INFO [train.py:996] (0/4) Epoch 6, batch 22750, loss[loss=0.2778, simple_loss=0.3417, pruned_loss=0.1069, over 20696.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3076, pruned_loss=0.08843, over 4260369.18 frames. ], batch size: 607, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 20:11:54,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1051338.0, ans=0.125 2023-06-21 20:11:58,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1051338.0, ans=0.1 2023-06-21 20:12:03,654 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.76 vs. 
limit=22.5 2023-06-21 20:12:34,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1051458.0, ans=0.125 2023-06-21 20:13:25,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1051638.0, ans=0.125 2023-06-21 20:13:26,412 INFO [train.py:996] (0/4) Epoch 6, batch 22800, loss[loss=0.2276, simple_loss=0.2995, pruned_loss=0.07787, over 21413.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3119, pruned_loss=0.091, over 4267822.40 frames. ], batch size: 159, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:13:30,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.282e+02 2.968e+02 3.368e+02 3.965e+02 6.508e+02, threshold=6.737e+02, percent-clipped=0.0 2023-06-21 20:13:48,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1051698.0, ans=0.0 2023-06-21 20:13:50,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1051698.0, ans=0.1 2023-06-21 20:14:13,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1051758.0, ans=0.125 2023-06-21 20:14:59,380 INFO [train.py:996] (0/4) Epoch 6, batch 22850, loss[loss=0.2336, simple_loss=0.2859, pruned_loss=0.09067, over 21740.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3075, pruned_loss=0.09025, over 4264420.42 frames. ], batch size: 283, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:16:34,413 INFO [train.py:996] (0/4) Epoch 6, batch 22900, loss[loss=0.2067, simple_loss=0.2639, pruned_loss=0.07473, over 21835.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3097, pruned_loss=0.08922, over 4247769.10 frames. ], batch size: 107, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:16:38,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1052238.0, ans=0.05 2023-06-21 20:16:39,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.845e+02 3.273e+02 3.939e+02 6.144e+02, threshold=6.547e+02, percent-clipped=0.0 2023-06-21 20:16:58,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1052298.0, ans=0.04949747468305833 2023-06-21 20:17:05,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1052298.0, ans=0.04949747468305833 2023-06-21 20:17:15,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1052298.0, ans=0.2 2023-06-21 20:17:23,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-21 20:17:35,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1052358.0, ans=0.125 2023-06-21 20:17:36,588 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.33 vs. limit=15.0 2023-06-21 20:18:15,286 INFO [train.py:996] (0/4) Epoch 6, batch 22950, loss[loss=0.2433, simple_loss=0.3595, pruned_loss=0.06354, over 21782.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3231, pruned_loss=0.0874, over 4253178.13 frames. 
], batch size: 316, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:18:52,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1052598.0, ans=0.125 2023-06-21 20:18:56,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1052598.0, ans=0.125 2023-06-21 20:19:01,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1052658.0, ans=0.125 2023-06-21 20:19:07,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1052658.0, ans=0.125 2023-06-21 20:19:11,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1052658.0, ans=0.0 2023-06-21 20:19:17,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1052718.0, ans=0.125 2023-06-21 20:19:20,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1052718.0, ans=0.125 2023-06-21 20:19:49,048 INFO [train.py:996] (0/4) Epoch 6, batch 23000, loss[loss=0.2588, simple_loss=0.3155, pruned_loss=0.101, over 21246.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3221, pruned_loss=0.0858, over 4261640.93 frames. ], batch size: 143, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:19:53,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.747e+02 3.155e+02 3.821e+02 7.452e+02, threshold=6.310e+02, percent-clipped=2.0 2023-06-21 20:20:22,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1052898.0, ans=0.125 2023-06-21 20:20:57,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1053018.0, ans=0.0 2023-06-21 20:21:29,286 INFO [train.py:996] (0/4) Epoch 6, batch 23050, loss[loss=0.3041, simple_loss=0.3652, pruned_loss=0.1215, over 21790.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3247, pruned_loss=0.0891, over 4263973.20 frames. ], batch size: 441, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:21:43,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1053138.0, ans=0.0 2023-06-21 20:22:18,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1053318.0, ans=0.1 2023-06-21 20:22:21,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1053318.0, ans=0.125 2023-06-21 20:22:42,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1053378.0, ans=0.5 2023-06-21 20:23:02,819 INFO [train.py:996] (0/4) Epoch 6, batch 23100, loss[loss=0.239, simple_loss=0.2991, pruned_loss=0.08951, over 21784.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.32, pruned_loss=0.08941, over 4264883.08 frames. 
], batch size: 118, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:23:07,170 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 3.234e+02 3.747e+02 4.482e+02 8.068e+02, threshold=7.493e+02, percent-clipped=4.0 2023-06-21 20:23:13,978 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.05 vs. limit=10.0 2023-06-21 20:23:26,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1053498.0, ans=0.1 2023-06-21 20:23:42,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1053558.0, ans=0.05 2023-06-21 20:23:53,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1053618.0, ans=0.1 2023-06-21 20:23:58,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1053618.0, ans=0.2 2023-06-21 20:24:35,506 INFO [train.py:996] (0/4) Epoch 6, batch 23150, loss[loss=0.2607, simple_loss=0.3194, pruned_loss=0.1011, over 21767.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3136, pruned_loss=0.08829, over 4272660.24 frames. ], batch size: 441, lr: 4.98e-03, grad_scale: 16.0 2023-06-21 20:25:09,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1053858.0, ans=0.125 2023-06-21 20:25:10,885 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:25:16,182 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-21 20:25:28,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1053918.0, ans=22.5 2023-06-21 20:25:37,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1053978.0, ans=0.125 2023-06-21 20:25:58,106 INFO [train.py:996] (0/4) Epoch 6, batch 23200, loss[loss=0.2327, simple_loss=0.2976, pruned_loss=0.08386, over 21550.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3125, pruned_loss=0.08863, over 4276110.77 frames. ], batch size: 212, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:26:13,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.774e+02 3.196e+02 3.706e+02 6.362e+02, threshold=6.391e+02, percent-clipped=0.0 2023-06-21 20:26:21,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1054098.0, ans=0.125 2023-06-21 20:26:32,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1054098.0, ans=0.0 2023-06-21 20:27:07,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1054278.0, ans=0.0 2023-06-21 20:27:07,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1054278.0, ans=0.5 2023-06-21 20:27:30,822 INFO [train.py:996] (0/4) Epoch 6, batch 23250, loss[loss=0.267, simple_loss=0.3257, pruned_loss=0.1041, over 21815.00 frames. 
], tot_loss[loss=0.2457, simple_loss=0.3122, pruned_loss=0.08963, over 4287504.95 frames. ], batch size: 282, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:27:31,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1054338.0, ans=0.1 2023-06-21 20:28:06,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1054458.0, ans=0.1 2023-06-21 20:28:33,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1054518.0, ans=0.125 2023-06-21 20:28:41,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1054578.0, ans=0.05 2023-06-21 20:29:03,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1054578.0, ans=0.035 2023-06-21 20:29:05,999 INFO [train.py:996] (0/4) Epoch 6, batch 23300, loss[loss=0.3353, simple_loss=0.4256, pruned_loss=0.1225, over 21520.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3198, pruned_loss=0.0913, over 4288808.18 frames. ], batch size: 471, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:29:06,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1054638.0, ans=0.125 2023-06-21 20:29:12,272 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 2.961e+02 3.509e+02 4.048e+02 6.618e+02, threshold=7.018e+02, percent-clipped=1.0 2023-06-21 20:29:14,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1054638.0, ans=0.0 2023-06-21 20:29:22,278 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.14 vs. limit=10.0 2023-06-21 20:30:10,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1054818.0, ans=0.125 2023-06-21 20:30:12,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1054818.0, ans=0.1 2023-06-21 20:30:34,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1054878.0, ans=0.0 2023-06-21 20:30:39,882 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.15 vs. limit=22.5 2023-06-21 20:30:40,393 INFO [train.py:996] (0/4) Epoch 6, batch 23350, loss[loss=0.1831, simple_loss=0.2589, pruned_loss=0.0536, over 21263.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3237, pruned_loss=0.09008, over 4278424.90 frames. 
], batch size: 176, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:30:40,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1054938.0, ans=0.125 2023-06-21 20:30:45,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1054938.0, ans=0.07 2023-06-21 20:31:08,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1054998.0, ans=0.1 2023-06-21 20:31:11,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1055058.0, ans=0.125 2023-06-21 20:31:27,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1055058.0, ans=0.0 2023-06-21 20:32:01,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1055178.0, ans=0.125 2023-06-21 20:32:12,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1055238.0, ans=0.125 2023-06-21 20:32:12,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.51 vs. limit=22.5 2023-06-21 20:32:13,199 INFO [train.py:996] (0/4) Epoch 6, batch 23400, loss[loss=0.2237, simple_loss=0.2896, pruned_loss=0.07894, over 21276.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3165, pruned_loss=0.08605, over 4279330.68 frames. ], batch size: 159, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:32:18,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.966e+02 3.517e+02 4.346e+02 6.933e+02, threshold=7.034e+02, percent-clipped=0.0 2023-06-21 20:32:42,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1055358.0, ans=0.0 2023-06-21 20:32:45,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1055358.0, ans=0.125 2023-06-21 20:32:46,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1055358.0, ans=0.125 2023-06-21 20:33:20,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1055418.0, ans=0.125 2023-06-21 20:33:47,375 INFO [train.py:996] (0/4) Epoch 6, batch 23450, loss[loss=0.2729, simple_loss=0.3364, pruned_loss=0.1047, over 21288.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3178, pruned_loss=0.08874, over 4278477.00 frames. ], batch size: 143, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:34:19,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1055658.0, ans=0.125 2023-06-21 20:34:36,138 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:35:02,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1055718.0, ans=0.125 2023-06-21 20:35:09,877 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. 
limit=6.0 2023-06-21 20:35:11,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1055778.0, ans=0.125 2023-06-21 20:35:20,286 INFO [train.py:996] (0/4) Epoch 6, batch 23500, loss[loss=0.236, simple_loss=0.2979, pruned_loss=0.08709, over 21833.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3179, pruned_loss=0.09049, over 4285365.87 frames. ], batch size: 247, lr: 4.98e-03, grad_scale: 16.0 2023-06-21 20:35:22,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1055838.0, ans=0.1 2023-06-21 20:35:27,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.361e+02 2.940e+02 3.315e+02 3.870e+02 5.953e+02, threshold=6.630e+02, percent-clipped=0.0 2023-06-21 20:35:48,003 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=12.0 2023-06-21 20:36:06,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1055958.0, ans=0.125 2023-06-21 20:36:08,105 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-176000.pt 2023-06-21 20:36:34,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1056018.0, ans=0.0 2023-06-21 20:36:53,713 INFO [train.py:996] (0/4) Epoch 6, batch 23550, loss[loss=0.2259, simple_loss=0.2907, pruned_loss=0.08056, over 21866.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3118, pruned_loss=0.08976, over 4271945.21 frames. ], batch size: 107, lr: 4.98e-03, grad_scale: 16.0 2023-06-21 20:37:09,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1056198.0, ans=0.1 2023-06-21 20:37:42,450 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:37:47,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-21 20:38:17,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1056378.0, ans=0.0 2023-06-21 20:38:22,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1056378.0, ans=0.125 2023-06-21 20:38:27,646 INFO [train.py:996] (0/4) Epoch 6, batch 23600, loss[loss=0.3186, simple_loss=0.3837, pruned_loss=0.1268, over 21847.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3134, pruned_loss=0.09013, over 4261710.43 frames. 
], batch size: 124, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:38:34,996 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.807e+02 3.254e+02 4.113e+02 6.430e+02, threshold=6.509e+02, percent-clipped=0.0 2023-06-21 20:38:37,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1056438.0, ans=0.1 2023-06-21 20:38:39,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1056438.0, ans=0.1 2023-06-21 20:38:46,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1056498.0, ans=0.125 2023-06-21 20:39:08,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1056558.0, ans=0.09899494936611666 2023-06-21 20:39:34,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1056618.0, ans=0.0 2023-06-21 20:39:41,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1056618.0, ans=0.125 2023-06-21 20:39:45,059 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:39:53,239 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=12.0 2023-06-21 20:39:57,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1056678.0, ans=0.1 2023-06-21 20:40:01,263 INFO [train.py:996] (0/4) Epoch 6, batch 23650, loss[loss=0.3135, simple_loss=0.3764, pruned_loss=0.1253, over 21381.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.314, pruned_loss=0.08773, over 4264188.62 frames. ], batch size: 507, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:40:06,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-21 20:40:10,849 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:40:26,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1056798.0, ans=0.1 2023-06-21 20:40:30,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1056798.0, ans=0.0 2023-06-21 20:41:07,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1056918.0, ans=0.0 2023-06-21 20:41:23,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1056978.0, ans=0.0 2023-06-21 20:41:39,603 INFO [train.py:996] (0/4) Epoch 6, batch 23700, loss[loss=0.2157, simple_loss=0.2853, pruned_loss=0.07304, over 21297.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3168, pruned_loss=0.08786, over 4265447.49 frames. 
], batch size: 176, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:41:40,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1057038.0, ans=0.0 2023-06-21 20:41:51,810 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.889e+02 3.360e+02 4.132e+02 7.517e+02, threshold=6.720e+02, percent-clipped=1.0 2023-06-21 20:41:58,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1057098.0, ans=0.1 2023-06-21 20:42:27,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1057158.0, ans=0.0 2023-06-21 20:42:39,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1057218.0, ans=0.125 2023-06-21 20:42:52,319 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-21 20:42:59,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1057278.0, ans=0.2 2023-06-21 20:43:20,236 INFO [train.py:996] (0/4) Epoch 6, batch 23750, loss[loss=0.1874, simple_loss=0.2857, pruned_loss=0.04455, over 21751.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3201, pruned_loss=0.08892, over 4267714.57 frames. ], batch size: 332, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:43:27,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1057338.0, ans=0.1 2023-06-21 20:43:53,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1057398.0, ans=0.05 2023-06-21 20:44:31,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1057578.0, ans=0.1 2023-06-21 20:44:47,631 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-21 20:44:55,718 INFO [train.py:996] (0/4) Epoch 6, batch 23800, loss[loss=0.3383, simple_loss=0.4099, pruned_loss=0.1334, over 21439.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3196, pruned_loss=0.08697, over 4262481.82 frames. ], batch size: 471, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:45:03,406 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.613e+02 2.976e+02 3.389e+02 5.789e+02, threshold=5.953e+02, percent-clipped=0.0 2023-06-21 20:45:30,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1057758.0, ans=0.125 2023-06-21 20:45:34,519 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:45:48,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1057818.0, ans=0.125 2023-06-21 20:46:30,954 INFO [train.py:996] (0/4) Epoch 6, batch 23850, loss[loss=0.2671, simple_loss=0.3323, pruned_loss=0.101, over 21380.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3272, pruned_loss=0.08914, over 4264471.05 frames. 
], batch size: 176, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:46:37,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1057938.0, ans=0.0 2023-06-21 20:46:44,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1057998.0, ans=0.0 2023-06-21 20:46:53,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1057998.0, ans=0.125 2023-06-21 20:47:47,420 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2023-06-21 20:48:00,302 INFO [train.py:996] (0/4) Epoch 6, batch 23900, loss[loss=0.2654, simple_loss=0.3359, pruned_loss=0.09743, over 21736.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3341, pruned_loss=0.09068, over 4264970.41 frames. ], batch size: 124, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:48:07,692 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 3.320e+02 3.834e+02 4.673e+02 6.802e+02, threshold=7.669e+02, percent-clipped=5.0 2023-06-21 20:48:12,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1058238.0, ans=0.125 2023-06-21 20:48:14,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1058298.0, ans=0.0 2023-06-21 20:48:26,295 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-21 20:48:51,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1058358.0, ans=0.125 2023-06-21 20:49:33,463 INFO [train.py:996] (0/4) Epoch 6, batch 23950, loss[loss=0.2289, simple_loss=0.287, pruned_loss=0.0854, over 21451.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3274, pruned_loss=0.09043, over 4254524.91 frames. ], batch size: 211, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:49:40,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-21 20:50:03,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1058658.0, ans=0.125 2023-06-21 20:50:04,539 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:50:31,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1058718.0, ans=0.1 2023-06-21 20:50:36,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1058718.0, ans=0.0 2023-06-21 20:50:54,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1058778.0, ans=0.1 2023-06-21 20:51:08,229 INFO [train.py:996] (0/4) Epoch 6, batch 24000, loss[loss=0.3181, simple_loss=0.3788, pruned_loss=0.1287, over 21596.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3284, pruned_loss=0.0931, over 4245583.48 frames. 
], batch size: 415, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:51:08,230 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 20:51:24,743 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2687, simple_loss=0.3663, pruned_loss=0.08552, over 1796401.00 frames. 2023-06-21 20:51:24,744 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 20:51:25,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1058838.0, ans=0.125 2023-06-21 20:51:32,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.640e+02 3.190e+02 3.718e+02 4.654e+02 6.990e+02, threshold=7.435e+02, percent-clipped=0.0 2023-06-21 20:52:36,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1059018.0, ans=0.0 2023-06-21 20:52:54,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1059078.0, ans=0.0 2023-06-21 20:52:58,920 INFO [train.py:996] (0/4) Epoch 6, batch 24050, loss[loss=0.1848, simple_loss=0.2764, pruned_loss=0.04662, over 21402.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.33, pruned_loss=0.09347, over 4257681.45 frames. ], batch size: 194, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:53:02,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1059138.0, ans=0.125 2023-06-21 20:53:09,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1059138.0, ans=0.125 2023-06-21 20:53:34,781 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-21 20:53:40,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1059198.0, ans=0.025 2023-06-21 20:54:18,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1059378.0, ans=0.125 2023-06-21 20:54:33,355 INFO [train.py:996] (0/4) Epoch 6, batch 24100, loss[loss=0.3232, simple_loss=0.3865, pruned_loss=0.1299, over 21446.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3297, pruned_loss=0.0912, over 4260238.22 frames. ], batch size: 471, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:54:40,891 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.753e+02 3.093e+02 3.531e+02 5.265e+02, threshold=6.186e+02, percent-clipped=0.0 2023-06-21 20:55:16,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1059558.0, ans=10.0 2023-06-21 20:55:30,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1059558.0, ans=0.125 2023-06-21 20:55:40,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1059618.0, ans=0.0 2023-06-21 20:55:49,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.41 vs. limit=15.0 2023-06-21 20:56:07,341 INFO [train.py:996] (0/4) Epoch 6, batch 24150, loss[loss=0.2923, simple_loss=0.3435, pruned_loss=0.1206, over 21599.00 frames. 
], tot_loss[loss=0.2583, simple_loss=0.3299, pruned_loss=0.09333, over 4266451.66 frames. ], batch size: 471, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 20:56:45,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1059798.0, ans=0.1 2023-06-21 20:57:50,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1060038.0, ans=0.0 2023-06-21 20:57:51,182 INFO [train.py:996] (0/4) Epoch 6, batch 24200, loss[loss=0.2355, simple_loss=0.3133, pruned_loss=0.07886, over 21481.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.332, pruned_loss=0.0944, over 4275245.43 frames. ], batch size: 212, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 20:58:05,325 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 3.135e+02 3.604e+02 4.507e+02 8.443e+02, threshold=7.208e+02, percent-clipped=5.0 2023-06-21 20:58:10,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1060098.0, ans=0.125 2023-06-21 20:58:37,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1060158.0, ans=0.025 2023-06-21 20:59:30,806 INFO [train.py:996] (0/4) Epoch 6, batch 24250, loss[loss=0.2635, simple_loss=0.3484, pruned_loss=0.08928, over 21452.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3285, pruned_loss=0.08691, over 4276640.73 frames. ], batch size: 507, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 21:01:03,905 INFO [train.py:996] (0/4) Epoch 6, batch 24300, loss[loss=0.1646, simple_loss=0.2483, pruned_loss=0.04042, over 21757.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3205, pruned_loss=0.08104, over 4277227.50 frames. ], batch size: 282, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 21:01:12,932 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.484e+02 3.071e+02 3.742e+02 5.232e+02, threshold=6.142e+02, percent-clipped=0.0 2023-06-21 21:01:54,423 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0 2023-06-21 21:01:54,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.48 vs. limit=10.0 2023-06-21 21:02:37,556 INFO [train.py:996] (0/4) Epoch 6, batch 24350, loss[loss=0.2269, simple_loss=0.2828, pruned_loss=0.08548, over 20234.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3175, pruned_loss=0.08241, over 4282170.65 frames. ], batch size: 702, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 21:03:17,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1061058.0, ans=0.125 2023-06-21 21:04:16,773 INFO [train.py:996] (0/4) Epoch 6, batch 24400, loss[loss=0.2432, simple_loss=0.3119, pruned_loss=0.08723, over 21694.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.322, pruned_loss=0.08557, over 4278653.72 frames. 
], batch size: 351, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 21:04:25,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.143e+02 3.570e+02 4.226e+02 5.954e+02, threshold=7.140e+02, percent-clipped=0.0 2023-06-21 21:04:52,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1061358.0, ans=0.2 2023-06-21 21:05:01,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1061358.0, ans=0.1 2023-06-21 21:05:27,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1061418.0, ans=0.125 2023-06-21 21:05:43,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1061478.0, ans=0.2 2023-06-21 21:05:48,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1061478.0, ans=0.125 2023-06-21 21:05:51,491 INFO [train.py:996] (0/4) Epoch 6, batch 24450, loss[loss=0.2839, simple_loss=0.38, pruned_loss=0.09391, over 21678.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3256, pruned_loss=0.08791, over 4271097.40 frames. ], batch size: 414, lr: 4.96e-03, grad_scale: 32.0 2023-06-21 21:07:24,541 INFO [train.py:996] (0/4) Epoch 6, batch 24500, loss[loss=0.2582, simple_loss=0.327, pruned_loss=0.09474, over 21883.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3275, pruned_loss=0.08864, over 4277238.55 frames. ], batch size: 107, lr: 4.96e-03, grad_scale: 32.0 2023-06-21 21:07:25,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0 2023-06-21 21:07:30,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-21 21:07:33,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.842e+02 3.184e+02 3.780e+02 5.341e+02, threshold=6.369e+02, percent-clipped=0.0 2023-06-21 21:07:37,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1061838.0, ans=0.1 2023-06-21 21:07:45,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1061898.0, ans=0.02 2023-06-21 21:08:16,681 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:08:18,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1061958.0, ans=0.0 2023-06-21 21:08:58,981 INFO [train.py:996] (0/4) Epoch 6, batch 24550, loss[loss=0.3243, simple_loss=0.3914, pruned_loss=0.1286, over 21552.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3298, pruned_loss=0.09152, over 4274247.98 frames. 
], batch size: 414, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:09:04,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1062138.0, ans=0.04949747468305833 2023-06-21 21:09:49,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1062258.0, ans=0.1 2023-06-21 21:09:52,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1062258.0, ans=0.2 2023-06-21 21:10:06,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1062318.0, ans=0.05 2023-06-21 21:10:29,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1062378.0, ans=0.0 2023-06-21 21:10:33,718 INFO [train.py:996] (0/4) Epoch 6, batch 24600, loss[loss=0.2271, simple_loss=0.2787, pruned_loss=0.08771, over 21229.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3253, pruned_loss=0.09209, over 4278544.26 frames. ], batch size: 143, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:10:35,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1062438.0, ans=0.0 2023-06-21 21:10:44,186 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 2.960e+02 3.461e+02 4.086e+02 6.859e+02, threshold=6.922e+02, percent-clipped=1.0 2023-06-21 21:11:28,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1062558.0, ans=0.125 2023-06-21 21:11:57,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1062678.0, ans=0.1 2023-06-21 21:12:08,346 INFO [train.py:996] (0/4) Epoch 6, batch 24650, loss[loss=0.2032, simple_loss=0.2663, pruned_loss=0.07006, over 21554.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3172, pruned_loss=0.09095, over 4265872.03 frames. ], batch size: 391, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:12:17,167 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=22.5 2023-06-21 21:12:28,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1062798.0, ans=0.2 2023-06-21 21:12:48,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1062798.0, ans=0.09899494936611666 2023-06-21 21:12:59,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=15.0 2023-06-21 21:13:41,569 INFO [train.py:996] (0/4) Epoch 6, batch 24700, loss[loss=0.2124, simple_loss=0.277, pruned_loss=0.07396, over 21824.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3146, pruned_loss=0.08904, over 4258127.76 frames. 
], batch size: 98, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:13:49,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1063038.0, ans=0.0 2023-06-21 21:13:50,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1063038.0, ans=0.0 2023-06-21 21:13:51,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 2.793e+02 3.149e+02 3.525e+02 6.939e+02, threshold=6.298e+02, percent-clipped=1.0 2023-06-21 21:13:52,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1063038.0, ans=0.125 2023-06-21 21:14:20,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1063098.0, ans=0.125 2023-06-21 21:14:20,599 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-21 21:14:20,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=22.5 2023-06-21 21:14:20,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-21 21:14:23,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-21 21:14:49,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1063218.0, ans=0.0 2023-06-21 21:14:49,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1063218.0, ans=0.125 2023-06-21 21:14:58,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1063218.0, ans=0.0 2023-06-21 21:15:02,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1063278.0, ans=0.0 2023-06-21 21:15:15,547 INFO [train.py:996] (0/4) Epoch 6, batch 24750, loss[loss=0.183, simple_loss=0.2508, pruned_loss=0.05762, over 21643.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3087, pruned_loss=0.08656, over 4266930.90 frames. ], batch size: 282, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:15:26,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1063338.0, ans=0.0 2023-06-21 21:15:59,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1063458.0, ans=0.0 2023-06-21 21:16:12,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1063458.0, ans=0.5 2023-06-21 21:16:37,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1063578.0, ans=0.125 2023-06-21 21:16:42,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1063578.0, ans=0.0 2023-06-21 21:16:44,427 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. 
limit=15.0 2023-06-21 21:16:46,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1063578.0, ans=0.125 2023-06-21 21:16:49,357 INFO [train.py:996] (0/4) Epoch 6, batch 24800, loss[loss=0.2465, simple_loss=0.3026, pruned_loss=0.09525, over 21623.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3026, pruned_loss=0.08519, over 4274688.57 frames. ], batch size: 263, lr: 4.96e-03, grad_scale: 32.0 2023-06-21 21:16:55,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1063638.0, ans=0.125 2023-06-21 21:16:57,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1063638.0, ans=0.04949747468305833 2023-06-21 21:17:07,690 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.811e+02 3.326e+02 3.870e+02 1.010e+03, threshold=6.653e+02, percent-clipped=1.0 2023-06-21 21:18:20,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1063878.0, ans=0.04949747468305833 2023-06-21 21:18:22,758 INFO [train.py:996] (0/4) Epoch 6, batch 24850, loss[loss=0.2321, simple_loss=0.2943, pruned_loss=0.08491, over 21750.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3025, pruned_loss=0.08612, over 4281785.45 frames. ], batch size: 247, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:18:26,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1063938.0, ans=0.125 2023-06-21 21:18:52,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1063998.0, ans=0.1 2023-06-21 21:19:00,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1063998.0, ans=0.2 2023-06-21 21:19:34,880 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:19:57,068 INFO [train.py:996] (0/4) Epoch 6, batch 24900, loss[loss=0.3323, simple_loss=0.386, pruned_loss=0.1393, over 21429.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.305, pruned_loss=0.08709, over 4272037.31 frames. ], batch size: 471, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:20:15,242 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.136e+02 3.665e+02 4.988e+02 9.346e+02, threshold=7.330e+02, percent-clipped=11.0 2023-06-21 21:21:21,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1064478.0, ans=0.1 2023-06-21 21:21:38,229 INFO [train.py:996] (0/4) Epoch 6, batch 24950, loss[loss=0.2625, simple_loss=0.3245, pruned_loss=0.1002, over 20678.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3145, pruned_loss=0.09217, over 4271068.94 frames. 
], batch size: 607, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:22:20,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1064598.0, ans=0.0 2023-06-21 21:22:27,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1064658.0, ans=0.0 2023-06-21 21:22:47,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1064718.0, ans=0.125 2023-06-21 21:23:18,873 INFO [train.py:996] (0/4) Epoch 6, batch 25000, loss[loss=0.2618, simple_loss=0.3346, pruned_loss=0.0945, over 20723.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3221, pruned_loss=0.09422, over 4274373.67 frames. ], batch size: 607, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:23:29,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1064838.0, ans=0.125 2023-06-21 21:23:29,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1064838.0, ans=0.125 2023-06-21 21:23:36,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 2.934e+02 3.469e+02 4.480e+02 7.234e+02, threshold=6.939e+02, percent-clipped=0.0 2023-06-21 21:24:05,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1064958.0, ans=0.0 2023-06-21 21:24:16,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1065018.0, ans=0.1 2023-06-21 21:24:21,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-21 21:24:38,202 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-21 21:24:52,471 INFO [train.py:996] (0/4) Epoch 6, batch 25050, loss[loss=0.2093, simple_loss=0.2731, pruned_loss=0.07277, over 21749.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3156, pruned_loss=0.09247, over 4270733.42 frames. ], batch size: 317, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:25:46,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1065318.0, ans=0.04949747468305833 2023-06-21 21:26:09,764 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=22.5 2023-06-21 21:26:27,041 INFO [train.py:996] (0/4) Epoch 6, batch 25100, loss[loss=0.2688, simple_loss=0.3851, pruned_loss=0.07623, over 19738.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3101, pruned_loss=0.09005, over 4270359.50 frames. ], batch size: 702, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:26:45,420 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.865e+02 3.430e+02 4.483e+02 9.616e+02, threshold=6.861e+02, percent-clipped=4.0 2023-06-21 21:27:39,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1065678.0, ans=0.2 2023-06-21 21:28:01,882 INFO [train.py:996] (0/4) Epoch 6, batch 25150, loss[loss=0.287, simple_loss=0.3536, pruned_loss=0.1102, over 21687.00 frames. 
], tot_loss[loss=0.2453, simple_loss=0.3142, pruned_loss=0.08824, over 4257552.40 frames. ], batch size: 508, lr: 4.95e-03, grad_scale: 8.0 2023-06-21 21:28:36,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=22.5 2023-06-21 21:28:54,701 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=15.0 2023-06-21 21:28:58,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1065918.0, ans=10.0 2023-06-21 21:29:10,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1065978.0, ans=0.025 2023-06-21 21:29:17,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1065978.0, ans=0.0 2023-06-21 21:29:28,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-21 21:29:32,239 INFO [train.py:996] (0/4) Epoch 6, batch 25200, loss[loss=0.2279, simple_loss=0.3266, pruned_loss=0.06457, over 21717.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3128, pruned_loss=0.08503, over 4263022.80 frames. ], batch size: 351, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:29:55,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.627e+02 3.080e+02 3.902e+02 5.113e+02, threshold=6.160e+02, percent-clipped=0.0 2023-06-21 21:30:23,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1066158.0, ans=0.125 2023-06-21 21:31:06,411 INFO [train.py:996] (0/4) Epoch 6, batch 25250, loss[loss=0.2002, simple_loss=0.2678, pruned_loss=0.06634, over 21694.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3093, pruned_loss=0.08352, over 4264526.98 frames. ], batch size: 282, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:31:14,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1066338.0, ans=0.125 2023-06-21 21:31:36,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.15 vs. limit=10.0 2023-06-21 21:32:08,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1066518.0, ans=0.125 2023-06-21 21:32:46,530 INFO [train.py:996] (0/4) Epoch 6, batch 25300, loss[loss=0.2971, simple_loss=0.359, pruned_loss=0.1176, over 21444.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3068, pruned_loss=0.08278, over 4262249.22 frames. 
], batch size: 131, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:33:05,222 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.867e+02 3.250e+02 3.935e+02 6.834e+02, threshold=6.501e+02, percent-clipped=3.0 2023-06-21 21:33:11,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1066698.0, ans=0.125 2023-06-21 21:33:11,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1066698.0, ans=0.125 2023-06-21 21:33:28,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1066758.0, ans=0.125 2023-06-21 21:33:56,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1066818.0, ans=0.0 2023-06-21 21:34:16,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1066878.0, ans=0.125 2023-06-21 21:34:17,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1066878.0, ans=0.125 2023-06-21 21:34:21,342 INFO [train.py:996] (0/4) Epoch 6, batch 25350, loss[loss=0.1805, simple_loss=0.2642, pruned_loss=0.04839, over 21363.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3088, pruned_loss=0.08241, over 4262373.33 frames. ], batch size: 194, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:34:36,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1066938.0, ans=0.0 2023-06-21 21:35:49,711 INFO [train.py:996] (0/4) Epoch 6, batch 25400, loss[loss=0.2196, simple_loss=0.2791, pruned_loss=0.08003, over 21165.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3047, pruned_loss=0.08208, over 4260412.53 frames. ], batch size: 548, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:35:50,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1067238.0, ans=0.125 2023-06-21 21:35:59,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1067238.0, ans=0.2 2023-06-21 21:36:13,090 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.660e+02 3.051e+02 3.605e+02 5.899e+02, threshold=6.102e+02, percent-clipped=0.0 2023-06-21 21:37:23,848 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=12.0 2023-06-21 21:37:30,719 INFO [train.py:996] (0/4) Epoch 6, batch 25450, loss[loss=0.2247, simple_loss=0.3221, pruned_loss=0.06365, over 21748.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3065, pruned_loss=0.08393, over 4258182.56 frames. ], batch size: 298, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:38:03,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.58 vs. 
limit=22.5 2023-06-21 21:38:12,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1067658.0, ans=0.125 2023-06-21 21:38:20,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1067658.0, ans=0.5 2023-06-21 21:38:43,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1067718.0, ans=0.0 2023-06-21 21:38:43,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1067718.0, ans=0.0 2023-06-21 21:38:44,159 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=22.5 2023-06-21 21:39:10,268 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-21 21:39:10,625 INFO [train.py:996] (0/4) Epoch 6, batch 25500, loss[loss=0.2013, simple_loss=0.2858, pruned_loss=0.05842, over 21384.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3063, pruned_loss=0.08035, over 4251047.96 frames. ], batch size: 211, lr: 4.95e-03, grad_scale: 8.0 2023-06-21 21:39:12,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1067838.0, ans=0.025 2023-06-21 21:39:25,733 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.857e+02 3.431e+02 4.303e+02 7.136e+02, threshold=6.862e+02, percent-clipped=5.0 2023-06-21 21:39:27,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1067898.0, ans=0.125 2023-06-21 21:39:43,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1067898.0, ans=0.0 2023-06-21 21:39:51,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1067958.0, ans=0.035 2023-06-21 21:40:45,937 INFO [train.py:996] (0/4) Epoch 6, batch 25550, loss[loss=0.2203, simple_loss=0.3195, pruned_loss=0.06062, over 21768.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3124, pruned_loss=0.08134, over 4239066.12 frames. ], batch size: 332, lr: 4.95e-03, grad_scale: 8.0 2023-06-21 21:40:58,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1068138.0, ans=0.0 2023-06-21 21:41:00,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1068198.0, ans=15.0 2023-06-21 21:41:26,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1068258.0, ans=0.125 2023-06-21 21:41:42,303 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.82 vs. 
limit=15.0 2023-06-21 21:41:51,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1068318.0, ans=0.0 2023-06-21 21:41:54,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1068318.0, ans=0.125 2023-06-21 21:41:56,972 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:42:21,199 INFO [train.py:996] (0/4) Epoch 6, batch 25600, loss[loss=0.2577, simple_loss=0.3277, pruned_loss=0.09387, over 21778.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3163, pruned_loss=0.08181, over 4253190.40 frames. ], batch size: 298, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:42:29,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1068438.0, ans=0.0 2023-06-21 21:42:41,376 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.868e+02 3.276e+02 3.835e+02 9.464e+02, threshold=6.552e+02, percent-clipped=3.0 2023-06-21 21:43:12,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1068558.0, ans=0.1 2023-06-21 21:43:12,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1068558.0, ans=0.0 2023-06-21 21:43:44,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1068678.0, ans=0.125 2023-06-21 21:43:53,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1068678.0, ans=0.125 2023-06-21 21:43:55,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1068738.0, ans=0.0 2023-06-21 21:43:56,067 INFO [train.py:996] (0/4) Epoch 6, batch 25650, loss[loss=0.2087, simple_loss=0.2667, pruned_loss=0.07531, over 21555.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3193, pruned_loss=0.08586, over 4263858.14 frames. ], batch size: 263, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:45:00,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1068918.0, ans=0.125 2023-06-21 21:45:28,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-21 21:45:28,620 INFO [train.py:996] (0/4) Epoch 6, batch 25700, loss[loss=0.2453, simple_loss=0.3232, pruned_loss=0.08375, over 21271.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3156, pruned_loss=0.08689, over 4271345.71 frames. 
], batch size: 143, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:45:48,640 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 2.859e+02 3.225e+02 3.794e+02 7.100e+02, threshold=6.450e+02, percent-clipped=1.0 2023-06-21 21:45:50,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1069098.0, ans=0.1 2023-06-21 21:46:05,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1069098.0, ans=0.2 2023-06-21 21:46:10,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1069158.0, ans=0.05 2023-06-21 21:46:18,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1069158.0, ans=0.125 2023-06-21 21:46:20,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069158.0, ans=0.1 2023-06-21 21:46:31,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1069218.0, ans=0.125 2023-06-21 21:46:47,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1069218.0, ans=0.2 2023-06-21 21:47:05,082 INFO [train.py:996] (0/4) Epoch 6, batch 25750, loss[loss=0.3077, simple_loss=0.3761, pruned_loss=0.1197, over 21765.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3213, pruned_loss=0.09, over 4281810.36 frames. ], batch size: 441, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:47:17,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1069338.0, ans=0.125 2023-06-21 21:47:28,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069398.0, ans=0.1 2023-06-21 21:47:30,831 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.25 vs. limit=10.0 2023-06-21 21:47:33,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1069398.0, ans=0.0 2023-06-21 21:48:50,362 INFO [train.py:996] (0/4) Epoch 6, batch 25800, loss[loss=0.3296, simple_loss=0.4013, pruned_loss=0.129, over 21379.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3327, pruned_loss=0.09479, over 4280745.78 frames. ], batch size: 131, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:49:10,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 3.168e+02 3.870e+02 4.969e+02 1.145e+03, threshold=7.739e+02, percent-clipped=13.0 2023-06-21 21:49:20,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069698.0, ans=0.1 2023-06-21 21:49:24,205 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.92 vs. 
limit=12.0 2023-06-21 21:49:33,779 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:49:44,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1069818.0, ans=0.0 2023-06-21 21:49:55,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1069818.0, ans=0.0 2023-06-21 21:50:25,976 INFO [train.py:996] (0/4) Epoch 6, batch 25850, loss[loss=0.2313, simple_loss=0.3029, pruned_loss=0.07989, over 21679.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3343, pruned_loss=0.09388, over 4284894.06 frames. ], batch size: 230, lr: 4.94e-03, grad_scale: 16.0 2023-06-21 21:50:40,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1069938.0, ans=0.0 2023-06-21 21:51:02,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1069998.0, ans=0.125 2023-06-21 21:51:32,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1070118.0, ans=0.0 2023-06-21 21:52:10,927 INFO [train.py:996] (0/4) Epoch 6, batch 25900, loss[loss=0.3008, simple_loss=0.3947, pruned_loss=0.1035, over 21357.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3368, pruned_loss=0.09517, over 4289650.35 frames. ], batch size: 548, lr: 4.94e-03, grad_scale: 16.0 2023-06-21 21:52:25,940 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 3.005e+02 3.553e+02 4.246e+02 7.646e+02, threshold=7.106e+02, percent-clipped=0.0 2023-06-21 21:52:43,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1070298.0, ans=0.2 2023-06-21 21:52:57,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1070358.0, ans=0.125 2023-06-21 21:53:09,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1070418.0, ans=0.0 2023-06-21 21:53:10,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1070418.0, ans=0.125 2023-06-21 21:53:10,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1070418.0, ans=0.125 2023-06-21 21:53:19,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1070478.0, ans=0.1 2023-06-21 21:53:45,645 INFO [train.py:996] (0/4) Epoch 6, batch 25950, loss[loss=0.2663, simple_loss=0.3355, pruned_loss=0.09857, over 21254.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3428, pruned_loss=0.09757, over 4288084.43 frames. 
], batch size: 159, lr: 4.94e-03, grad_scale: 16.0 2023-06-21 21:54:46,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1070718.0, ans=0.125 2023-06-21 21:55:09,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1070778.0, ans=0.125 2023-06-21 21:55:14,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1070778.0, ans=0.07 2023-06-21 21:55:21,682 INFO [train.py:996] (0/4) Epoch 6, batch 26000, loss[loss=0.2298, simple_loss=0.3202, pruned_loss=0.0697, over 21713.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.342, pruned_loss=0.0952, over 4276888.73 frames. ], batch size: 124, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 21:55:41,609 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 3.120e+02 3.589e+02 4.615e+02 8.181e+02, threshold=7.178e+02, percent-clipped=1.0 2023-06-21 21:55:49,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1070898.0, ans=0.035 2023-06-21 21:56:07,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1070958.0, ans=0.125 2023-06-21 21:56:43,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1071078.0, ans=0.125 2023-06-21 21:56:55,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0 2023-06-21 21:56:55,914 INFO [train.py:996] (0/4) Epoch 6, batch 26050, loss[loss=0.2462, simple_loss=0.3093, pruned_loss=0.09151, over 21744.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.343, pruned_loss=0.0963, over 4272026.81 frames. ], batch size: 112, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 21:57:32,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1071258.0, ans=0.0 2023-06-21 21:57:52,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.85 vs. limit=12.0 2023-06-21 21:58:16,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1071378.0, ans=0.125 2023-06-21 21:58:29,361 INFO [train.py:996] (0/4) Epoch 6, batch 26100, loss[loss=0.237, simple_loss=0.2957, pruned_loss=0.08918, over 21479.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3358, pruned_loss=0.09553, over 4280423.59 frames. 
], batch size: 194, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 21:58:31,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1071438.0, ans=0.125 2023-06-21 21:58:48,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1071498.0, ans=0.0 2023-06-21 21:58:49,055 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 3.066e+02 3.551e+02 4.321e+02 9.246e+02, threshold=7.101e+02, percent-clipped=1.0 2023-06-21 21:58:49,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1071498.0, ans=0.125 2023-06-21 21:59:13,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1071558.0, ans=0.0 2023-06-21 22:00:03,477 INFO [train.py:996] (0/4) Epoch 6, batch 26150, loss[loss=0.2645, simple_loss=0.3382, pruned_loss=0.09539, over 21829.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3328, pruned_loss=0.09568, over 4280757.56 frames. ], batch size: 118, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:00:24,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1071798.0, ans=0.2 2023-06-21 22:00:24,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1071798.0, ans=0.1 2023-06-21 22:00:25,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1071798.0, ans=0.125 2023-06-21 22:01:19,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1071918.0, ans=0.0 2023-06-21 22:01:29,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1071978.0, ans=0.1 2023-06-21 22:01:43,218 INFO [train.py:996] (0/4) Epoch 6, batch 26200, loss[loss=0.2748, simple_loss=0.3702, pruned_loss=0.08974, over 21647.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3327, pruned_loss=0.09383, over 4278308.36 frames. ], batch size: 414, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:01:56,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1072038.0, ans=0.0 2023-06-21 22:01:58,605 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.878e+02 3.123e+02 3.619e+02 5.924e+02, threshold=6.246e+02, percent-clipped=0.0 2023-06-21 22:02:11,991 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.26 vs. limit=10.0 2023-06-21 22:03:06,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1072278.0, ans=0.0 2023-06-21 22:03:10,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1072278.0, ans=0.125 2023-06-21 22:03:17,237 INFO [train.py:996] (0/4) Epoch 6, batch 26250, loss[loss=0.2593, simple_loss=0.3298, pruned_loss=0.09442, over 16920.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3364, pruned_loss=0.0923, over 4280271.42 frames. 
], batch size: 64, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:03:48,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1072398.0, ans=0.0 2023-06-21 22:04:12,388 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=15.0 2023-06-21 22:04:12,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-21 22:04:23,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1072518.0, ans=0.125 2023-06-21 22:04:33,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1072578.0, ans=0.0 2023-06-21 22:04:46,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1072578.0, ans=0.125 2023-06-21 22:04:49,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1072638.0, ans=0.0 2023-06-21 22:04:50,877 INFO [train.py:996] (0/4) Epoch 6, batch 26300, loss[loss=0.2263, simple_loss=0.2915, pruned_loss=0.08053, over 21880.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3329, pruned_loss=0.09325, over 4284412.36 frames. ], batch size: 298, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:05:00,988 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-21 22:05:10,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.912e+02 3.361e+02 4.041e+02 6.857e+02, threshold=6.722e+02, percent-clipped=1.0 2023-06-21 22:05:37,269 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.28 vs. limit=15.0 2023-06-21 22:05:41,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1072758.0, ans=0.2 2023-06-21 22:06:29,503 INFO [train.py:996] (0/4) Epoch 6, batch 26350, loss[loss=0.2232, simple_loss=0.2901, pruned_loss=0.07819, over 21217.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3316, pruned_loss=0.09414, over 4284325.33 frames. ], batch size: 608, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:07:38,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1073118.0, ans=0.1 2023-06-21 22:07:41,862 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=22.5 2023-06-21 22:07:44,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.92 vs. limit=10.0 2023-06-21 22:07:49,033 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:08:02,667 INFO [train.py:996] (0/4) Epoch 6, batch 26400, loss[loss=0.2382, simple_loss=0.2938, pruned_loss=0.09132, over 21813.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3253, pruned_loss=0.09334, over 4277745.85 frames. 
], batch size: 98, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:08:12,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1073238.0, ans=0.125 2023-06-21 22:08:22,544 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 3.009e+02 3.283e+02 3.744e+02 6.986e+02, threshold=6.566e+02, percent-clipped=1.0 2023-06-21 22:08:56,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1073418.0, ans=0.125 2023-06-21 22:09:17,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-06-21 22:09:43,582 INFO [train.py:996] (0/4) Epoch 6, batch 26450, loss[loss=0.2339, simple_loss=0.2974, pruned_loss=0.08514, over 21184.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3234, pruned_loss=0.0925, over 4273427.60 frames. ], batch size: 159, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:09:59,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1073538.0, ans=0.0 2023-06-21 22:10:11,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-21 22:10:16,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1073598.0, ans=0.125 2023-06-21 22:10:34,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1073658.0, ans=0.0 2023-06-21 22:10:35,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.01 vs. limit=10.0 2023-06-21 22:11:23,585 INFO [train.py:996] (0/4) Epoch 6, batch 26500, loss[loss=0.2951, simple_loss=0.3734, pruned_loss=0.1084, over 21634.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3242, pruned_loss=0.09167, over 4270654.36 frames. ], batch size: 441, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:11:38,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.232e+02 3.914e+02 4.900e+02 8.574e+02, threshold=7.829e+02, percent-clipped=7.0 2023-06-21 22:12:23,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1074018.0, ans=0.0 2023-06-21 22:12:59,962 INFO [train.py:996] (0/4) Epoch 6, batch 26550, loss[loss=0.2013, simple_loss=0.2698, pruned_loss=0.0664, over 21286.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3225, pruned_loss=0.08948, over 4262987.31 frames. ], batch size: 176, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:14:09,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1074318.0, ans=0.125 2023-06-21 22:14:19,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1074378.0, ans=0.125 2023-06-21 22:14:34,304 INFO [train.py:996] (0/4) Epoch 6, batch 26600, loss[loss=0.2291, simple_loss=0.3148, pruned_loss=0.07165, over 21398.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3217, pruned_loss=0.08605, over 4264495.17 frames. 
], batch size: 211, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:14:58,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1074498.0, ans=0.0 2023-06-21 22:15:00,583 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.868e+02 3.429e+02 4.174e+02 7.700e+02, threshold=6.858e+02, percent-clipped=0.0 2023-06-21 22:15:10,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1074498.0, ans=0.125 2023-06-21 22:15:11,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1074498.0, ans=0.125 2023-06-21 22:15:35,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1074558.0, ans=0.0 2023-06-21 22:15:46,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1074618.0, ans=0.2 2023-06-21 22:15:48,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1074618.0, ans=0.125 2023-06-21 22:16:13,065 INFO [train.py:996] (0/4) Epoch 6, batch 26650, loss[loss=0.1631, simple_loss=0.2332, pruned_loss=0.04652, over 21184.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3149, pruned_loss=0.0838, over 4260489.00 frames. ], batch size: 176, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:16:16,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1074738.0, ans=0.2 2023-06-21 22:16:42,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1074798.0, ans=0.0 2023-06-21 22:16:47,424 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:17:35,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.94 vs. limit=6.0 2023-06-21 22:17:50,852 INFO [train.py:996] (0/4) Epoch 6, batch 26700, loss[loss=0.2003, simple_loss=0.2931, pruned_loss=0.05375, over 20794.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3074, pruned_loss=0.08063, over 4263826.59 frames. ], batch size: 609, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:18:07,351 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.839e+02 3.537e+02 4.249e+02 6.809e+02, threshold=7.074e+02, percent-clipped=0.0 2023-06-21 22:18:35,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1075158.0, ans=0.125 2023-06-21 22:18:47,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1075218.0, ans=0.1 2023-06-21 22:19:05,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1075278.0, ans=0.05 2023-06-21 22:19:25,053 INFO [train.py:996] (0/4) Epoch 6, batch 26750, loss[loss=0.2492, simple_loss=0.3208, pruned_loss=0.08879, over 21845.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3085, pruned_loss=0.08015, over 4270734.41 frames. 
], batch size: 107, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:19:34,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1075338.0, ans=0.2 2023-06-21 22:20:11,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1075458.0, ans=0.125 2023-06-21 22:20:15,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1075458.0, ans=0.125 2023-06-21 22:21:00,193 INFO [train.py:996] (0/4) Epoch 6, batch 26800, loss[loss=0.2596, simple_loss=0.3304, pruned_loss=0.09439, over 21315.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3178, pruned_loss=0.08595, over 4272000.18 frames. ], batch size: 176, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:21:23,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1075698.0, ans=0.0 2023-06-21 22:21:25,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.803e+02 3.255e+02 3.983e+02 6.627e+02, threshold=6.510e+02, percent-clipped=0.0 2023-06-21 22:21:36,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1075698.0, ans=0.0 2023-06-21 22:21:44,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1075758.0, ans=0.125 2023-06-21 22:21:57,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1075818.0, ans=0.025 2023-06-21 22:22:16,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1075818.0, ans=0.125 2023-06-21 22:22:22,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1075878.0, ans=0.09899494936611666 2023-06-21 22:22:28,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1075878.0, ans=0.0 2023-06-21 22:22:38,520 INFO [train.py:996] (0/4) Epoch 6, batch 26850, loss[loss=0.265, simple_loss=0.334, pruned_loss=0.09803, over 20687.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3207, pruned_loss=0.09, over 4272982.53 frames. ], batch size: 607, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:23:24,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1076058.0, ans=0.1 2023-06-21 22:23:56,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1076178.0, ans=0.125 2023-06-21 22:23:56,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1076178.0, ans=0.1 2023-06-21 22:24:06,570 INFO [train.py:996] (0/4) Epoch 6, batch 26900, loss[loss=0.2103, simple_loss=0.2651, pruned_loss=0.07772, over 21264.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.312, pruned_loss=0.08856, over 4271883.14 frames. 
], batch size: 177, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:24:08,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1076238.0, ans=0.0 2023-06-21 22:24:32,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 2.927e+02 3.403e+02 4.314e+02 6.686e+02, threshold=6.806e+02, percent-clipped=1.0 2023-06-21 22:24:37,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1076298.0, ans=0.125 2023-06-21 22:25:21,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1076418.0, ans=0.0 2023-06-21 22:25:28,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-21 22:25:31,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-21 22:25:40,933 INFO [train.py:996] (0/4) Epoch 6, batch 26950, loss[loss=0.2875, simple_loss=0.3593, pruned_loss=0.1079, over 21578.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.31, pruned_loss=0.08813, over 4270264.93 frames. ], batch size: 441, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:26:24,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1076658.0, ans=0.2 2023-06-21 22:27:15,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-21 22:27:20,706 INFO [train.py:996] (0/4) Epoch 6, batch 27000, loss[loss=0.2247, simple_loss=0.3162, pruned_loss=0.06655, over 20811.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3095, pruned_loss=0.08538, over 4265460.72 frames. ], batch size: 608, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:27:20,707 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-21 22:27:39,471 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2469, simple_loss=0.3452, pruned_loss=0.07428, over 1796401.00 frames. 2023-06-21 22:27:39,472 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-21 22:27:39,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1076838.0, ans=0.02 2023-06-21 22:27:47,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1076838.0, ans=0.125 2023-06-21 22:27:57,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.871e+02 3.391e+02 3.871e+02 6.119e+02, threshold=6.783e+02, percent-clipped=0.0 2023-06-21 22:27:58,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1076898.0, ans=0.0 2023-06-21 22:28:22,255 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:29:09,025 INFO [train.py:996] (0/4) Epoch 6, batch 27050, loss[loss=0.2405, simple_loss=0.3144, pruned_loss=0.08327, over 21868.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3124, pruned_loss=0.08239, over 4272889.19 frames. 
], batch size: 332, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:29:15,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1077138.0, ans=0.1 2023-06-21 22:29:19,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1077138.0, ans=0.2 2023-06-21 22:30:38,697 INFO [train.py:996] (0/4) Epoch 6, batch 27100, loss[loss=0.2323, simple_loss=0.3095, pruned_loss=0.07752, over 21837.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3149, pruned_loss=0.08337, over 4282050.13 frames. ], batch size: 107, lr: 4.93e-03, grad_scale: 8.0 2023-06-21 22:30:49,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1077438.0, ans=0.125 2023-06-21 22:30:52,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1077498.0, ans=0.0 2023-06-21 22:31:06,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.53 vs. limit=15.0 2023-06-21 22:31:08,047 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.819e+02 3.363e+02 4.112e+02 5.749e+02, threshold=6.726e+02, percent-clipped=0.0 2023-06-21 22:31:08,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1077498.0, ans=0.07 2023-06-21 22:31:12,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-21 22:32:06,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1077678.0, ans=0.1 2023-06-21 22:32:09,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1077678.0, ans=0.0 2023-06-21 22:32:13,383 INFO [train.py:996] (0/4) Epoch 6, batch 27150, loss[loss=0.2543, simple_loss=0.3304, pruned_loss=0.08906, over 21277.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3252, pruned_loss=0.08701, over 4285764.32 frames. ], batch size: 176, lr: 4.93e-03, grad_scale: 8.0 2023-06-21 22:32:41,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1077798.0, ans=0.07 2023-06-21 22:32:41,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1077798.0, ans=0.0 2023-06-21 22:33:03,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1077858.0, ans=0.07 2023-06-21 22:33:23,317 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:33:47,061 INFO [train.py:996] (0/4) Epoch 6, batch 27200, loss[loss=0.2689, simple_loss=0.3425, pruned_loss=0.09763, over 21931.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3337, pruned_loss=0.08992, over 4283021.25 frames. 
], batch size: 316, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:34:15,786 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.516e+02 3.235e+02 3.777e+02 4.284e+02 9.441e+02, threshold=7.555e+02, percent-clipped=8.0 2023-06-21 22:34:16,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1078098.0, ans=0.125 2023-06-21 22:34:20,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1078098.0, ans=0.0 2023-06-21 22:34:37,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1078158.0, ans=0.125 2023-06-21 22:34:45,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1078158.0, ans=0.0 2023-06-21 22:34:51,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1078218.0, ans=0.125 2023-06-21 22:34:58,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1078218.0, ans=0.125 2023-06-21 22:35:30,910 INFO [train.py:996] (0/4) Epoch 6, batch 27250, loss[loss=0.2555, simple_loss=0.3293, pruned_loss=0.09081, over 20688.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.337, pruned_loss=0.09369, over 4280871.86 frames. ], batch size: 607, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:35:45,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1078398.0, ans=0.07 2023-06-21 22:36:27,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1078518.0, ans=0.025 2023-06-21 22:36:47,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1078518.0, ans=0.0 2023-06-21 22:37:06,709 INFO [train.py:996] (0/4) Epoch 6, batch 27300, loss[loss=0.272, simple_loss=0.346, pruned_loss=0.09901, over 21264.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3389, pruned_loss=0.0942, over 4276820.34 frames. ], batch size: 159, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:37:36,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 3.091e+02 3.407e+02 3.961e+02 5.625e+02, threshold=6.815e+02, percent-clipped=0.0 2023-06-21 22:38:17,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1078818.0, ans=0.0 2023-06-21 22:38:31,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-21 22:38:45,654 INFO [train.py:996] (0/4) Epoch 6, batch 27350, loss[loss=0.2604, simple_loss=0.3489, pruned_loss=0.08596, over 21296.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.343, pruned_loss=0.09547, over 4281497.55 frames. ], batch size: 548, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:40:17,999 INFO [train.py:996] (0/4) Epoch 6, batch 27400, loss[loss=0.2365, simple_loss=0.2974, pruned_loss=0.0878, over 21622.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3376, pruned_loss=0.09525, over 4289853.92 frames. 
], batch size: 441, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:40:35,227 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:40:38,873 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-21 22:40:43,763 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 2.934e+02 3.230e+02 3.710e+02 5.363e+02, threshold=6.461e+02, percent-clipped=0.0 2023-06-21 22:41:00,833 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:41:34,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1079478.0, ans=0.125 2023-06-21 22:41:51,781 INFO [train.py:996] (0/4) Epoch 6, batch 27450, loss[loss=0.2377, simple_loss=0.3194, pruned_loss=0.07797, over 21418.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3311, pruned_loss=0.09322, over 4290205.46 frames. ], batch size: 194, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:41:54,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1079538.0, ans=0.2 2023-06-21 22:42:07,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1079538.0, ans=0.1 2023-06-21 22:42:17,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1079598.0, ans=0.0 2023-06-21 22:42:21,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1079598.0, ans=0.125 2023-06-21 22:42:37,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1079658.0, ans=0.125 2023-06-21 22:42:40,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1079658.0, ans=0.125 2023-06-21 22:42:59,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1079718.0, ans=0.2 2023-06-21 22:43:24,736 INFO [train.py:996] (0/4) Epoch 6, batch 27500, loss[loss=0.2575, simple_loss=0.3151, pruned_loss=0.0999, over 21643.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.329, pruned_loss=0.09334, over 4291271.46 frames. 
], batch size: 263, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:43:37,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1079838.0, ans=0.125 2023-06-21 22:43:43,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1079898.0, ans=0.125 2023-06-21 22:43:47,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1079898.0, ans=0.1 2023-06-21 22:43:47,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1079898.0, ans=0.125 2023-06-21 22:43:50,271 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.559e+02 2.999e+02 3.729e+02 4.399e+02 9.645e+02, threshold=7.458e+02, percent-clipped=3.0 2023-06-21 22:44:03,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1079958.0, ans=0.0 2023-06-21 22:44:12,347 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-180000.pt 2023-06-21 22:44:23,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1080018.0, ans=0.125 2023-06-21 22:44:25,208 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-21 22:44:56,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1080078.0, ans=0.0 2023-06-21 22:44:58,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1080138.0, ans=0.0 2023-06-21 22:44:59,255 INFO [train.py:996] (0/4) Epoch 6, batch 27550, loss[loss=0.1989, simple_loss=0.2669, pruned_loss=0.06542, over 21640.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3233, pruned_loss=0.0898, over 4292080.35 frames. ], batch size: 298, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:45:24,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1080198.0, ans=0.125 2023-06-21 22:45:25,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=8.0 2023-06-21 22:45:42,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1080258.0, ans=0.1 2023-06-21 22:45:46,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1080258.0, ans=0.0 2023-06-21 22:46:37,814 INFO [train.py:996] (0/4) Epoch 6, batch 27600, loss[loss=0.2255, simple_loss=0.2866, pruned_loss=0.08223, over 21785.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3165, pruned_loss=0.08898, over 4290775.06 frames. ], batch size: 112, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:46:39,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1080438.0, ans=0.0 2023-06-21 22:46:43,307 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.87 vs. 
limit=22.5 2023-06-21 22:46:58,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 2.844e+02 3.346e+02 3.964e+02 7.072e+02, threshold=6.692e+02, percent-clipped=0.0 2023-06-21 22:47:14,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1080558.0, ans=0.0 2023-06-21 22:47:15,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1080558.0, ans=0.2 2023-06-21 22:47:32,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1080618.0, ans=0.1 2023-06-21 22:47:38,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1080618.0, ans=0.1 2023-06-21 22:48:03,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1080678.0, ans=0.0 2023-06-21 22:48:06,209 INFO [train.py:996] (0/4) Epoch 6, batch 27650, loss[loss=0.245, simple_loss=0.3208, pruned_loss=0.08458, over 21841.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3113, pruned_loss=0.08834, over 4289051.59 frames. ], batch size: 371, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:48:08,671 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-21 22:48:50,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-21 22:48:52,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1080858.0, ans=0.0 2023-06-21 22:49:44,276 INFO [train.py:996] (0/4) Epoch 6, batch 27700, loss[loss=0.2473, simple_loss=0.3262, pruned_loss=0.08416, over 21534.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3088, pruned_loss=0.08611, over 4281053.49 frames. ], batch size: 471, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:49:46,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1081038.0, ans=0.125 2023-06-21 22:50:05,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.051e+02 2.845e+02 3.268e+02 3.924e+02 7.341e+02, threshold=6.535e+02, percent-clipped=1.0 2023-06-21 22:51:18,680 INFO [train.py:996] (0/4) Epoch 6, batch 27750, loss[loss=0.2105, simple_loss=0.2676, pruned_loss=0.07666, over 20190.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3129, pruned_loss=0.08641, over 4276064.78 frames. ], batch size: 703, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:51:42,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1081398.0, ans=0.125 2023-06-21 22:51:51,558 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:52:11,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1081518.0, ans=0.2 2023-06-21 22:52:51,596 INFO [train.py:996] (0/4) Epoch 6, batch 27800, loss[loss=0.2368, simple_loss=0.3035, pruned_loss=0.08507, over 21872.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3138, pruned_loss=0.08705, over 4287425.72 frames. 
], batch size: 371, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:53:12,157 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.907e+02 3.249e+02 3.877e+02 6.679e+02, threshold=6.497e+02, percent-clipped=1.0 2023-06-21 22:54:12,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1081878.0, ans=0.015 2023-06-21 22:54:25,872 INFO [train.py:996] (0/4) Epoch 6, batch 27850, loss[loss=0.2552, simple_loss=0.3421, pruned_loss=0.08412, over 21716.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3145, pruned_loss=0.08752, over 4288793.25 frames. ], batch size: 389, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:54:30,651 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:54:41,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1081998.0, ans=0.2 2023-06-21 22:55:13,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1082058.0, ans=0.125 2023-06-21 22:56:01,490 INFO [train.py:996] (0/4) Epoch 6, batch 27900, loss[loss=0.2134, simple_loss=0.2978, pruned_loss=0.06446, over 21137.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3226, pruned_loss=0.08874, over 4293348.25 frames. ], batch size: 143, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:56:27,611 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.965e+02 3.401e+02 4.272e+02 8.717e+02, threshold=6.802e+02, percent-clipped=4.0 2023-06-21 22:56:37,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1082298.0, ans=0.1 2023-06-21 22:56:56,742 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=22.5 2023-06-21 22:57:11,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-21 22:57:19,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1082418.0, ans=0.2 2023-06-21 22:57:30,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1082478.0, ans=0.2 2023-06-21 22:57:42,226 INFO [train.py:996] (0/4) Epoch 6, batch 27950, loss[loss=0.2696, simple_loss=0.3429, pruned_loss=0.09817, over 21457.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3225, pruned_loss=0.08571, over 4291337.78 frames. ], batch size: 131, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:58:31,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1082658.0, ans=0.125 2023-06-21 22:58:59,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1082778.0, ans=0.125 2023-06-21 22:59:15,437 INFO [train.py:996] (0/4) Epoch 6, batch 28000, loss[loss=0.2569, simple_loss=0.3258, pruned_loss=0.09399, over 21777.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3206, pruned_loss=0.0835, over 4292216.87 frames. 
], batch size: 112, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:59:42,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1082898.0, ans=0.0 2023-06-21 22:59:43,155 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.927e+02 3.364e+02 4.265e+02 7.771e+02, threshold=6.727e+02, percent-clipped=2.0 2023-06-21 22:59:43,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1082898.0, ans=0.125 2023-06-21 23:00:50,699 INFO [train.py:996] (0/4) Epoch 6, batch 28050, loss[loss=0.2241, simple_loss=0.2762, pruned_loss=0.08599, over 21270.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3177, pruned_loss=0.08429, over 4291266.93 frames. ], batch size: 176, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:01:00,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1083138.0, ans=0.125 2023-06-21 23:01:21,726 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:01:51,461 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=12.0 2023-06-21 23:02:11,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-21 23:02:29,346 INFO [train.py:996] (0/4) Epoch 6, batch 28100, loss[loss=0.2065, simple_loss=0.2649, pruned_loss=0.07409, over 21500.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3165, pruned_loss=0.0842, over 4291544.34 frames. ], batch size: 230, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:02:55,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1083498.0, ans=0.125 2023-06-21 23:03:00,995 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.119e+02 3.918e+02 4.692e+02 8.833e+02, threshold=7.836e+02, percent-clipped=5.0 2023-06-21 23:03:02,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1083498.0, ans=0.035 2023-06-21 23:03:03,613 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-21 23:03:49,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1083678.0, ans=0.125 2023-06-21 23:03:57,412 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.50 vs. limit=12.0 2023-06-21 23:04:02,335 INFO [train.py:996] (0/4) Epoch 6, batch 28150, loss[loss=0.2902, simple_loss=0.3182, pruned_loss=0.1311, over 21491.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.31, pruned_loss=0.08381, over 4275950.38 frames. 
], batch size: 511, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:04:07,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1083738.0, ans=0.0 2023-06-21 23:04:41,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1083858.0, ans=0.2 2023-06-21 23:04:51,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1083858.0, ans=0.0 2023-06-21 23:05:04,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1083918.0, ans=0.125 2023-06-21 23:05:32,941 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:05:39,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1084038.0, ans=0.0 2023-06-21 23:05:40,576 INFO [train.py:996] (0/4) Epoch 6, batch 28200, loss[loss=0.2805, simple_loss=0.3398, pruned_loss=0.1106, over 21284.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3094, pruned_loss=0.08532, over 4262414.78 frames. ], batch size: 143, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:06:07,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 3.112e+02 3.798e+02 4.464e+02 8.953e+02, threshold=7.596e+02, percent-clipped=1.0 2023-06-21 23:06:09,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1084098.0, ans=0.0 2023-06-21 23:06:40,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=12.0 2023-06-21 23:06:41,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1084218.0, ans=0.0 2023-06-21 23:06:50,716 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-21 23:06:54,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1084278.0, ans=0.125 2023-06-21 23:07:14,390 INFO [train.py:996] (0/4) Epoch 6, batch 28250, loss[loss=0.2888, simple_loss=0.3503, pruned_loss=0.1137, over 21302.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3121, pruned_loss=0.08816, over 4270548.18 frames. 
], batch size: 159, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:07:36,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1084398.0, ans=0.125 2023-06-21 23:07:40,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1084398.0, ans=0.0 2023-06-21 23:07:51,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1084458.0, ans=0.2 2023-06-21 23:08:00,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1084458.0, ans=0.125 2023-06-21 23:08:04,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1084518.0, ans=0.125 2023-06-21 23:08:24,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1084578.0, ans=0.125 2023-06-21 23:08:36,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1084578.0, ans=10.0 2023-06-21 23:08:54,453 INFO [train.py:996] (0/4) Epoch 6, batch 28300, loss[loss=0.1917, simple_loss=0.2883, pruned_loss=0.04754, over 21697.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3102, pruned_loss=0.0863, over 4271961.16 frames. ], batch size: 298, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:09:01,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1084638.0, ans=0.125 2023-06-21 23:09:17,328 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.819e+02 3.236e+02 3.708e+02 8.201e+02, threshold=6.472e+02, percent-clipped=2.0 2023-06-21 23:09:32,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1084758.0, ans=0.0 2023-06-21 23:10:23,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-21 23:10:25,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1084878.0, ans=0.125 2023-06-21 23:10:28,259 INFO [train.py:996] (0/4) Epoch 6, batch 28350, loss[loss=0.2043, simple_loss=0.3159, pruned_loss=0.04636, over 20802.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.307, pruned_loss=0.08105, over 4271632.29 frames. 
], batch size: 608, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:10:34,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1084938.0, ans=0.125 2023-06-21 23:10:46,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1084998.0, ans=0.0 2023-06-21 23:10:49,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1084998.0, ans=0.1 2023-06-21 23:10:51,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1084998.0, ans=0.125 2023-06-21 23:11:40,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1085118.0, ans=0.5 2023-06-21 23:12:03,503 INFO [train.py:996] (0/4) Epoch 6, batch 28400, loss[loss=0.2292, simple_loss=0.2829, pruned_loss=0.08779, over 21552.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.304, pruned_loss=0.08026, over 4255350.77 frames. ], batch size: 263, lr: 4.91e-03, grad_scale: 32.0 2023-06-21 23:12:03,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1085238.0, ans=0.09899494936611666 2023-06-21 23:12:19,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1085298.0, ans=0.1 2023-06-21 23:12:26,112 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.685e+02 3.251e+02 3.858e+02 5.974e+02, threshold=6.502e+02, percent-clipped=0.0 2023-06-21 23:12:27,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=12.0 2023-06-21 23:12:44,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1085358.0, ans=0.125 2023-06-21 23:13:03,610 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.26 vs. limit=15.0 2023-06-21 23:13:10,221 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.36 vs. limit=22.5 2023-06-21 23:13:13,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1085418.0, ans=0.125 2023-06-21 23:13:37,451 INFO [train.py:996] (0/4) Epoch 6, batch 28450, loss[loss=0.2543, simple_loss=0.3218, pruned_loss=0.09337, over 21870.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3086, pruned_loss=0.08371, over 4256366.88 frames. ], batch size: 371, lr: 4.91e-03, grad_scale: 32.0 2023-06-21 23:13:55,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1085598.0, ans=0.0 2023-06-21 23:14:00,681 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.48 vs. 
limit=22.5 2023-06-21 23:14:53,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1085718.0, ans=0.125 2023-06-21 23:15:10,661 INFO [train.py:996] (0/4) Epoch 6, batch 28500, loss[loss=0.3, simple_loss=0.3623, pruned_loss=0.1188, over 21254.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.311, pruned_loss=0.08622, over 4260706.78 frames. ], batch size: 143, lr: 4.91e-03, grad_scale: 32.0 2023-06-21 23:15:20,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1085838.0, ans=0.0 2023-06-21 23:15:29,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1085898.0, ans=0.125 2023-06-21 23:15:38,221 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.422e+02 3.148e+02 3.464e+02 4.022e+02 7.400e+02, threshold=6.927e+02, percent-clipped=1.0 2023-06-21 23:16:41,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1086078.0, ans=0.125 2023-06-21 23:16:45,747 INFO [train.py:996] (0/4) Epoch 6, batch 28550, loss[loss=0.3381, simple_loss=0.4268, pruned_loss=0.1247, over 21651.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3205, pruned_loss=0.08995, over 4270167.25 frames. ], batch size: 414, lr: 4.91e-03, grad_scale: 32.0 2023-06-21 23:17:05,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1086198.0, ans=0.125 2023-06-21 23:18:12,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1086378.0, ans=0.1 2023-06-21 23:18:20,988 INFO [train.py:996] (0/4) Epoch 6, batch 28600, loss[loss=0.2242, simple_loss=0.3068, pruned_loss=0.07082, over 21414.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3266, pruned_loss=0.09229, over 4272608.17 frames. ], batch size: 131, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:18:48,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1086498.0, ans=0.2 2023-06-21 23:18:50,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1086498.0, ans=0.125 2023-06-21 23:18:58,678 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.626e+02 3.164e+02 3.571e+02 4.573e+02 8.343e+02, threshold=7.141e+02, percent-clipped=3.0 2023-06-21 23:19:12,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1086558.0, ans=0.0 2023-06-21 23:19:14,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1086558.0, ans=0.125 2023-06-21 23:19:27,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-21 23:19:36,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1086618.0, ans=0.125 2023-06-21 23:19:59,200 INFO [train.py:996] (0/4) Epoch 6, batch 28650, loss[loss=0.2524, simple_loss=0.3026, pruned_loss=0.1011, over 21526.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3202, pruned_loss=0.09151, over 4271789.89 frames. 
], batch size: 263, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:20:09,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1086738.0, ans=0.2 2023-06-21 23:20:24,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1086798.0, ans=0.0 2023-06-21 23:20:24,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1086798.0, ans=0.2 2023-06-21 23:20:27,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1086798.0, ans=0.0 2023-06-21 23:20:39,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1086798.0, ans=0.125 2023-06-21 23:20:41,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1086798.0, ans=0.0 2023-06-21 23:21:14,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1086978.0, ans=0.1 2023-06-21 23:21:28,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1086978.0, ans=0.125 2023-06-21 23:21:38,630 INFO [train.py:996] (0/4) Epoch 6, batch 28700, loss[loss=0.2668, simple_loss=0.341, pruned_loss=0.09631, over 21607.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3191, pruned_loss=0.09216, over 4265101.70 frames. ], batch size: 389, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:21:59,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1087098.0, ans=0.0 2023-06-21 23:22:07,885 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.373e+02 3.119e+02 3.496e+02 4.060e+02 9.079e+02, threshold=6.992e+02, percent-clipped=1.0 2023-06-21 23:22:35,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1087218.0, ans=0.2 2023-06-21 23:23:09,096 INFO [train.py:996] (0/4) Epoch 6, batch 28750, loss[loss=0.2274, simple_loss=0.2964, pruned_loss=0.07926, over 20707.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3197, pruned_loss=0.09222, over 4267584.67 frames. ], batch size: 607, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:24:38,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1087578.0, ans=0.125 2023-06-21 23:24:43,719 INFO [train.py:996] (0/4) Epoch 6, batch 28800, loss[loss=0.3076, simple_loss=0.3625, pruned_loss=0.1263, over 21763.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3239, pruned_loss=0.0933, over 4271007.89 frames. 
], batch size: 298, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:25:05,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1087638.0, ans=0.0 2023-06-21 23:25:14,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1087698.0, ans=0.04949747468305833 2023-06-21 23:25:16,857 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.957e+02 3.291e+02 3.824e+02 6.486e+02, threshold=6.582e+02, percent-clipped=0.0 2023-06-21 23:25:21,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1087758.0, ans=0.125 2023-06-21 23:25:33,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1087758.0, ans=0.1 2023-06-21 23:25:55,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-21 23:25:57,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1087818.0, ans=0.125 2023-06-21 23:26:21,871 INFO [train.py:996] (0/4) Epoch 6, batch 28850, loss[loss=0.2414, simple_loss=0.3066, pruned_loss=0.0881, over 21809.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3253, pruned_loss=0.09501, over 4279953.67 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:27:14,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1088058.0, ans=0.1 2023-06-21 23:27:50,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1088178.0, ans=0.125 2023-06-21 23:28:01,563 INFO [train.py:996] (0/4) Epoch 6, batch 28900, loss[loss=0.2497, simple_loss=0.3142, pruned_loss=0.09261, over 21690.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3286, pruned_loss=0.09675, over 4285201.31 frames. ], batch size: 230, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:28:26,426 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.480e+02 3.260e+02 3.561e+02 4.329e+02 7.781e+02, threshold=7.122e+02, percent-clipped=1.0 2023-06-21 23:29:00,406 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-06-21 23:29:06,196 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:29:38,523 INFO [train.py:996] (0/4) Epoch 6, batch 28950, loss[loss=0.1875, simple_loss=0.2496, pruned_loss=0.06271, over 21289.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3286, pruned_loss=0.09592, over 4275307.52 frames. ], batch size: 159, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:30:14,134 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.06 vs. 
limit=15.0 2023-06-21 23:30:19,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1088658.0, ans=15.0 2023-06-21 23:30:50,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1088718.0, ans=0.95 2023-06-21 23:30:57,174 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.93 vs. limit=10.0 2023-06-21 23:31:09,138 INFO [train.py:996] (0/4) Epoch 6, batch 29000, loss[loss=0.2856, simple_loss=0.3577, pruned_loss=0.1068, over 21800.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3309, pruned_loss=0.09389, over 4272218.90 frames. ], batch size: 124, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:31:16,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1088838.0, ans=0.0 2023-06-21 23:31:28,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1088898.0, ans=0.0 2023-06-21 23:31:43,616 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.241e+02 3.719e+02 4.877e+02 7.775e+02, threshold=7.438e+02, percent-clipped=3.0 2023-06-21 23:31:54,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1088958.0, ans=0.125 2023-06-21 23:32:42,132 INFO [train.py:996] (0/4) Epoch 6, batch 29050, loss[loss=0.2225, simple_loss=0.2896, pruned_loss=0.07765, over 21871.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3298, pruned_loss=0.09354, over 4273033.80 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:32:48,295 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:34:15,100 INFO [train.py:996] (0/4) Epoch 6, batch 29100, loss[loss=0.205, simple_loss=0.2693, pruned_loss=0.07037, over 21605.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.321, pruned_loss=0.09124, over 4265763.03 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:34:49,735 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 2.936e+02 3.266e+02 3.955e+02 6.605e+02, threshold=6.533e+02, percent-clipped=0.0 2023-06-21 23:34:50,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1089498.0, ans=0.125 2023-06-21 23:34:50,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0 2023-06-21 23:35:15,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.00 vs. 
limit=15.0 2023-06-21 23:35:23,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1089618.0, ans=0.07 2023-06-21 23:35:42,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1089678.0, ans=0.0 2023-06-21 23:35:45,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1089678.0, ans=0.2 2023-06-21 23:35:46,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1089738.0, ans=0.125 2023-06-21 23:35:48,098 INFO [train.py:996] (0/4) Epoch 6, batch 29150, loss[loss=0.2487, simple_loss=0.3325, pruned_loss=0.08248, over 21332.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.32, pruned_loss=0.09013, over 4261176.39 frames. ], batch size: 176, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:36:03,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1089738.0, ans=0.0 2023-06-21 23:37:12,347 INFO [train.py:996] (0/4) Epoch 6, batch 29200, loss[loss=0.2241, simple_loss=0.2811, pruned_loss=0.08353, over 21233.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3163, pruned_loss=0.08954, over 4265529.82 frames. ], batch size: 144, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:37:47,289 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 2.885e+02 3.375e+02 4.210e+02 7.193e+02, threshold=6.750e+02, percent-clipped=2.0 2023-06-21 23:38:20,094 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-06-21 23:38:38,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1090278.0, ans=0.035 2023-06-21 23:38:55,719 INFO [train.py:996] (0/4) Epoch 6, batch 29250, loss[loss=0.2335, simple_loss=0.3252, pruned_loss=0.0709, over 21837.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3125, pruned_loss=0.08597, over 4259425.44 frames. ], batch size: 317, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:39:55,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1090518.0, ans=0.1 2023-06-21 23:40:29,857 INFO [train.py:996] (0/4) Epoch 6, batch 29300, loss[loss=0.2299, simple_loss=0.3138, pruned_loss=0.07302, over 21692.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3152, pruned_loss=0.08556, over 4266867.75 frames. 
], batch size: 351, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:40:44,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1090698.0, ans=0.0 2023-06-21 23:40:58,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1090698.0, ans=0.125 2023-06-21 23:41:00,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.989e+02 3.718e+02 4.652e+02 8.892e+02, threshold=7.436e+02, percent-clipped=6.0 2023-06-21 23:41:26,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1090818.0, ans=0.2 2023-06-21 23:41:42,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1090878.0, ans=0.125 2023-06-21 23:42:00,159 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-21 23:42:00,610 INFO [train.py:996] (0/4) Epoch 6, batch 29350, loss[loss=0.2229, simple_loss=0.2852, pruned_loss=0.08033, over 21721.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3093, pruned_loss=0.0845, over 4254832.50 frames. ], batch size: 351, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:42:17,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1090998.0, ans=0.0 2023-06-21 23:42:43,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1091058.0, ans=0.125 2023-06-21 23:42:54,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1091058.0, ans=0.0 2023-06-21 23:43:00,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1091118.0, ans=0.0 2023-06-21 23:43:03,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.64 vs. limit=15.0 2023-06-21 23:43:04,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1091118.0, ans=0.0 2023-06-21 23:43:14,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1091118.0, ans=0.1 2023-06-21 23:43:32,555 INFO [train.py:996] (0/4) Epoch 6, batch 29400, loss[loss=0.2128, simple_loss=0.2858, pruned_loss=0.06993, over 21582.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3102, pruned_loss=0.08276, over 4257267.63 frames. 
], batch size: 263, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:44:03,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 2.776e+02 3.211e+02 3.938e+02 7.454e+02, threshold=6.422e+02, percent-clipped=1.0 2023-06-21 23:44:05,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1091298.0, ans=0.2 2023-06-21 23:44:12,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1091358.0, ans=0.0 2023-06-21 23:44:33,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1091418.0, ans=0.1 2023-06-21 23:44:44,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1091478.0, ans=0.1 2023-06-21 23:44:58,845 INFO [train.py:996] (0/4) Epoch 6, batch 29450, loss[loss=0.2469, simple_loss=0.3227, pruned_loss=0.08555, over 20722.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3091, pruned_loss=0.08283, over 4259627.77 frames. ], batch size: 609, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:45:41,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1091658.0, ans=0.125 2023-06-21 23:46:16,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1091778.0, ans=0.0 2023-06-21 23:46:26,837 INFO [train.py:996] (0/4) Epoch 6, batch 29500, loss[loss=0.2743, simple_loss=0.3294, pruned_loss=0.1096, over 21813.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3143, pruned_loss=0.08618, over 4262854.59 frames. ], batch size: 441, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:46:28,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1091838.0, ans=0.1 2023-06-21 23:46:44,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1091838.0, ans=0.1 2023-06-21 23:47:01,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.458e+02 2.999e+02 3.395e+02 3.971e+02 6.244e+02, threshold=6.790e+02, percent-clipped=0.0 2023-06-21 23:47:02,765 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2023-06-21 23:47:14,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1091958.0, ans=0.0 2023-06-21 23:47:15,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2023-06-21 23:47:17,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1091958.0, ans=0.1 2023-06-21 23:47:58,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1092078.0, ans=10.0 2023-06-21 23:48:05,517 INFO [train.py:996] (0/4) Epoch 6, batch 29550, loss[loss=0.2186, simple_loss=0.2827, pruned_loss=0.07721, over 21618.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3144, pruned_loss=0.08852, over 4276694.27 frames. 
], batch size: 212, lr: 4.89e-03, grad_scale: 32.0 2023-06-21 23:48:32,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1092198.0, ans=0.125 2023-06-21 23:49:22,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1092378.0, ans=0.1 2023-06-21 23:49:43,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1092438.0, ans=0.2 2023-06-21 23:49:44,845 INFO [train.py:996] (0/4) Epoch 6, batch 29600, loss[loss=0.244, simple_loss=0.2962, pruned_loss=0.09589, over 20361.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3205, pruned_loss=0.09089, over 4285136.81 frames. ], batch size: 703, lr: 4.89e-03, grad_scale: 32.0 2023-06-21 23:50:11,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.056e+02 3.521e+02 4.458e+02 7.696e+02, threshold=7.042e+02, percent-clipped=3.0 2023-06-21 23:51:17,488 INFO [train.py:996] (0/4) Epoch 6, batch 29650, loss[loss=0.2212, simple_loss=0.2821, pruned_loss=0.08013, over 21188.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3173, pruned_loss=0.08694, over 4279619.23 frames. ], batch size: 159, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:51:31,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1092798.0, ans=0.0 2023-06-21 23:51:41,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1092798.0, ans=0.125 2023-06-21 23:51:43,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1092798.0, ans=0.1 2023-06-21 23:51:44,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1092798.0, ans=0.125 2023-06-21 23:51:53,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1092858.0, ans=0.125 2023-06-21 23:52:10,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1092918.0, ans=10.0 2023-06-21 23:52:50,745 INFO [train.py:996] (0/4) Epoch 6, batch 29700, loss[loss=0.3573, simple_loss=0.4396, pruned_loss=0.1375, over 21530.00 frames. ], tot_loss[loss=0.249, simple_loss=0.322, pruned_loss=0.08805, over 4280785.82 frames. 
], batch size: 471, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:52:51,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1093038.0, ans=0.125 2023-06-21 23:53:03,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1093038.0, ans=0.125 2023-06-21 23:53:07,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1093098.0, ans=0.0 2023-06-21 23:53:09,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1093098.0, ans=0.125 2023-06-21 23:53:19,204 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.737e+02 3.024e+02 3.717e+02 5.941e+02, threshold=6.048e+02, percent-clipped=0.0 2023-06-21 23:53:54,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1093218.0, ans=0.125 2023-06-21 23:54:24,078 INFO [train.py:996] (0/4) Epoch 6, batch 29750, loss[loss=0.304, simple_loss=0.3813, pruned_loss=0.1134, over 21537.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3286, pruned_loss=0.08869, over 4282142.63 frames. ], batch size: 507, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:55:20,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-21 23:55:21,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1093518.0, ans=0.015 2023-06-21 23:55:27,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1093518.0, ans=0.125 2023-06-21 23:55:56,896 INFO [train.py:996] (0/4) Epoch 6, batch 29800, loss[loss=0.2509, simple_loss=0.3134, pruned_loss=0.09418, over 21574.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3284, pruned_loss=0.08895, over 4287536.58 frames. ], batch size: 548, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:56:24,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1093698.0, ans=0.0 2023-06-21 23:56:25,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.253e+02 2.711e+02 3.022e+02 3.723e+02 5.120e+02, threshold=6.044e+02, percent-clipped=0.0 2023-06-21 23:57:12,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1093878.0, ans=0.125 2023-06-21 23:57:24,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1093878.0, ans=0.0 2023-06-21 23:57:29,970 INFO [train.py:996] (0/4) Epoch 6, batch 29850, loss[loss=0.2241, simple_loss=0.304, pruned_loss=0.07206, over 21745.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3228, pruned_loss=0.08581, over 4284865.94 frames. 
], batch size: 414, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:57:36,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1093938.0, ans=0.125 2023-06-21 23:57:37,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1093938.0, ans=0.0 2023-06-21 23:57:45,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1093998.0, ans=0.0 2023-06-21 23:57:46,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1093998.0, ans=0.1 2023-06-21 23:57:50,341 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=15.0 2023-06-21 23:59:02,897 INFO [train.py:996] (0/4) Epoch 6, batch 29900, loss[loss=0.2584, simple_loss=0.3282, pruned_loss=0.09436, over 21880.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3223, pruned_loss=0.08734, over 4284252.34 frames. ], batch size: 371, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:59:04,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1094238.0, ans=0.125 2023-06-21 23:59:36,085 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 3.118e+02 4.055e+02 5.716e+02 1.068e+03, threshold=8.110e+02, percent-clipped=21.0 2023-06-21 23:59:36,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1094358.0, ans=0.0 2023-06-21 23:59:47,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1094358.0, ans=0.125 2023-06-22 00:00:22,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1094478.0, ans=0.1 2023-06-22 00:00:25,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1094478.0, ans=0.2 2023-06-22 00:00:37,290 INFO [train.py:996] (0/4) Epoch 6, batch 29950, loss[loss=0.2767, simple_loss=0.3427, pruned_loss=0.1054, over 21730.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3249, pruned_loss=0.09105, over 4285488.93 frames. ], batch size: 298, lr: 4.89e-03, grad_scale: 8.0 2023-06-22 00:00:46,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1094538.0, ans=0.125 2023-06-22 00:00:59,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1094598.0, ans=0.0 2023-06-22 00:01:05,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1094598.0, ans=0.0 2023-06-22 00:01:58,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1094778.0, ans=0.1 2023-06-22 00:02:01,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1094778.0, ans=0.0 2023-06-22 00:02:11,821 INFO [train.py:996] (0/4) Epoch 6, batch 30000, loss[loss=0.2, simple_loss=0.2947, pruned_loss=0.05271, over 21616.00 frames. 
], tot_loss[loss=0.2527, simple_loss=0.325, pruned_loss=0.09027, over 4281334.93 frames. ], batch size: 230, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:02:11,822 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 00:02:27,460 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.6301, 1.8835, 4.0364, 2.6314], device='cuda:0') 2023-06-22 00:02:30,092 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2467, simple_loss=0.3478, pruned_loss=0.07276, over 1796401.00 frames. 2023-06-22 00:02:30,093 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-22 00:02:53,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1094898.0, ans=0.2 2023-06-22 00:03:10,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1094898.0, ans=0.0 2023-06-22 00:03:14,301 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.660e+02 3.036e+02 3.460e+02 6.733e+02, threshold=6.073e+02, percent-clipped=0.0 2023-06-22 00:03:19,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1094958.0, ans=0.125 2023-06-22 00:03:37,503 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-22 00:04:11,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1095138.0, ans=0.2 2023-06-22 00:04:17,403 INFO [train.py:996] (0/4) Epoch 6, batch 30050, loss[loss=0.3505, simple_loss=0.4437, pruned_loss=0.1287, over 21622.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3295, pruned_loss=0.08801, over 4273703.41 frames. ], batch size: 441, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:05:46,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1095378.0, ans=0.1 2023-06-22 00:05:50,551 INFO [train.py:996] (0/4) Epoch 6, batch 30100, loss[loss=0.2272, simple_loss=0.2863, pruned_loss=0.08408, over 21732.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3272, pruned_loss=0.08724, over 4271390.14 frames. ], batch size: 112, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:06:24,507 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.037e+02 3.799e+02 4.739e+02 8.498e+02, threshold=7.598e+02, percent-clipped=11.0 2023-06-22 00:07:01,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1095618.0, ans=0.2 2023-06-22 00:07:16,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1095678.0, ans=0.125 2023-06-22 00:07:21,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1095678.0, ans=0.125 2023-06-22 00:07:25,734 INFO [train.py:996] (0/4) Epoch 6, batch 30150, loss[loss=0.1987, simple_loss=0.2507, pruned_loss=0.07336, over 20757.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3234, pruned_loss=0.08857, over 4271348.90 frames. 
], batch size: 609, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:07:26,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1095738.0, ans=0.0 2023-06-22 00:07:58,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1095798.0, ans=0.125 2023-06-22 00:08:37,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1095918.0, ans=0.125 2023-06-22 00:08:43,672 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=15.0 2023-06-22 00:08:57,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1095978.0, ans=0.0 2023-06-22 00:09:02,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1095978.0, ans=0.1 2023-06-22 00:09:03,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1095978.0, ans=0.2 2023-06-22 00:09:07,443 INFO [train.py:996] (0/4) Epoch 6, batch 30200, loss[loss=0.2245, simple_loss=0.322, pruned_loss=0.06348, over 21706.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3246, pruned_loss=0.08785, over 4264299.47 frames. ], batch size: 298, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:09:08,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1096038.0, ans=0.125 2023-06-22 00:09:08,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1096038.0, ans=0.0 2023-06-22 00:09:08,517 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0 2023-06-22 00:09:46,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 3.001e+02 3.567e+02 4.107e+02 7.558e+02, threshold=7.134e+02, percent-clipped=0.0 2023-06-22 00:10:17,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1096218.0, ans=0.1 2023-06-22 00:10:17,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1096218.0, ans=0.0 2023-06-22 00:10:30,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1096278.0, ans=0.0 2023-06-22 00:10:43,116 INFO [train.py:996] (0/4) Epoch 6, batch 30250, loss[loss=0.2881, simple_loss=0.3903, pruned_loss=0.09291, over 21627.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3323, pruned_loss=0.09019, over 4267554.15 frames. ], batch size: 263, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:10:56,420 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.03 vs. 
limit=15.0 2023-06-22 00:11:29,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1096458.0, ans=0.125 2023-06-22 00:11:37,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1096458.0, ans=0.125 2023-06-22 00:11:41,359 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:12:17,405 INFO [train.py:996] (0/4) Epoch 6, batch 30300, loss[loss=0.2184, simple_loss=0.2742, pruned_loss=0.08126, over 21270.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3297, pruned_loss=0.08985, over 4271938.26 frames. ], batch size: 159, lr: 4.88e-03, grad_scale: 16.0 2023-06-22 00:12:23,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1096638.0, ans=0.125 2023-06-22 00:13:00,639 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.155e+02 3.767e+02 4.351e+02 8.059e+02, threshold=7.534e+02, percent-clipped=2.0 2023-06-22 00:13:06,826 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-22 00:13:57,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1096938.0, ans=0.125 2023-06-22 00:14:03,144 INFO [train.py:996] (0/4) Epoch 6, batch 30350, loss[loss=0.2316, simple_loss=0.2721, pruned_loss=0.09559, over 20238.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3288, pruned_loss=0.09, over 4262962.71 frames. ], batch size: 707, lr: 4.88e-03, grad_scale: 16.0 2023-06-22 00:14:04,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-22 00:14:12,733 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:14:32,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1096998.0, ans=0.0 2023-06-22 00:14:55,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1097118.0, ans=0.1 2023-06-22 00:15:05,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1097178.0, ans=0.0 2023-06-22 00:15:21,241 INFO [train.py:996] (0/4) Epoch 6, batch 30400, loss[loss=0.2356, simple_loss=0.2814, pruned_loss=0.09485, over 20250.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3231, pruned_loss=0.08817, over 4247734.34 frames. 
], batch size: 703, lr: 4.88e-03, grad_scale: 32.0 2023-06-22 00:15:21,703 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:15:28,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1097238.0, ans=0.125 2023-06-22 00:15:39,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1097298.0, ans=0.125 2023-06-22 00:15:50,258 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.413e+02 4.097e+02 5.278e+02 1.616e+03, threshold=8.194e+02, percent-clipped=3.0 2023-06-22 00:16:04,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1097418.0, ans=0.2 2023-06-22 00:16:13,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1097418.0, ans=0.0 2023-06-22 00:16:23,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1097478.0, ans=0.0 2023-06-22 00:16:39,271 INFO [train.py:996] (0/4) Epoch 6, batch 30450, loss[loss=0.3014, simple_loss=0.4174, pruned_loss=0.09273, over 19768.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3254, pruned_loss=0.08815, over 4191582.30 frames. ], batch size: 702, lr: 4.88e-03, grad_scale: 16.0 2023-06-22 00:16:40,094 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-22 00:16:46,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1097538.0, ans=0.125 2023-06-22 00:17:07,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1097658.0, ans=0.125 2023-06-22 00:17:17,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1097658.0, ans=0.125 2023-06-22 00:17:19,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1097658.0, ans=0.04949747468305833 2023-06-22 00:17:31,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1097718.0, ans=0.2 2023-06-22 00:17:43,581 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/epoch-6.pt 2023-06-22 00:19:20,722 INFO [train.py:996] (0/4) Epoch 7, batch 0, loss[loss=0.2674, simple_loss=0.3297, pruned_loss=0.1025, over 21935.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3297, pruned_loss=0.1025, over 21935.00 frames. ], batch size: 113, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:19:20,723 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 00:19:38,895 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2422, simple_loss=0.3486, pruned_loss=0.06787, over 1796401.00 frames. 
2023-06-22 00:19:38,896 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-22 00:19:45,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1097802.0, ans=0.09899494936611666 2023-06-22 00:20:20,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=22.5 2023-06-22 00:20:26,932 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 4.648e+02 5.934e+02 9.527e+02 2.892e+03, threshold=1.187e+03, percent-clipped=31.0 2023-06-22 00:21:07,556 INFO [train.py:996] (0/4) Epoch 7, batch 50, loss[loss=0.2821, simple_loss=0.3698, pruned_loss=0.09722, over 21259.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3235, pruned_loss=0.08725, over 967825.49 frames. ], batch size: 143, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:21:30,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=22.5 2023-06-22 00:21:41,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-22 00:21:49,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1098222.0, ans=0.125 2023-06-22 00:22:43,755 INFO [train.py:996] (0/4) Epoch 7, batch 100, loss[loss=0.3005, simple_loss=0.3734, pruned_loss=0.1138, over 21620.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3416, pruned_loss=0.08944, over 1707293.49 frames. ], batch size: 389, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:23:06,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1098462.0, ans=0.125 2023-06-22 00:23:10,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1098462.0, ans=0.125 2023-06-22 00:23:36,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1098522.0, ans=0.125 2023-06-22 00:23:37,737 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.827e+02 3.336e+02 3.937e+02 6.913e+02, threshold=6.673e+02, percent-clipped=0.0 2023-06-22 00:24:13,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0 2023-06-22 00:24:19,888 INFO [train.py:996] (0/4) Epoch 7, batch 150, loss[loss=0.2851, simple_loss=0.3744, pruned_loss=0.0979, over 21655.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3432, pruned_loss=0.09096, over 2278329.80 frames. ], batch size: 414, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:24:22,218 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.53 vs. 
limit=15.0 2023-06-22 00:24:34,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1098702.0, ans=0.035 2023-06-22 00:24:38,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1098762.0, ans=0.2 2023-06-22 00:25:17,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1098822.0, ans=0.0 2023-06-22 00:25:19,403 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-22 00:25:28,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-22 00:25:34,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-06-22 00:25:37,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1098942.0, ans=0.09899494936611666 2023-06-22 00:25:58,223 INFO [train.py:996] (0/4) Epoch 7, batch 200, loss[loss=0.2778, simple_loss=0.3713, pruned_loss=0.0921, over 21687.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3426, pruned_loss=0.09066, over 2715932.04 frames. ], batch size: 414, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:26:45,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1099122.0, ans=0.0 2023-06-22 00:26:56,362 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.932e+02 3.409e+02 3.929e+02 8.481e+02, threshold=6.818e+02, percent-clipped=3.0 2023-06-22 00:26:59,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-22 00:27:13,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1099182.0, ans=0.1 2023-06-22 00:27:24,865 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=15.0 2023-06-22 00:27:36,501 INFO [train.py:996] (0/4) Epoch 7, batch 250, loss[loss=0.2042, simple_loss=0.2964, pruned_loss=0.05607, over 21721.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3377, pruned_loss=0.08932, over 3057238.19 frames. ], batch size: 298, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:27:39,093 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.86 vs. limit=10.0 2023-06-22 00:28:01,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1099362.0, ans=0.125 2023-06-22 00:28:11,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1099362.0, ans=0.125 2023-06-22 00:28:57,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1099542.0, ans=0.125 2023-06-22 00:29:12,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. 
limit=6.0 2023-06-22 00:29:14,512 INFO [train.py:996] (0/4) Epoch 7, batch 300, loss[loss=0.2312, simple_loss=0.3001, pruned_loss=0.08115, over 21446.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3312, pruned_loss=0.08888, over 3328996.86 frames. ], batch size: 211, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:29:25,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1099602.0, ans=0.0 2023-06-22 00:29:36,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1099662.0, ans=0.2 2023-06-22 00:30:11,204 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 2.968e+02 3.407e+02 3.987e+02 5.179e+02, threshold=6.813e+02, percent-clipped=0.0 2023-06-22 00:30:52,931 INFO [train.py:996] (0/4) Epoch 7, batch 350, loss[loss=0.2677, simple_loss=0.307, pruned_loss=0.1142, over 21439.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3262, pruned_loss=0.08886, over 3541039.61 frames. ], batch size: 511, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:31:48,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1100022.0, ans=0.1 2023-06-22 00:31:50,240 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-22 00:31:57,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1100082.0, ans=0.0 2023-06-22 00:32:04,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1100082.0, ans=0.0 2023-06-22 00:32:21,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1100142.0, ans=0.125 2023-06-22 00:32:36,684 INFO [train.py:996] (0/4) Epoch 7, batch 400, loss[loss=0.2406, simple_loss=0.3092, pruned_loss=0.08601, over 21613.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3217, pruned_loss=0.08791, over 3708438.17 frames. ], batch size: 391, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:32:43,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1100202.0, ans=0.0 2023-06-22 00:33:20,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1100322.0, ans=0.125 2023-06-22 00:33:22,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1100322.0, ans=0.125 2023-06-22 00:33:23,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1100322.0, ans=0.125 2023-06-22 00:33:29,210 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 3.211e+02 3.756e+02 4.853e+02 8.203e+02, threshold=7.513e+02, percent-clipped=4.0 2023-06-22 00:33:44,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1100382.0, ans=0.125 2023-06-22 00:34:15,919 INFO [train.py:996] (0/4) Epoch 7, batch 450, loss[loss=0.2542, simple_loss=0.3173, pruned_loss=0.09552, over 21964.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3173, pruned_loss=0.08547, over 3830982.10 frames. 
], batch size: 316, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:34:16,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1100502.0, ans=0.0 2023-06-22 00:34:21,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-22 00:34:30,859 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.36 vs. limit=12.0 2023-06-22 00:34:53,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-22 00:34:55,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1100562.0, ans=0.0 2023-06-22 00:35:34,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1100742.0, ans=0.125 2023-06-22 00:35:53,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1100742.0, ans=0.5 2023-06-22 00:36:00,511 INFO [train.py:996] (0/4) Epoch 7, batch 500, loss[loss=0.2412, simple_loss=0.32, pruned_loss=0.08117, over 21323.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3165, pruned_loss=0.0852, over 3931048.27 frames. ], batch size: 176, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:36:50,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.984e+02 3.762e+02 4.525e+02 7.787e+02, threshold=7.525e+02, percent-clipped=1.0 2023-06-22 00:37:09,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1100982.0, ans=0.125 2023-06-22 00:37:42,369 INFO [train.py:996] (0/4) Epoch 7, batch 550, loss[loss=0.3347, simple_loss=0.407, pruned_loss=0.1312, over 21742.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3173, pruned_loss=0.08443, over 4012907.80 frames. ], batch size: 414, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:38:13,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1101162.0, ans=0.2 2023-06-22 00:38:18,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=22.5 2023-06-22 00:38:24,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1101222.0, ans=0.125 2023-06-22 00:38:29,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1101222.0, ans=0.125 2023-06-22 00:38:50,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1101282.0, ans=22.5 2023-06-22 00:39:14,162 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-22 00:39:20,773 INFO [train.py:996] (0/4) Epoch 7, batch 600, loss[loss=0.2325, simple_loss=0.3241, pruned_loss=0.07051, over 21557.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3199, pruned_loss=0.08428, over 4071040.69 frames. 
], batch size: 230, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:39:47,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1101462.0, ans=0.125 2023-06-22 00:40:04,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1101522.0, ans=0.125 2023-06-22 00:40:08,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1101522.0, ans=0.0 2023-06-22 00:40:10,558 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.944e+02 3.479e+02 4.173e+02 5.834e+02, threshold=6.959e+02, percent-clipped=0.0 2023-06-22 00:40:38,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1101642.0, ans=0.1 2023-06-22 00:40:52,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1101642.0, ans=0.0 2023-06-22 00:40:52,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1101642.0, ans=0.2 2023-06-22 00:40:58,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1101702.0, ans=0.1 2023-06-22 00:41:00,129 INFO [train.py:996] (0/4) Epoch 7, batch 650, loss[loss=0.2584, simple_loss=0.3416, pruned_loss=0.08761, over 21444.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3227, pruned_loss=0.08521, over 4109526.26 frames. ], batch size: 211, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:41:17,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1101762.0, ans=0.0 2023-06-22 00:41:40,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1101822.0, ans=0.0 2023-06-22 00:42:35,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1101942.0, ans=0.125 2023-06-22 00:42:38,127 INFO [train.py:996] (0/4) Epoch 7, batch 700, loss[loss=0.2402, simple_loss=0.2924, pruned_loss=0.09403, over 21582.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3237, pruned_loss=0.08635, over 4156626.19 frames. ], batch size: 247, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:42:42,481 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.37 vs. limit=12.0 2023-06-22 00:42:48,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1102002.0, ans=0.125 2023-06-22 00:43:07,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.15 vs. limit=10.0 2023-06-22 00:43:27,981 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.306e+02 4.325e+02 5.540e+02 9.236e+02, threshold=8.651e+02, percent-clipped=10.0 2023-06-22 00:44:01,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1102242.0, ans=0.0 2023-06-22 00:44:16,071 INFO [train.py:996] (0/4) Epoch 7, batch 750, loss[loss=0.2662, simple_loss=0.3379, pruned_loss=0.09723, over 21730.00 frames. 
], tot_loss[loss=0.2496, simple_loss=0.3242, pruned_loss=0.08748, over 4190439.27 frames. ], batch size: 351, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:44:49,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1102362.0, ans=0.2 2023-06-22 00:45:53,761 INFO [train.py:996] (0/4) Epoch 7, batch 800, loss[loss=0.2363, simple_loss=0.2929, pruned_loss=0.08984, over 21712.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3219, pruned_loss=0.08696, over 4211840.82 frames. ], batch size: 282, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:46:21,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1102662.0, ans=0.09899494936611666 2023-06-22 00:46:42,861 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 3.353e+02 3.931e+02 5.238e+02 1.056e+03, threshold=7.862e+02, percent-clipped=1.0 2023-06-22 00:47:31,032 INFO [train.py:996] (0/4) Epoch 7, batch 850, loss[loss=0.2476, simple_loss=0.3127, pruned_loss=0.09123, over 21279.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3194, pruned_loss=0.0863, over 4231541.57 frames. ], batch size: 143, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:47:32,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1102902.0, ans=0.0 2023-06-22 00:47:55,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1102962.0, ans=0.125 2023-06-22 00:48:11,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=15.0 2023-06-22 00:48:33,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1103082.0, ans=0.2 2023-06-22 00:49:00,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1103142.0, ans=0.2 2023-06-22 00:49:04,183 INFO [train.py:996] (0/4) Epoch 7, batch 900, loss[loss=0.2625, simple_loss=0.3403, pruned_loss=0.09233, over 20201.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3158, pruned_loss=0.08579, over 4248167.14 frames. ], batch size: 703, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:49:53,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.788e+02 3.273e+02 3.960e+02 6.263e+02, threshold=6.546e+02, percent-clipped=0.0 2023-06-22 00:50:24,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1103382.0, ans=0.125 2023-06-22 00:50:48,016 INFO [train.py:996] (0/4) Epoch 7, batch 950, loss[loss=0.2497, simple_loss=0.3172, pruned_loss=0.09112, over 21444.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3138, pruned_loss=0.08589, over 4263546.69 frames. ], batch size: 194, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:51:04,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1103562.0, ans=0.2 2023-06-22 00:51:12,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1103562.0, ans=0.0 2023-06-22 00:51:58,957 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. 
limit=15.0 2023-06-22 00:52:26,749 INFO [train.py:996] (0/4) Epoch 7, batch 1000, loss[loss=0.2167, simple_loss=0.308, pruned_loss=0.06273, over 21690.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3123, pruned_loss=0.08506, over 4273507.05 frames. ], batch size: 263, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:52:51,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.43 vs. limit=22.5 2023-06-22 00:52:57,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1103922.0, ans=0.125 2023-06-22 00:53:20,306 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.93 vs. limit=6.0 2023-06-22 00:53:25,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 2.971e+02 3.488e+02 4.258e+02 7.403e+02, threshold=6.977e+02, percent-clipped=1.0 2023-06-22 00:53:27,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1103982.0, ans=0.0 2023-06-22 00:53:32,143 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-184000.pt 2023-06-22 00:53:51,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104042.0, ans=0.1 2023-06-22 00:54:04,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104042.0, ans=0.1 2023-06-22 00:54:07,790 INFO [train.py:996] (0/4) Epoch 7, batch 1050, loss[loss=0.2556, simple_loss=0.3153, pruned_loss=0.09797, over 21812.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3096, pruned_loss=0.08418, over 4272148.37 frames. ], batch size: 351, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 00:54:13,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1104102.0, ans=0.125 2023-06-22 00:54:15,088 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-22 00:54:19,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1104102.0, ans=0.09899494936611666 2023-06-22 00:54:24,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1104162.0, ans=0.125 2023-06-22 00:54:28,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-22 00:54:31,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0 2023-06-22 00:55:47,366 INFO [train.py:996] (0/4) Epoch 7, batch 1100, loss[loss=0.2423, simple_loss=0.3189, pruned_loss=0.08284, over 21779.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3091, pruned_loss=0.08383, over 4277586.91 frames. 
], batch size: 414, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 00:56:32,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1104522.0, ans=0.125 2023-06-22 00:56:48,288 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.865e+02 3.379e+02 3.962e+02 8.205e+02, threshold=6.758e+02, percent-clipped=2.0 2023-06-22 00:56:50,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104582.0, ans=0.1 2023-06-22 00:57:01,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1104582.0, ans=0.2 2023-06-22 00:57:10,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1104642.0, ans=0.0 2023-06-22 00:57:12,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1104642.0, ans=0.125 2023-06-22 00:57:27,389 INFO [train.py:996] (0/4) Epoch 7, batch 1150, loss[loss=0.2639, simple_loss=0.3321, pruned_loss=0.09789, over 21478.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3103, pruned_loss=0.08405, over 4276846.78 frames. ], batch size: 548, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 00:57:32,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1104702.0, ans=0.125 2023-06-22 00:58:18,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1104822.0, ans=0.125 2023-06-22 00:58:25,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1104822.0, ans=0.125 2023-06-22 00:58:26,948 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:58:36,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104882.0, ans=0.1 2023-06-22 00:59:07,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1105002.0, ans=0.1 2023-06-22 00:59:08,535 INFO [train.py:996] (0/4) Epoch 7, batch 1200, loss[loss=0.2933, simple_loss=0.3535, pruned_loss=0.1165, over 21638.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3132, pruned_loss=0.08469, over 4279233.83 frames. ], batch size: 263, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:59:21,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1105002.0, ans=0.2 2023-06-22 00:59:23,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1105002.0, ans=0.0 2023-06-22 00:59:49,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5 2023-06-22 01:00:10,596 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.150e+02 3.657e+02 4.212e+02 7.667e+02, threshold=7.313e+02, percent-clipped=2.0 2023-06-22 01:00:48,912 INFO [train.py:996] (0/4) Epoch 7, batch 1250, loss[loss=0.2138, simple_loss=0.2919, pruned_loss=0.0679, over 21510.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.316, pruned_loss=0.08642, over 4278614.67 frames. 
], batch size: 131, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:01:18,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1105362.0, ans=0.2 2023-06-22 01:02:11,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1105542.0, ans=0.125 2023-06-22 01:02:24,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1105542.0, ans=0.09899494936611666 2023-06-22 01:02:28,469 INFO [train.py:996] (0/4) Epoch 7, batch 1300, loss[loss=0.3883, simple_loss=0.4489, pruned_loss=0.1638, over 21523.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3191, pruned_loss=0.08718, over 4280485.80 frames. ], batch size: 507, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:02:58,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1105662.0, ans=0.1 2023-06-22 01:03:28,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1105722.0, ans=0.2 2023-06-22 01:03:36,719 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.032e+02 3.671e+02 4.552e+02 8.321e+02, threshold=7.341e+02, percent-clipped=3.0 2023-06-22 01:03:37,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1105782.0, ans=0.0 2023-06-22 01:03:43,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1105782.0, ans=0.125 2023-06-22 01:04:13,381 INFO [train.py:996] (0/4) Epoch 7, batch 1350, loss[loss=0.2304, simple_loss=0.2959, pruned_loss=0.0824, over 21820.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3196, pruned_loss=0.0879, over 4286688.03 frames. ], batch size: 124, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:04:28,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1105902.0, ans=0.125 2023-06-22 01:05:20,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1106082.0, ans=0.125 2023-06-22 01:05:36,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.91 vs. limit=15.0 2023-06-22 01:05:51,575 INFO [train.py:996] (0/4) Epoch 7, batch 1400, loss[loss=0.2367, simple_loss=0.3151, pruned_loss=0.07909, over 21879.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3183, pruned_loss=0.08795, over 4290391.06 frames. ], batch size: 332, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:05:52,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1106202.0, ans=0.05 2023-06-22 01:06:37,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-22 01:06:54,471 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.431e+02 3.168e+02 3.476e+02 4.011e+02 7.450e+02, threshold=6.951e+02, percent-clipped=1.0 2023-06-22 01:07:26,224 INFO [train.py:996] (0/4) Epoch 7, batch 1450, loss[loss=0.2394, simple_loss=0.3045, pruned_loss=0.08713, over 21816.00 frames. 
], tot_loss[loss=0.2494, simple_loss=0.3202, pruned_loss=0.08927, over 4285963.74 frames. ], batch size: 107, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:08:01,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1106562.0, ans=0.125 2023-06-22 01:08:05,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1106562.0, ans=0.125 2023-06-22 01:08:09,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1106562.0, ans=0.2 2023-06-22 01:08:58,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1106742.0, ans=0.125 2023-06-22 01:09:11,613 INFO [train.py:996] (0/4) Epoch 7, batch 1500, loss[loss=0.3113, simple_loss=0.3803, pruned_loss=0.1212, over 21622.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3212, pruned_loss=0.09054, over 4289723.27 frames. ], batch size: 441, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:09:21,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1106802.0, ans=0.1 2023-06-22 01:09:34,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1106802.0, ans=0.0 2023-06-22 01:09:39,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1106862.0, ans=0.0 2023-06-22 01:10:15,561 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.335e+02 2.952e+02 3.357e+02 3.782e+02 8.287e+02, threshold=6.713e+02, percent-clipped=2.0 2023-06-22 01:11:03,404 INFO [train.py:996] (0/4) Epoch 7, batch 1550, loss[loss=0.1999, simple_loss=0.2695, pruned_loss=0.06516, over 20190.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3213, pruned_loss=0.09024, over 4286835.41 frames. ], batch size: 703, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:11:11,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1107102.0, ans=0.125 2023-06-22 01:11:12,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1107102.0, ans=0.125 2023-06-22 01:11:51,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-22 01:12:25,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=10.0 2023-06-22 01:12:32,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1107342.0, ans=0.125 2023-06-22 01:12:42,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1107402.0, ans=0.1 2023-06-22 01:12:43,981 INFO [train.py:996] (0/4) Epoch 7, batch 1600, loss[loss=0.297, simple_loss=0.355, pruned_loss=0.1195, over 21775.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3196, pruned_loss=0.08921, over 4288039.89 frames. 
], batch size: 441, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 01:13:10,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1107462.0, ans=0.125 2023-06-22 01:13:22,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1107522.0, ans=0.125 2023-06-22 01:13:39,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 3.079e+02 3.590e+02 4.621e+02 8.115e+02, threshold=7.180e+02, percent-clipped=4.0 2023-06-22 01:13:42,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=22.5 2023-06-22 01:14:21,882 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.30 vs. limit=15.0 2023-06-22 01:14:25,786 INFO [train.py:996] (0/4) Epoch 7, batch 1650, loss[loss=0.2887, simple_loss=0.3474, pruned_loss=0.115, over 21337.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3191, pruned_loss=0.088, over 4284930.17 frames. ], batch size: 131, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:14:39,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1107702.0, ans=0.125 2023-06-22 01:14:55,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=15.0 2023-06-22 01:16:07,487 INFO [train.py:996] (0/4) Epoch 7, batch 1700, loss[loss=0.2634, simple_loss=0.333, pruned_loss=0.09686, over 20210.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3223, pruned_loss=0.08904, over 4280910.70 frames. ], batch size: 702, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:17:13,171 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 3.011e+02 3.603e+02 4.344e+02 6.909e+02, threshold=7.205e+02, percent-clipped=0.0 2023-06-22 01:17:21,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1108182.0, ans=0.125 2023-06-22 01:17:25,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-06-22 01:17:27,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. limit=15.0 2023-06-22 01:17:54,164 INFO [train.py:996] (0/4) Epoch 7, batch 1750, loss[loss=0.2385, simple_loss=0.3705, pruned_loss=0.05326, over 19781.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3205, pruned_loss=0.08684, over 4280398.73 frames. 
], batch size: 702, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:17:56,414 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:17:58,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1108302.0, ans=0.2 2023-06-22 01:18:06,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1108302.0, ans=0.125 2023-06-22 01:18:09,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1108362.0, ans=0.0 2023-06-22 01:18:17,349 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.98 vs. limit=6.0 2023-06-22 01:19:32,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1108542.0, ans=0.125 2023-06-22 01:19:36,788 INFO [train.py:996] (0/4) Epoch 7, batch 1800, loss[loss=0.2664, simple_loss=0.3689, pruned_loss=0.08193, over 21658.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3184, pruned_loss=0.0848, over 4282146.18 frames. ], batch size: 414, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:20:06,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1108662.0, ans=0.0 2023-06-22 01:20:23,086 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:20:36,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.999e+02 3.807e+02 4.640e+02 8.092e+02, threshold=7.614e+02, percent-clipped=1.0 2023-06-22 01:20:49,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1108842.0, ans=0.125 2023-06-22 01:21:01,078 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-22 01:21:09,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1108842.0, ans=0.125 2023-06-22 01:21:12,529 INFO [train.py:996] (0/4) Epoch 7, batch 1850, loss[loss=0.2055, simple_loss=0.2858, pruned_loss=0.06264, over 21789.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.319, pruned_loss=0.08332, over 4276597.07 frames. 
], batch size: 247, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:21:28,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1108962.0, ans=0.04949747468305833 2023-06-22 01:21:35,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1108962.0, ans=0.125 2023-06-22 01:22:27,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1109082.0, ans=0.125 2023-06-22 01:22:29,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1109142.0, ans=0.0 2023-06-22 01:22:36,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1109142.0, ans=0.0 2023-06-22 01:22:50,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1109202.0, ans=0.125 2023-06-22 01:22:52,005 INFO [train.py:996] (0/4) Epoch 7, batch 1900, loss[loss=0.2687, simple_loss=0.3274, pruned_loss=0.1049, over 21989.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3214, pruned_loss=0.08337, over 4283119.31 frames. ], batch size: 103, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:23:14,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1109262.0, ans=0.1 2023-06-22 01:23:55,375 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 3.011e+02 3.315e+02 4.225e+02 7.544e+02, threshold=6.631e+02, percent-clipped=0.0 2023-06-22 01:24:31,576 INFO [train.py:996] (0/4) Epoch 7, batch 1950, loss[loss=0.271, simple_loss=0.3629, pruned_loss=0.08949, over 21673.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3175, pruned_loss=0.08356, over 4271212.81 frames. ], batch size: 441, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:24:33,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1109502.0, ans=0.0 2023-06-22 01:25:16,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1109622.0, ans=0.0 2023-06-22 01:25:16,661 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-22 01:25:32,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1109682.0, ans=0.125 2023-06-22 01:25:39,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1109682.0, ans=0.0 2023-06-22 01:26:04,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1109742.0, ans=0.0 2023-06-22 01:26:08,035 INFO [train.py:996] (0/4) Epoch 7, batch 2000, loss[loss=0.2468, simple_loss=0.3186, pruned_loss=0.08743, over 21865.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3102, pruned_loss=0.08105, over 4259659.57 frames. ], batch size: 351, lr: 4.46e-03, grad_scale: 32.0 2023-06-22 01:26:49,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.40 vs. 
limit=10.0 2023-06-22 01:26:59,551 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-22 01:27:08,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 3.011e+02 3.534e+02 4.204e+02 7.079e+02, threshold=7.069e+02, percent-clipped=1.0 2023-06-22 01:27:25,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1109982.0, ans=0.125 2023-06-22 01:27:26,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1110042.0, ans=0.1 2023-06-22 01:27:43,235 INFO [train.py:996] (0/4) Epoch 7, batch 2050, loss[loss=0.2546, simple_loss=0.3401, pruned_loss=0.08459, over 21664.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3116, pruned_loss=0.08081, over 4266489.47 frames. ], batch size: 441, lr: 4.46e-03, grad_scale: 32.0 2023-06-22 01:27:59,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1110102.0, ans=0.0 2023-06-22 01:28:19,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1110162.0, ans=0.0 2023-06-22 01:29:02,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=15.0 2023-06-22 01:29:28,354 INFO [train.py:996] (0/4) Epoch 7, batch 2100, loss[loss=0.2124, simple_loss=0.2751, pruned_loss=0.07484, over 21741.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3155, pruned_loss=0.08296, over 4271957.32 frames. ], batch size: 351, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:30:04,764 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.93 vs. limit=10.0 2023-06-22 01:30:25,335 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:30:34,423 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.474e+02 3.462e+02 4.025e+02 4.907e+02 9.309e+02, threshold=8.051e+02, percent-clipped=5.0 2023-06-22 01:31:08,533 INFO [train.py:996] (0/4) Epoch 7, batch 2150, loss[loss=0.2933, simple_loss=0.3681, pruned_loss=0.1093, over 21312.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3162, pruned_loss=0.08437, over 4273354.92 frames. ], batch size: 548, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:31:45,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.93 vs. limit=10.0 2023-06-22 01:32:48,381 INFO [train.py:996] (0/4) Epoch 7, batch 2200, loss[loss=0.2088, simple_loss=0.2981, pruned_loss=0.05977, over 21794.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3201, pruned_loss=0.08506, over 4279041.91 frames. ], batch size: 282, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:33:03,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1111062.0, ans=0.125 2023-06-22 01:33:27,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. 
limit=10.0 2023-06-22 01:33:34,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1111122.0, ans=0.125 2023-06-22 01:33:48,375 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.551e+02 3.115e+02 3.820e+02 5.117e+02 8.192e+02, threshold=7.640e+02, percent-clipped=1.0 2023-06-22 01:33:59,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1111182.0, ans=0.1 2023-06-22 01:34:03,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1111182.0, ans=0.125 2023-06-22 01:34:10,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1111242.0, ans=0.0 2023-06-22 01:34:27,183 INFO [train.py:996] (0/4) Epoch 7, batch 2250, loss[loss=0.2304, simple_loss=0.2908, pruned_loss=0.08501, over 21686.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3171, pruned_loss=0.08364, over 4277917.88 frames. ], batch size: 282, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:35:14,695 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-06-22 01:36:02,583 INFO [train.py:996] (0/4) Epoch 7, batch 2300, loss[loss=0.2124, simple_loss=0.2795, pruned_loss=0.07265, over 21833.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3145, pruned_loss=0.0837, over 4281053.76 frames. ], batch size: 107, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:36:06,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1111602.0, ans=0.0 2023-06-22 01:36:14,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1111602.0, ans=0.125 2023-06-22 01:36:18,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1111662.0, ans=0.1 2023-06-22 01:37:09,395 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.004e+02 3.529e+02 4.212e+02 9.324e+02, threshold=7.058e+02, percent-clipped=1.0 2023-06-22 01:37:42,539 INFO [train.py:996] (0/4) Epoch 7, batch 2350, loss[loss=0.2457, simple_loss=0.309, pruned_loss=0.09121, over 21874.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3139, pruned_loss=0.08431, over 4279044.91 frames. ], batch size: 98, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:37:55,655 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=6.002e-03 2023-06-22 01:38:13,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1111962.0, ans=0.125 2023-06-22 01:38:48,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1112082.0, ans=0.125 2023-06-22 01:39:10,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1112142.0, ans=0.2 2023-06-22 01:39:17,651 INFO [train.py:996] (0/4) Epoch 7, batch 2400, loss[loss=0.2568, simple_loss=0.331, pruned_loss=0.09125, over 21914.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3182, pruned_loss=0.08675, over 4278685.13 frames. 
], batch size: 372, lr: 4.46e-03, grad_scale: 32.0 2023-06-22 01:39:21,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1112202.0, ans=0.125 2023-06-22 01:40:10,425 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:40:25,234 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 3.130e+02 3.615e+02 4.219e+02 6.751e+02, threshold=7.231e+02, percent-clipped=0.0 2023-06-22 01:40:32,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.74 vs. limit=6.0 2023-06-22 01:40:53,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.15 vs. limit=22.5 2023-06-22 01:40:56,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1112442.0, ans=0.1 2023-06-22 01:40:59,135 INFO [train.py:996] (0/4) Epoch 7, batch 2450, loss[loss=0.2839, simple_loss=0.3557, pruned_loss=0.1061, over 21578.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3217, pruned_loss=0.08886, over 4279004.83 frames. ], batch size: 389, lr: 4.46e-03, grad_scale: 32.0 2023-06-22 01:41:47,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1112622.0, ans=0.2 2023-06-22 01:41:51,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1112622.0, ans=0.2 2023-06-22 01:42:06,323 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.44 vs. limit=15.0 2023-06-22 01:42:27,642 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:42:40,171 INFO [train.py:996] (0/4) Epoch 7, batch 2500, loss[loss=0.2414, simple_loss=0.3234, pruned_loss=0.07967, over 21158.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3183, pruned_loss=0.08741, over 4284565.46 frames. ], batch size: 548, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:43:25,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1112922.0, ans=0.125 2023-06-22 01:43:40,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1112922.0, ans=0.125 2023-06-22 01:43:48,534 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.102e+02 3.610e+02 4.513e+02 8.483e+02, threshold=7.220e+02, percent-clipped=3.0 2023-06-22 01:44:20,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1113102.0, ans=0.04949747468305833 2023-06-22 01:44:21,340 INFO [train.py:996] (0/4) Epoch 7, batch 2550, loss[loss=0.2247, simple_loss=0.2908, pruned_loss=0.07932, over 21774.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3153, pruned_loss=0.08554, over 4282879.67 frames. 
], batch size: 351, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:44:59,317 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:45:57,695 INFO [train.py:996] (0/4) Epoch 7, batch 2600, loss[loss=0.2226, simple_loss=0.3111, pruned_loss=0.06698, over 21297.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3192, pruned_loss=0.08808, over 4285375.40 frames. ], batch size: 176, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:46:03,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-22 01:46:19,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1113462.0, ans=0.125 2023-06-22 01:46:37,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1113462.0, ans=0.2 2023-06-22 01:46:49,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=12.0 2023-06-22 01:46:57,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1113582.0, ans=0.125 2023-06-22 01:47:06,150 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.648e+02 3.288e+02 3.616e+02 4.316e+02 7.089e+02, threshold=7.232e+02, percent-clipped=0.0 2023-06-22 01:47:26,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1113642.0, ans=0.07 2023-06-22 01:47:31,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1113642.0, ans=0.125 2023-06-22 01:47:33,835 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0 2023-06-22 01:47:39,084 INFO [train.py:996] (0/4) Epoch 7, batch 2650, loss[loss=0.2385, simple_loss=0.3062, pruned_loss=0.08543, over 21852.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3209, pruned_loss=0.09028, over 4285451.49 frames. ], batch size: 414, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:48:23,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1113822.0, ans=0.1 2023-06-22 01:48:28,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1113822.0, ans=0.07 2023-06-22 01:49:19,727 INFO [train.py:996] (0/4) Epoch 7, batch 2700, loss[loss=0.1881, simple_loss=0.2549, pruned_loss=0.06068, over 21270.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3182, pruned_loss=0.0896, over 4277453.86 frames. ], batch size: 176, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:49:51,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. 
limit=6.0 2023-06-22 01:50:08,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1114122.0, ans=10.0 2023-06-22 01:50:28,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.479e+02 2.952e+02 3.435e+02 4.194e+02 7.834e+02, threshold=6.870e+02, percent-clipped=2.0 2023-06-22 01:51:00,687 INFO [train.py:996] (0/4) Epoch 7, batch 2750, loss[loss=0.2587, simple_loss=0.3337, pruned_loss=0.09192, over 21872.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3149, pruned_loss=0.08831, over 4280279.48 frames. ], batch size: 107, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:52:10,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1114482.0, ans=0.2 2023-06-22 01:52:53,320 INFO [train.py:996] (0/4) Epoch 7, batch 2800, loss[loss=0.2307, simple_loss=0.3091, pruned_loss=0.07619, over 21077.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3185, pruned_loss=0.08894, over 4281931.63 frames. ], batch size: 607, lr: 4.45e-03, grad_scale: 32.0 2023-06-22 01:52:55,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1114602.0, ans=0.125 2023-06-22 01:53:11,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1114662.0, ans=0.125 2023-06-22 01:53:35,565 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:53:53,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1114782.0, ans=0.0 2023-06-22 01:53:57,891 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 3.206e+02 3.798e+02 4.545e+02 8.220e+02, threshold=7.596e+02, percent-clipped=2.0 2023-06-22 01:54:18,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1114842.0, ans=0.04949747468305833 2023-06-22 01:54:35,973 INFO [train.py:996] (0/4) Epoch 7, batch 2850, loss[loss=0.2973, simple_loss=0.3626, pruned_loss=0.116, over 21361.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3202, pruned_loss=0.08952, over 4276137.00 frames. 
], batch size: 549, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:54:42,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1114902.0, ans=0.125 2023-06-22 01:54:58,745 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:55:02,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1114962.0, ans=0.2 2023-06-22 01:55:27,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1115022.0, ans=0.125 2023-06-22 01:55:39,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1115082.0, ans=0.0 2023-06-22 01:55:49,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1115142.0, ans=0.2 2023-06-22 01:55:59,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1115142.0, ans=0.07 2023-06-22 01:56:01,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1115142.0, ans=0.125 2023-06-22 01:56:01,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1115142.0, ans=0.0 2023-06-22 01:56:15,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1115202.0, ans=0.125 2023-06-22 01:56:16,770 INFO [train.py:996] (0/4) Epoch 7, batch 2900, loss[loss=0.2432, simple_loss=0.3061, pruned_loss=0.09014, over 21830.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3195, pruned_loss=0.0901, over 4277429.43 frames. ], batch size: 282, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:56:45,290 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:56:59,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-22 01:57:20,508 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-22 01:57:22,702 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.184e+02 3.787e+02 4.850e+02 9.590e+02, threshold=7.574e+02, percent-clipped=4.0 2023-06-22 01:57:58,399 INFO [train.py:996] (0/4) Epoch 7, batch 2950, loss[loss=0.2443, simple_loss=0.3051, pruned_loss=0.09177, over 21870.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3198, pruned_loss=0.09004, over 4286359.88 frames. ], batch size: 298, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:58:48,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1115622.0, ans=0.0 2023-06-22 01:59:22,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1115742.0, ans=0.125 2023-06-22 01:59:39,998 INFO [train.py:996] (0/4) Epoch 7, batch 3000, loss[loss=0.2778, simple_loss=0.3505, pruned_loss=0.1025, over 21423.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3242, pruned_loss=0.09014, over 4287653.03 frames. 
], batch size: 159, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:59:40,000 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 01:59:56,487 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2473, simple_loss=0.3435, pruned_loss=0.07556, over 1796401.00 frames. 2023-06-22 01:59:56,488 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-22 02:00:25,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1115862.0, ans=0.125 2023-06-22 02:00:53,645 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-22 02:01:11,452 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.189e+02 3.683e+02 4.814e+02 8.214e+02, threshold=7.366e+02, percent-clipped=1.0 2023-06-22 02:01:18,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-22 02:01:36,620 INFO [train.py:996] (0/4) Epoch 7, batch 3050, loss[loss=0.2276, simple_loss=0.3099, pruned_loss=0.07268, over 21833.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3247, pruned_loss=0.08864, over 4287186.48 frames. ], batch size: 371, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:01:37,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1116102.0, ans=0.125 2023-06-22 02:01:40,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1116102.0, ans=0.0 2023-06-22 02:02:54,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1116282.0, ans=0.1 2023-06-22 02:03:01,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1116342.0, ans=0.125 2023-06-22 02:03:24,166 INFO [train.py:996] (0/4) Epoch 7, batch 3100, loss[loss=0.2095, simple_loss=0.2828, pruned_loss=0.06812, over 21487.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3237, pruned_loss=0.08737, over 4295416.36 frames. ], batch size: 211, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:03:44,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1116462.0, ans=0.125 2023-06-22 02:04:02,471 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:04:15,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1116522.0, ans=0.125 2023-06-22 02:04:34,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 2.895e+02 3.297e+02 4.092e+02 7.123e+02, threshold=6.595e+02, percent-clipped=0.0 2023-06-22 02:04:53,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1116642.0, ans=0.2 2023-06-22 02:04:54,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1116642.0, ans=0.125 2023-06-22 02:05:11,854 INFO [train.py:996] (0/4) Epoch 7, batch 3150, loss[loss=0.2697, simple_loss=0.3414, pruned_loss=0.09899, over 21704.00 frames. 
], tot_loss[loss=0.2516, simple_loss=0.326, pruned_loss=0.08864, over 4295223.16 frames. ], batch size: 351, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:06:52,854 INFO [train.py:996] (0/4) Epoch 7, batch 3200, loss[loss=0.2093, simple_loss=0.2976, pruned_loss=0.06049, over 21788.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.327, pruned_loss=0.08805, over 4294422.95 frames. ], batch size: 282, lr: 4.45e-03, grad_scale: 32.0 2023-06-22 02:07:04,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1117002.0, ans=0.125 2023-06-22 02:08:05,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.937e+02 3.458e+02 4.160e+02 8.829e+02, threshold=6.916e+02, percent-clipped=6.0 2023-06-22 02:08:11,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.11 vs. limit=22.5 2023-06-22 02:08:22,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-22 02:08:34,756 INFO [train.py:996] (0/4) Epoch 7, batch 3250, loss[loss=0.3104, simple_loss=0.3416, pruned_loss=0.1397, over 21451.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3283, pruned_loss=0.09, over 4292445.33 frames. ], batch size: 510, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:08:41,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1117302.0, ans=0.125 2023-06-22 02:09:17,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1117422.0, ans=0.0 2023-06-22 02:09:40,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1117482.0, ans=0.125 2023-06-22 02:10:01,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1117542.0, ans=0.125 2023-06-22 02:10:06,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1117542.0, ans=0.0 2023-06-22 02:10:06,706 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=12.0 2023-06-22 02:10:07,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1117542.0, ans=0.125 2023-06-22 02:10:20,748 INFO [train.py:996] (0/4) Epoch 7, batch 3300, loss[loss=0.236, simple_loss=0.3288, pruned_loss=0.07157, over 21731.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3227, pruned_loss=0.08957, over 4286222.66 frames. ], batch size: 282, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:11:16,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-22 02:11:26,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 2.992e+02 3.641e+02 4.480e+02 7.487e+02, threshold=7.281e+02, percent-clipped=2.0 2023-06-22 02:11:27,524 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. 
limit=12.0 2023-06-22 02:11:54,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1117902.0, ans=0.1 2023-06-22 02:12:00,473 INFO [train.py:996] (0/4) Epoch 7, batch 3350, loss[loss=0.2884, simple_loss=0.3463, pruned_loss=0.1152, over 21943.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3215, pruned_loss=0.0885, over 4278695.53 frames. ], batch size: 316, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:12:09,280 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.08 vs. limit=22.5 2023-06-22 02:12:51,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1118022.0, ans=0.0 2023-06-22 02:13:18,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1118082.0, ans=0.125 2023-06-22 02:13:46,400 INFO [train.py:996] (0/4) Epoch 7, batch 3400, loss[loss=0.2419, simple_loss=0.3105, pruned_loss=0.08661, over 21611.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3214, pruned_loss=0.08919, over 4280952.53 frames. ], batch size: 247, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:14:08,752 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:14:52,637 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 3.058e+02 3.533e+02 4.088e+02 6.686e+02, threshold=7.066e+02, percent-clipped=0.0 2023-06-22 02:14:59,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1118442.0, ans=0.5 2023-06-22 02:15:13,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1118442.0, ans=0.125 2023-06-22 02:15:26,633 INFO [train.py:996] (0/4) Epoch 7, batch 3450, loss[loss=0.2238, simple_loss=0.2919, pruned_loss=0.07783, over 21190.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.318, pruned_loss=0.08911, over 4276452.93 frames. ], batch size: 143, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:15:27,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1118502.0, ans=0.2 2023-06-22 02:15:40,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1118502.0, ans=0.125 2023-06-22 02:16:10,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1118622.0, ans=0.0 2023-06-22 02:16:24,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1118682.0, ans=0.0 2023-06-22 02:16:33,178 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.902e-02 2023-06-22 02:17:03,877 INFO [train.py:996] (0/4) Epoch 7, batch 3500, loss[loss=0.3065, simple_loss=0.3913, pruned_loss=0.1109, over 21588.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3265, pruned_loss=0.09192, over 4273576.75 frames. ], batch size: 263, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:17:35,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. 
limit=12.0 2023-06-22 02:17:37,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-06-22 02:18:04,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1118982.0, ans=0.0 2023-06-22 02:18:20,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 3.453e+02 3.920e+02 4.708e+02 8.175e+02, threshold=7.839e+02, percent-clipped=4.0 2023-06-22 02:18:31,225 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-22 02:18:35,436 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:18:44,636 INFO [train.py:996] (0/4) Epoch 7, batch 3550, loss[loss=0.2596, simple_loss=0.3157, pruned_loss=0.1017, over 21874.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3293, pruned_loss=0.09282, over 4268835.48 frames. ], batch size: 372, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:19:14,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1119162.0, ans=0.125 2023-06-22 02:19:32,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1119222.0, ans=0.125 2023-06-22 02:20:10,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1119342.0, ans=0.07 2023-06-22 02:20:20,930 INFO [train.py:996] (0/4) Epoch 7, batch 3600, loss[loss=0.2596, simple_loss=0.3525, pruned_loss=0.08329, over 20667.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3238, pruned_loss=0.09174, over 4267647.44 frames. ], batch size: 607, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:20:51,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1119462.0, ans=0.1 2023-06-22 02:20:51,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1119462.0, ans=0.125 2023-06-22 02:21:14,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2023-06-22 02:21:38,452 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.375e+02 4.106e+02 5.045e+02 9.366e+02, threshold=8.213e+02, percent-clipped=2.0 2023-06-22 02:21:43,067 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.11 vs. limit=15.0 2023-06-22 02:21:49,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1119642.0, ans=0.1 2023-06-22 02:22:03,448 INFO [train.py:996] (0/4) Epoch 7, batch 3650, loss[loss=0.3355, simple_loss=0.4468, pruned_loss=0.1121, over 19934.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3265, pruned_loss=0.09257, over 4269880.42 frames. 
], batch size: 702, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:22:28,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1119762.0, ans=0.125 2023-06-22 02:22:39,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1119762.0, ans=0.04949747468305833 2023-06-22 02:22:55,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1119822.0, ans=0.0 2023-06-22 02:23:06,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1119882.0, ans=0.05 2023-06-22 02:23:43,422 INFO [train.py:996] (0/4) Epoch 7, batch 3700, loss[loss=0.25, simple_loss=0.3258, pruned_loss=0.08714, over 21839.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3258, pruned_loss=0.09144, over 4275960.66 frames. ], batch size: 414, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:24:25,007 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.04 vs. limit=12.0 2023-06-22 02:24:30,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-22 02:24:52,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1120182.0, ans=0.0 2023-06-22 02:24:55,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1120182.0, ans=0.2 2023-06-22 02:25:01,664 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 3.250e+02 3.889e+02 4.859e+02 8.141e+02, threshold=7.777e+02, percent-clipped=0.0 2023-06-22 02:25:24,506 INFO [train.py:996] (0/4) Epoch 7, batch 3750, loss[loss=0.1448, simple_loss=0.1868, pruned_loss=0.05143, over 16872.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3237, pruned_loss=0.08988, over 4272690.68 frames. ], batch size: 60, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:25:25,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1120302.0, ans=0.05 2023-06-22 02:26:33,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0 2023-06-22 02:27:10,053 INFO [train.py:996] (0/4) Epoch 7, batch 3800, loss[loss=0.2526, simple_loss=0.3173, pruned_loss=0.09395, over 21375.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.321, pruned_loss=0.08825, over 4282664.89 frames. ], batch size: 176, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:27:11,363 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.89 vs. 
limit=22.5 2023-06-22 02:27:15,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1120602.0, ans=0.0 2023-06-22 02:27:28,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1120602.0, ans=0.2 2023-06-22 02:27:50,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1120662.0, ans=0.0 2023-06-22 02:28:09,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1120782.0, ans=0.1 2023-06-22 02:28:11,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1120782.0, ans=0.125 2023-06-22 02:28:18,671 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.223e+02 3.848e+02 4.886e+02 9.152e+02, threshold=7.696e+02, percent-clipped=1.0 2023-06-22 02:28:37,631 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-22 02:28:46,258 INFO [train.py:996] (0/4) Epoch 7, batch 3850, loss[loss=0.2793, simple_loss=0.3133, pruned_loss=0.1226, over 21403.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3199, pruned_loss=0.08912, over 4281221.74 frames. ], batch size: 509, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:28:46,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1120902.0, ans=0.125 2023-06-22 02:29:04,606 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-22 02:29:23,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.07 vs. limit=6.0 2023-06-22 02:29:37,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1121022.0, ans=0.125 2023-06-22 02:30:25,865 INFO [train.py:996] (0/4) Epoch 7, batch 3900, loss[loss=0.3645, simple_loss=0.467, pruned_loss=0.131, over 21225.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3176, pruned_loss=0.08907, over 4276755.39 frames. 
], batch size: 549, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:30:40,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1121202.0, ans=0.125 2023-06-22 02:31:36,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1121382.0, ans=0.04949747468305833 2023-06-22 02:31:38,630 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.017e+02 3.574e+02 4.086e+02 6.704e+02, threshold=7.148e+02, percent-clipped=0.0 2023-06-22 02:31:47,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1121442.0, ans=0.2 2023-06-22 02:31:59,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1121442.0, ans=0.2 2023-06-22 02:32:06,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1121442.0, ans=0.125 2023-06-22 02:32:12,153 INFO [train.py:996] (0/4) Epoch 7, batch 3950, loss[loss=0.2282, simple_loss=0.2992, pruned_loss=0.07864, over 20149.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3181, pruned_loss=0.08813, over 4267732.43 frames. ], batch size: 703, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:33:50,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1121742.0, ans=0.0 2023-06-22 02:33:53,242 INFO [train.py:996] (0/4) Epoch 7, batch 4000, loss[loss=0.2226, simple_loss=0.28, pruned_loss=0.08259, over 21522.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3119, pruned_loss=0.08487, over 4265585.72 frames. ], batch size: 195, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:33:56,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1121802.0, ans=0.0 2023-06-22 02:34:36,395 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.71 vs. limit=15.0 2023-06-22 02:35:00,549 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 2.938e+02 3.344e+02 4.048e+02 7.852e+02, threshold=6.687e+02, percent-clipped=1.0 2023-06-22 02:35:34,241 INFO [train.py:996] (0/4) Epoch 7, batch 4050, loss[loss=0.2225, simple_loss=0.3089, pruned_loss=0.0681, over 21719.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3124, pruned_loss=0.08433, over 4267067.86 frames. ], batch size: 389, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:35:45,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1122102.0, ans=0.2 2023-06-22 02:36:12,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1122162.0, ans=0.2 2023-06-22 02:36:21,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1122222.0, ans=0.2 2023-06-22 02:36:22,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.42 vs. 
limit=10.0 2023-06-22 02:36:26,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1122222.0, ans=0.0 2023-06-22 02:37:04,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1122342.0, ans=0.0 2023-06-22 02:37:13,279 INFO [train.py:996] (0/4) Epoch 7, batch 4100, loss[loss=0.2236, simple_loss=0.2952, pruned_loss=0.07598, over 21583.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3132, pruned_loss=0.08424, over 4273738.30 frames. ], batch size: 548, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:37:13,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1122402.0, ans=0.2 2023-06-22 02:37:39,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1122462.0, ans=0.0 2023-06-22 02:38:26,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.663e+02 2.992e+02 3.569e+02 4.943e+02, threshold=5.983e+02, percent-clipped=0.0 2023-06-22 02:38:53,890 INFO [train.py:996] (0/4) Epoch 7, batch 4150, loss[loss=0.2094, simple_loss=0.2975, pruned_loss=0.06066, over 21649.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3128, pruned_loss=0.08155, over 4271530.26 frames. ], batch size: 263, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:39:01,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1122702.0, ans=0.125 2023-06-22 02:39:13,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1122702.0, ans=0.125 2023-06-22 02:39:52,008 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.75 vs. limit=15.0 2023-06-22 02:40:17,027 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.39 vs. limit=12.0 2023-06-22 02:40:45,133 INFO [train.py:996] (0/4) Epoch 7, batch 4200, loss[loss=0.2565, simple_loss=0.334, pruned_loss=0.0895, over 21304.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3135, pruned_loss=0.08221, over 4273105.51 frames. ], batch size: 548, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:41:07,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1123062.0, ans=0.1 2023-06-22 02:41:23,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1123122.0, ans=0.025 2023-06-22 02:41:56,345 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 3.116e+02 3.775e+02 4.930e+02 8.993e+02, threshold=7.550e+02, percent-clipped=12.0 2023-06-22 02:42:27,369 INFO [train.py:996] (0/4) Epoch 7, batch 4250, loss[loss=0.2528, simple_loss=0.374, pruned_loss=0.06583, over 19762.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3212, pruned_loss=0.08459, over 4270217.40 frames. 
], batch size: 702, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:42:35,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1123302.0, ans=0.1 2023-06-22 02:44:04,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1123542.0, ans=0.125 2023-06-22 02:44:04,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1123542.0, ans=0.07 2023-06-22 02:44:10,815 INFO [train.py:996] (0/4) Epoch 7, batch 4300, loss[loss=0.2539, simple_loss=0.307, pruned_loss=0.1004, over 20222.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3258, pruned_loss=0.08637, over 4270339.02 frames. ], batch size: 702, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:44:36,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1123662.0, ans=0.0 2023-06-22 02:44:46,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1123662.0, ans=0.125 2023-06-22 02:45:01,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1123722.0, ans=0.0 2023-06-22 02:45:27,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1123782.0, ans=0.125 2023-06-22 02:45:31,296 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.519e+02 4.292e+02 5.383e+02 8.752e+02, threshold=8.584e+02, percent-clipped=3.0 2023-06-22 02:45:46,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1123842.0, ans=0.2 2023-06-22 02:45:52,096 INFO [train.py:996] (0/4) Epoch 7, batch 4350, loss[loss=0.2154, simple_loss=0.2792, pruned_loss=0.07582, over 21898.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3248, pruned_loss=0.08536, over 4268161.68 frames. ], batch size: 107, lr: 4.43e-03, grad_scale: 8.0 2023-06-22 02:46:36,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1124022.0, ans=0.1 2023-06-22 02:46:37,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-22 02:46:49,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1124022.0, ans=0.1 2023-06-22 02:47:03,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1124082.0, ans=0.0 2023-06-22 02:47:14,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1124082.0, ans=0.125 2023-06-22 02:47:17,018 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-22 02:47:39,368 INFO [train.py:996] (0/4) Epoch 7, batch 4400, loss[loss=0.2344, simple_loss=0.3303, pruned_loss=0.06926, over 21782.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3211, pruned_loss=0.08462, over 4265098.52 frames. 
], batch size: 282, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:47:56,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1124202.0, ans=0.125 2023-06-22 02:48:51,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1124382.0, ans=0.125 2023-06-22 02:48:57,465 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.429e+02 3.105e+02 3.566e+02 4.210e+02 6.733e+02, threshold=7.132e+02, percent-clipped=0.0 2023-06-22 02:49:05,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1124442.0, ans=0.0 2023-06-22 02:49:17,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1124442.0, ans=0.125 2023-06-22 02:49:21,685 INFO [train.py:996] (0/4) Epoch 7, batch 4450, loss[loss=0.2726, simple_loss=0.3353, pruned_loss=0.105, over 21784.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3266, pruned_loss=0.08641, over 4266147.18 frames. ], batch size: 124, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:51:06,309 INFO [train.py:996] (0/4) Epoch 7, batch 4500, loss[loss=0.2311, simple_loss=0.3223, pruned_loss=0.07, over 21615.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3272, pruned_loss=0.08729, over 4275235.02 frames. ], batch size: 230, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:51:46,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1124922.0, ans=0.125 2023-06-22 02:52:14,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1124982.0, ans=0.0 2023-06-22 02:52:22,687 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.121e+02 3.546e+02 4.306e+02 8.092e+02, threshold=7.092e+02, percent-clipped=3.0 2023-06-22 02:52:47,552 INFO [train.py:996] (0/4) Epoch 7, batch 4550, loss[loss=0.218, simple_loss=0.2773, pruned_loss=0.07938, over 20714.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3291, pruned_loss=0.08732, over 4278262.38 frames. ], batch size: 607, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:53:40,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1125222.0, ans=0.04949747468305833 2023-06-22 02:53:43,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1125222.0, ans=10.0 2023-06-22 02:54:33,355 INFO [train.py:996] (0/4) Epoch 7, batch 4600, loss[loss=0.241, simple_loss=0.3155, pruned_loss=0.08323, over 21911.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3306, pruned_loss=0.08906, over 4279953.77 frames. ], batch size: 107, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:55:43,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 3.156e+02 3.559e+02 4.324e+02 6.713e+02, threshold=7.117e+02, percent-clipped=0.0 2023-06-22 02:55:58,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1125642.0, ans=0.125 2023-06-22 02:56:13,825 INFO [train.py:996] (0/4) Epoch 7, batch 4650, loss[loss=0.2191, simple_loss=0.2908, pruned_loss=0.0737, over 21883.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3266, pruned_loss=0.08819, over 4289267.19 frames. 
], batch size: 118, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:56:16,524 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-06-22 02:56:31,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1125702.0, ans=0.0 2023-06-22 02:56:38,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1125762.0, ans=0.2 2023-06-22 02:56:59,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1125822.0, ans=0.125 2023-06-22 02:57:15,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1125882.0, ans=0.2 2023-06-22 02:57:50,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1125942.0, ans=0.0 2023-06-22 02:57:58,274 INFO [train.py:996] (0/4) Epoch 7, batch 4700, loss[loss=0.2084, simple_loss=0.2707, pruned_loss=0.07299, over 21595.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.317, pruned_loss=0.08537, over 4280376.17 frames. ], batch size: 263, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:58:02,627 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-22 02:58:02,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.81 vs. limit=15.0 2023-06-22 02:58:17,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1126062.0, ans=0.07 2023-06-22 02:58:19,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1126062.0, ans=0.125 2023-06-22 02:58:20,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1126062.0, ans=0.125 2023-06-22 02:58:40,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1126122.0, ans=0.125 2023-06-22 02:59:02,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-22 02:59:03,089 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.887e+02 3.239e+02 4.275e+02 6.571e+02, threshold=6.478e+02, percent-clipped=0.0 2023-06-22 02:59:31,474 INFO [train.py:996] (0/4) Epoch 7, batch 4750, loss[loss=0.2179, simple_loss=0.2769, pruned_loss=0.0794, over 21765.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3121, pruned_loss=0.08536, over 4269305.23 frames. ], batch size: 247, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:59:45,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1126302.0, ans=0.1 2023-06-22 03:00:52,182 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.96 vs. 
limit=10.0 2023-06-22 03:00:52,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1126542.0, ans=0.125 2023-06-22 03:01:16,471 INFO [train.py:996] (0/4) Epoch 7, batch 4800, loss[loss=0.2447, simple_loss=0.3128, pruned_loss=0.08833, over 21732.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3136, pruned_loss=0.08586, over 4271957.05 frames. ], batch size: 112, lr: 4.43e-03, grad_scale: 32.0 2023-06-22 03:01:46,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1126662.0, ans=0.1 2023-06-22 03:02:22,135 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-22 03:02:23,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.114e+02 3.560e+02 4.146e+02 5.866e+02, threshold=7.121e+02, percent-clipped=0.0 2023-06-22 03:02:56,171 INFO [train.py:996] (0/4) Epoch 7, batch 4850, loss[loss=0.2361, simple_loss=0.3053, pruned_loss=0.0834, over 21767.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3111, pruned_loss=0.08568, over 4268854.68 frames. ], batch size: 282, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:03:34,567 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-22 03:03:35,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1127022.0, ans=0.125 2023-06-22 03:03:55,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-06-22 03:04:30,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1127142.0, ans=0.1 2023-06-22 03:04:36,856 INFO [train.py:996] (0/4) Epoch 7, batch 4900, loss[loss=0.2543, simple_loss=0.3291, pruned_loss=0.08974, over 21345.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3158, pruned_loss=0.08664, over 4269715.89 frames. ], batch size: 549, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:05:13,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1127322.0, ans=0.05 2023-06-22 03:05:20,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-22 03:05:55,058 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.055e+02 3.401e+02 3.973e+02 6.495e+02, threshold=6.802e+02, percent-clipped=0.0 2023-06-22 03:06:13,338 INFO [train.py:996] (0/4) Epoch 7, batch 4950, loss[loss=0.2253, simple_loss=0.3241, pruned_loss=0.06321, over 21713.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3173, pruned_loss=0.0843, over 4265258.33 frames. ], batch size: 298, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:06:17,590 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. 
limit=15.0 2023-06-22 03:07:04,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1127682.0, ans=0.125 2023-06-22 03:07:27,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.93 vs. limit=15.0 2023-06-22 03:07:27,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1127682.0, ans=0.0 2023-06-22 03:07:52,453 INFO [train.py:996] (0/4) Epoch 7, batch 5000, loss[loss=0.2145, simple_loss=0.2822, pruned_loss=0.07344, over 20124.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3178, pruned_loss=0.08176, over 4274488.92 frames. ], batch size: 702, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:08:08,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1127862.0, ans=0.125 2023-06-22 03:08:28,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1127922.0, ans=0.2 2023-06-22 03:08:42,011 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-188000.pt 2023-06-22 03:09:08,698 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.797e+02 3.102e+02 3.690e+02 6.361e+02, threshold=6.203e+02, percent-clipped=0.0 2023-06-22 03:09:30,779 INFO [train.py:996] (0/4) Epoch 7, batch 5050, loss[loss=0.2349, simple_loss=0.3537, pruned_loss=0.05811, over 20700.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3181, pruned_loss=0.08354, over 4283443.67 frames. ], batch size: 607, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:09:49,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.91 vs. limit=15.0 2023-06-22 03:10:33,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1128282.0, ans=0.09899494936611666 2023-06-22 03:10:46,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=15.0 2023-06-22 03:11:05,897 INFO [train.py:996] (0/4) Epoch 7, batch 5100, loss[loss=0.2172, simple_loss=0.2826, pruned_loss=0.07597, over 21454.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3172, pruned_loss=0.08399, over 4283386.25 frames. ], batch size: 194, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:12:15,580 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-22 03:12:17,466 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.054e+02 3.634e+02 4.109e+02 8.042e+02, threshold=7.267e+02, percent-clipped=4.0 2023-06-22 03:12:40,007 INFO [train.py:996] (0/4) Epoch 7, batch 5150, loss[loss=0.2828, simple_loss=0.3382, pruned_loss=0.1137, over 21842.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3152, pruned_loss=0.08508, over 4289445.26 frames. ], batch size: 351, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:12:40,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1128702.0, ans=0.2 2023-06-22 03:12:53,977 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.62 vs. 
limit=15.0 2023-06-22 03:13:16,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-22 03:14:05,420 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.92 vs. limit=15.0 2023-06-22 03:14:15,500 INFO [train.py:996] (0/4) Epoch 7, batch 5200, loss[loss=0.2479, simple_loss=0.3383, pruned_loss=0.07873, over 21727.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3182, pruned_loss=0.08492, over 4289360.59 frames. ], batch size: 247, lr: 4.42e-03, grad_scale: 32.0 2023-06-22 03:14:34,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1129062.0, ans=0.125 2023-06-22 03:14:43,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1129062.0, ans=0.1 2023-06-22 03:14:45,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.32 vs. limit=12.0 2023-06-22 03:15:04,426 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:15:17,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=12.0 2023-06-22 03:15:36,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-22 03:15:38,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.246e+02 3.920e+02 4.845e+02 8.696e+02, threshold=7.839e+02, percent-clipped=4.0 2023-06-22 03:15:45,269 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:15:54,280 INFO [train.py:996] (0/4) Epoch 7, batch 5250, loss[loss=0.1839, simple_loss=0.2517, pruned_loss=0.05806, over 21895.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3211, pruned_loss=0.08318, over 4286157.13 frames. ], batch size: 98, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:15:55,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-22 03:15:58,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-22 03:17:15,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1129542.0, ans=0.0 2023-06-22 03:17:15,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1129542.0, ans=0.0 2023-06-22 03:17:18,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1129542.0, ans=0.125 2023-06-22 03:17:28,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1129542.0, ans=0.125 2023-06-22 03:17:32,622 INFO [train.py:996] (0/4) Epoch 7, batch 5300, loss[loss=0.2504, simple_loss=0.3172, pruned_loss=0.09175, over 21881.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.32, pruned_loss=0.08341, over 4287874.28 frames. 
], batch size: 371, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:17:36,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-22 03:17:38,208 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=22.5 2023-06-22 03:17:48,156 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.45 vs. limit=10.0 2023-06-22 03:17:54,767 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=12.0 2023-06-22 03:18:49,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-22 03:18:54,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.014e+02 3.652e+02 4.152e+02 6.819e+02, threshold=7.305e+02, percent-clipped=0.0 2023-06-22 03:19:03,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1129842.0, ans=0.0 2023-06-22 03:19:09,873 INFO [train.py:996] (0/4) Epoch 7, batch 5350, loss[loss=0.2578, simple_loss=0.3217, pruned_loss=0.097, over 21743.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3193, pruned_loss=0.08518, over 4293392.18 frames. ], batch size: 389, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:19:39,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0 2023-06-22 03:19:48,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1130022.0, ans=0.125 2023-06-22 03:20:50,434 INFO [train.py:996] (0/4) Epoch 7, batch 5400, loss[loss=0.2576, simple_loss=0.3164, pruned_loss=0.09943, over 21354.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3166, pruned_loss=0.08559, over 4296016.53 frames. 
], batch size: 144, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:20:54,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1130202.0, ans=0.0 2023-06-22 03:21:07,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1130262.0, ans=0.1 2023-06-22 03:21:07,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1130262.0, ans=0.2 2023-06-22 03:21:34,158 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:22:14,207 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.826e+02 3.234e+02 3.815e+02 6.268e+02, threshold=6.469e+02, percent-clipped=0.0 2023-06-22 03:22:16,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1130442.0, ans=0.125 2023-06-22 03:22:20,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1130442.0, ans=0.0 2023-06-22 03:22:27,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1130442.0, ans=0.0 2023-06-22 03:22:30,513 INFO [train.py:996] (0/4) Epoch 7, batch 5450, loss[loss=0.2144, simple_loss=0.2986, pruned_loss=0.06516, over 21429.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3169, pruned_loss=0.08391, over 4297498.25 frames. ], batch size: 194, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:23:32,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1130622.0, ans=0.125 2023-06-22 03:23:50,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1130682.0, ans=10.0 2023-06-22 03:24:12,388 INFO [train.py:996] (0/4) Epoch 7, batch 5500, loss[loss=0.256, simple_loss=0.3595, pruned_loss=0.07631, over 21210.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3211, pruned_loss=0.08099, over 4289158.84 frames. ], batch size: 548, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:24:13,592 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-22 03:24:30,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1130802.0, ans=0.125 2023-06-22 03:24:36,821 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.84 vs. 
limit=15.0 2023-06-22 03:24:42,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1130862.0, ans=0.2 2023-06-22 03:24:56,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1130862.0, ans=0.125 2023-06-22 03:25:00,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1130922.0, ans=0.1 2023-06-22 03:25:31,998 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 3.055e+02 3.640e+02 4.334e+02 7.311e+02, threshold=7.280e+02, percent-clipped=2.0 2023-06-22 03:25:57,932 INFO [train.py:996] (0/4) Epoch 7, batch 5550, loss[loss=0.1931, simple_loss=0.2892, pruned_loss=0.04854, over 21684.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3214, pruned_loss=0.07897, over 4287745.64 frames. ], batch size: 298, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:26:24,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1131162.0, ans=0.0 2023-06-22 03:26:41,359 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:27:09,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1131282.0, ans=0.2 2023-06-22 03:27:30,188 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:27:43,821 INFO [train.py:996] (0/4) Epoch 7, batch 5600, loss[loss=0.352, simple_loss=0.437, pruned_loss=0.1335, over 21525.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3175, pruned_loss=0.0758, over 4292103.06 frames. ], batch size: 471, lr: 4.42e-03, grad_scale: 32.0 2023-06-22 03:28:06,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1131462.0, ans=0.1 2023-06-22 03:29:03,137 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.861e+02 3.641e+02 4.597e+02 1.091e+03, threshold=7.283e+02, percent-clipped=6.0 2023-06-22 03:29:13,852 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.22 vs. limit=12.0 2023-06-22 03:29:18,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1131642.0, ans=0.1 2023-06-22 03:29:22,542 INFO [train.py:996] (0/4) Epoch 7, batch 5650, loss[loss=0.2694, simple_loss=0.3407, pruned_loss=0.09906, over 21728.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3211, pruned_loss=0.07822, over 4285370.76 frames. ], batch size: 389, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:29:32,592 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. 
limit=15.0 2023-06-22 03:29:59,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1131822.0, ans=0.0 2023-06-22 03:30:12,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1131822.0, ans=0.125 2023-06-22 03:30:25,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1131882.0, ans=0.125 2023-06-22 03:30:34,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.76 vs. limit=8.0 2023-06-22 03:30:45,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1131942.0, ans=0.0 2023-06-22 03:31:07,576 INFO [train.py:996] (0/4) Epoch 7, batch 5700, loss[loss=0.262, simple_loss=0.3196, pruned_loss=0.1022, over 21336.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3203, pruned_loss=0.08061, over 4292076.47 frames. ], batch size: 159, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:32:19,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-22 03:32:33,706 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.008e+02 3.484e+02 4.188e+02 7.295e+02, threshold=6.968e+02, percent-clipped=1.0 2023-06-22 03:32:48,522 INFO [train.py:996] (0/4) Epoch 7, batch 5750, loss[loss=0.2195, simple_loss=0.3085, pruned_loss=0.06527, over 21776.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3155, pruned_loss=0.07785, over 4285776.28 frames. ], batch size: 333, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:33:05,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1132362.0, ans=0.125 2023-06-22 03:33:37,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-06-22 03:34:03,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1132482.0, ans=0.125 2023-06-22 03:34:13,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=22.5 2023-06-22 03:34:28,159 INFO [train.py:996] (0/4) Epoch 7, batch 5800, loss[loss=0.228, simple_loss=0.321, pruned_loss=0.06744, over 21754.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3152, pruned_loss=0.07649, over 4279541.89 frames. ], batch size: 282, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:35:51,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=15.0 2023-06-22 03:35:55,104 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.844e+02 3.714e+02 4.784e+02 7.655e+02, threshold=7.428e+02, percent-clipped=1.0 2023-06-22 03:36:10,055 INFO [train.py:996] (0/4) Epoch 7, batch 5850, loss[loss=0.1957, simple_loss=0.3073, pruned_loss=0.04208, over 21643.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3116, pruned_loss=0.07206, over 4272665.14 frames. 
], batch size: 414, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:36:14,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0 2023-06-22 03:36:40,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.29 vs. limit=6.0 2023-06-22 03:37:13,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1133022.0, ans=0.2 2023-06-22 03:37:49,873 INFO [train.py:996] (0/4) Epoch 7, batch 5900, loss[loss=0.206, simple_loss=0.2876, pruned_loss=0.06222, over 21643.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.3044, pruned_loss=0.06694, over 4271289.70 frames. ], batch size: 263, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:37:59,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1133202.0, ans=0.0 2023-06-22 03:38:16,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1133262.0, ans=0.0 2023-06-22 03:39:08,930 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.538e+02 2.951e+02 3.905e+02 7.879e+02, threshold=5.902e+02, percent-clipped=2.0 2023-06-22 03:39:23,031 INFO [train.py:996] (0/4) Epoch 7, batch 5950, loss[loss=0.2921, simple_loss=0.3922, pruned_loss=0.09597, over 21193.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3045, pruned_loss=0.07066, over 4268237.97 frames. ], batch size: 548, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:40:34,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1133682.0, ans=0.2 2023-06-22 03:41:00,558 INFO [train.py:996] (0/4) Epoch 7, batch 6000, loss[loss=0.2144, simple_loss=0.2766, pruned_loss=0.07607, over 21807.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3014, pruned_loss=0.0739, over 4278249.57 frames. ], batch size: 107, lr: 4.41e-03, grad_scale: 32.0 2023-06-22 03:41:00,559 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 03:41:21,111 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2587, simple_loss=0.3532, pruned_loss=0.08209, over 1796401.00 frames. 
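[Editor's note] The two records around this point show the periodic validation pass: every `valid_interval` batches the trainer reports a frame-normalized validation loss (split into simple_loss and pruned_loss) followed by the peak CUDA memory allocated so far. The sketch below is a minimal, hypothetical illustration of that control flow only; it is not the icefall implementation, and the model API (a forward call returning simple_loss, pruned_loss and a frame count), the "features" key in the batch dict, and the per-frame normalization are assumptions made for the example.

```python
import logging

import torch


def compute_validation_loss(model: torch.nn.Module, valid_loader,
                            device: torch.device) -> dict:
    """Accumulate loss components over the dev set (hypothetical model API)."""
    model.eval()
    tot = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0, "frames": 0.0}
    with torch.no_grad():
        for batch in valid_loader:
            feats = batch["features"].to(device)
            # Assumed interface: the model returns per-batch loss components
            # and the number of frames they were accumulated over.
            simple_loss, pruned_loss, num_frames = model(feats)
            tot["simple_loss"] += simple_loss.item()
            tot["pruned_loss"] += pruned_loss.item()
            tot["loss"] += (simple_loss + pruned_loss).item()
            tot["frames"] += float(num_frames)
    model.train()
    return tot


def maybe_validate(batch_idx: int, valid_interval: int, model, valid_loader,
                   device: torch.device) -> None:
    """Mirror the log: validate every `valid_interval` batches, then report
    the running peak of allocated CUDA memory."""
    if batch_idx % valid_interval != 0:
        return
    logging.info("Computing validation loss")
    tot = compute_validation_loss(model, valid_loader, device)
    frames = max(tot["frames"], 1.0)
    logging.info(
        "validation: loss=%.4g, simple_loss=%.4g, pruned_loss=%.4g, over %.2f frames.",
        tot["loss"] / frames, tot["simple_loss"] / frames,
        tot["pruned_loss"] / frames, tot["frames"],
    )
    if device.type == "cuda":
        mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        logging.info("Maximum memory allocated so far is %dMB", mem_mb)
```

The per-frame division is only one plausible way to arrive at the frame-normalized figures printed in the log; the actual normalization used by the training script may differ.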
2023-06-22 03:41:21,111 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-22 03:41:34,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1133802.0, ans=0.125 2023-06-22 03:41:46,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1133862.0, ans=0.09899494936611666 2023-06-22 03:41:59,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1133862.0, ans=0.2 2023-06-22 03:42:05,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1133922.0, ans=0.05 2023-06-22 03:42:13,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1133922.0, ans=0.125 2023-06-22 03:42:43,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 3.441e+02 4.226e+02 5.483e+02 1.064e+03, threshold=8.451e+02, percent-clipped=15.0 2023-06-22 03:43:01,832 INFO [train.py:996] (0/4) Epoch 7, batch 6050, loss[loss=0.2199, simple_loss=0.308, pruned_loss=0.06587, over 21383.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2982, pruned_loss=0.07435, over 4281922.62 frames. ], batch size: 471, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:43:21,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1134102.0, ans=0.125 2023-06-22 03:43:54,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1134222.0, ans=0.1 2023-06-22 03:44:39,795 INFO [train.py:996] (0/4) Epoch 7, batch 6100, loss[loss=0.2639, simple_loss=0.3209, pruned_loss=0.1035, over 21349.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2959, pruned_loss=0.07296, over 4284738.95 frames. 
], batch size: 159, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:45:08,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1134462.0, ans=0.2 2023-06-22 03:45:16,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1134462.0, ans=0.125 2023-06-22 03:45:18,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1134462.0, ans=0.1 2023-06-22 03:45:19,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1134522.0, ans=0.07 2023-06-22 03:45:25,118 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:45:32,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1134522.0, ans=0.125 2023-06-22 03:45:32,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1134522.0, ans=0.2 2023-06-22 03:45:36,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1134582.0, ans=0.125 2023-06-22 03:46:00,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1134642.0, ans=0.125 2023-06-22 03:46:01,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.844e+02 3.267e+02 3.769e+02 7.598e+02, threshold=6.534e+02, percent-clipped=0.0 2023-06-22 03:46:10,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1134642.0, ans=0.1 2023-06-22 03:46:23,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1134702.0, ans=0.0 2023-06-22 03:46:24,543 INFO [train.py:996] (0/4) Epoch 7, batch 6150, loss[loss=0.2341, simple_loss=0.3041, pruned_loss=0.08202, over 21608.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2992, pruned_loss=0.07584, over 4287334.83 frames. ], batch size: 230, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:46:48,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1134762.0, ans=0.1 2023-06-22 03:46:51,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1134762.0, ans=0.125 2023-06-22 03:47:22,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-22 03:48:02,665 INFO [train.py:996] (0/4) Epoch 7, batch 6200, loss[loss=0.2436, simple_loss=0.3304, pruned_loss=0.07841, over 21860.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.304, pruned_loss=0.0772, over 4283643.50 frames. ], batch size: 371, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:48:44,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1135122.0, ans=0.0 2023-06-22 03:48:45,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.25 vs. 
limit=15.0 2023-06-22 03:49:27,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=22.5 2023-06-22 03:49:28,288 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.024e+02 2.933e+02 3.461e+02 4.493e+02 7.617e+02, threshold=6.923e+02, percent-clipped=2.0 2023-06-22 03:49:41,013 INFO [train.py:996] (0/4) Epoch 7, batch 6250, loss[loss=0.2168, simple_loss=0.319, pruned_loss=0.05733, over 21835.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3106, pruned_loss=0.07782, over 4277978.22 frames. ], batch size: 316, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:50:31,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1135422.0, ans=0.0 2023-06-22 03:51:07,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1135542.0, ans=0.1 2023-06-22 03:51:25,837 INFO [train.py:996] (0/4) Epoch 7, batch 6300, loss[loss=0.2412, simple_loss=0.3074, pruned_loss=0.08752, over 21645.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.314, pruned_loss=0.07668, over 4286191.80 frames. ], batch size: 230, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:51:28,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1135602.0, ans=0.1 2023-06-22 03:52:02,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1135722.0, ans=0.0 2023-06-22 03:52:13,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1135722.0, ans=0.0 2023-06-22 03:52:17,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1135782.0, ans=0.2 2023-06-22 03:52:35,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.47 vs. limit=22.5 2023-06-22 03:52:52,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 3.056e+02 3.560e+02 4.261e+02 7.497e+02, threshold=7.120e+02, percent-clipped=1.0 2023-06-22 03:52:52,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1135842.0, ans=0.125 2023-06-22 03:53:05,195 INFO [train.py:996] (0/4) Epoch 7, batch 6350, loss[loss=0.3046, simple_loss=0.3628, pruned_loss=0.1232, over 21801.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3165, pruned_loss=0.08028, over 4286284.47 frames. ], batch size: 441, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:53:47,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136022.0, ans=0.1 2023-06-22 03:53:47,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1136022.0, ans=0.95 2023-06-22 03:54:23,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1136082.0, ans=0.07 2023-06-22 03:54:29,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. 
limit=15.0 2023-06-22 03:54:37,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1136142.0, ans=0.125 2023-06-22 03:54:45,902 INFO [train.py:996] (0/4) Epoch 7, batch 6400, loss[loss=0.2852, simple_loss=0.3486, pruned_loss=0.1109, over 21705.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3224, pruned_loss=0.08509, over 4282048.79 frames. ], batch size: 351, lr: 4.41e-03, grad_scale: 32.0 2023-06-22 03:55:27,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=22.5 2023-06-22 03:56:10,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.525e+02 3.095e+02 3.612e+02 4.103e+02 7.644e+02, threshold=7.224e+02, percent-clipped=1.0 2023-06-22 03:56:21,502 INFO [train.py:996] (0/4) Epoch 7, batch 6450, loss[loss=0.3123, simple_loss=0.3679, pruned_loss=0.1283, over 21391.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3264, pruned_loss=0.08543, over 4281410.16 frames. ], batch size: 507, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:56:28,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136502.0, ans=0.1 2023-06-22 03:56:33,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1136502.0, ans=0.04949747468305833 2023-06-22 03:56:40,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1136562.0, ans=0.2 2023-06-22 03:57:35,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1136682.0, ans=0.2 2023-06-22 03:57:46,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-22 03:57:51,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1136742.0, ans=0.0 2023-06-22 03:57:52,016 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.60 vs. limit=15.0 2023-06-22 03:58:00,833 INFO [train.py:996] (0/4) Epoch 7, batch 6500, loss[loss=0.2351, simple_loss=0.3174, pruned_loss=0.07634, over 21577.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3199, pruned_loss=0.08385, over 4277008.83 frames. ], batch size: 441, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:58:09,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1136802.0, ans=0.125 2023-06-22 03:58:09,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1136802.0, ans=0.125 2023-06-22 03:58:18,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136862.0, ans=0.1 2023-06-22 03:58:28,931 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. 
limit=6.0 2023-06-22 03:58:53,764 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:58:56,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-22 03:58:56,944 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:59:29,999 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 3.034e+02 3.497e+02 4.367e+02 8.159e+02, threshold=6.993e+02, percent-clipped=3.0 2023-06-22 03:59:40,153 INFO [train.py:996] (0/4) Epoch 7, batch 6550, loss[loss=0.2866, simple_loss=0.3459, pruned_loss=0.1136, over 21857.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3181, pruned_loss=0.08324, over 4270122.40 frames. ], batch size: 371, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:00:35,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1137222.0, ans=0.05 2023-06-22 04:00:39,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1137222.0, ans=0.1 2023-06-22 04:01:19,398 INFO [train.py:996] (0/4) Epoch 7, batch 6600, loss[loss=0.1897, simple_loss=0.2508, pruned_loss=0.06434, over 21255.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.312, pruned_loss=0.08216, over 4273789.27 frames. ], batch size: 176, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:01:26,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1137402.0, ans=0.0 2023-06-22 04:02:00,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1137462.0, ans=0.125 2023-06-22 04:02:12,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1137522.0, ans=0.0 2023-06-22 04:02:39,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1137582.0, ans=0.125 2023-06-22 04:02:49,275 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.768e+02 3.180e+02 3.748e+02 5.312e+02, threshold=6.360e+02, percent-clipped=0.0 2023-06-22 04:02:59,302 INFO [train.py:996] (0/4) Epoch 7, batch 6650, loss[loss=0.2325, simple_loss=0.2991, pruned_loss=0.08299, over 21743.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3037, pruned_loss=0.07977, over 4275318.58 frames. ], batch size: 352, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:03:32,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1137762.0, ans=0.1 2023-06-22 04:03:48,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1137762.0, ans=0.125 2023-06-22 04:04:31,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1137942.0, ans=0.125 2023-06-22 04:04:38,298 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:04:39,345 INFO [train.py:996] (0/4) Epoch 7, batch 6700, loss[loss=0.22, simple_loss=0.2829, pruned_loss=0.0786, over 21742.00 frames. 
], tot_loss[loss=0.229, simple_loss=0.2986, pruned_loss=0.07973, over 4268398.78 frames. ], batch size: 124, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:05:27,554 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:06:08,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.940e+02 3.406e+02 4.084e+02 6.605e+02, threshold=6.813e+02, percent-clipped=1.0 2023-06-22 04:06:10,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1138242.0, ans=0.125 2023-06-22 04:06:17,960 INFO [train.py:996] (0/4) Epoch 7, batch 6750, loss[loss=0.2536, simple_loss=0.32, pruned_loss=0.09353, over 21834.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2963, pruned_loss=0.07989, over 4261436.61 frames. ], batch size: 371, lr: 4.40e-03, grad_scale: 8.0 2023-06-22 04:06:24,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.19 vs. limit=15.0 2023-06-22 04:07:21,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1138422.0, ans=0.125 2023-06-22 04:07:25,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2023-06-22 04:07:45,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1138542.0, ans=0.0 2023-06-22 04:07:54,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1138602.0, ans=0.1 2023-06-22 04:07:55,323 INFO [train.py:996] (0/4) Epoch 7, batch 6800, loss[loss=0.2131, simple_loss=0.2819, pruned_loss=0.07215, over 21803.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2984, pruned_loss=0.08196, over 4270331.95 frames. ], batch size: 247, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:08:32,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1138662.0, ans=0.125 2023-06-22 04:09:01,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1138782.0, ans=0.125 2023-06-22 04:09:24,428 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 3.025e+02 3.541e+02 4.379e+02 6.653e+02, threshold=7.081e+02, percent-clipped=0.0 2023-06-22 04:09:27,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1138842.0, ans=0.1 2023-06-22 04:09:33,726 INFO [train.py:996] (0/4) Epoch 7, batch 6850, loss[loss=0.2249, simple_loss=0.2967, pruned_loss=0.07653, over 21858.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.2985, pruned_loss=0.08286, over 4272691.34 frames. 
], batch size: 316, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:10:26,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1139022.0, ans=0.0 2023-06-22 04:10:47,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1139082.0, ans=0.125 2023-06-22 04:10:58,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1139142.0, ans=0.125 2023-06-22 04:11:04,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1139142.0, ans=0.125 2023-06-22 04:11:14,111 INFO [train.py:996] (0/4) Epoch 7, batch 6900, loss[loss=0.2029, simple_loss=0.2863, pruned_loss=0.05979, over 21311.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3003, pruned_loss=0.08269, over 4270184.75 frames. ], batch size: 176, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:12:41,065 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.939e+02 3.532e+02 4.220e+02 8.926e+02, threshold=7.064e+02, percent-clipped=5.0 2023-06-22 04:12:55,508 INFO [train.py:996] (0/4) Epoch 7, batch 6950, loss[loss=0.2692, simple_loss=0.3465, pruned_loss=0.09595, over 21577.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3025, pruned_loss=0.07965, over 4270147.52 frames. ], batch size: 414, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:13:35,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=12.0 2023-06-22 04:13:41,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5 2023-06-22 04:13:49,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1139622.0, ans=0.125 2023-06-22 04:14:32,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1139742.0, ans=0.1 2023-06-22 04:14:35,507 INFO [train.py:996] (0/4) Epoch 7, batch 7000, loss[loss=0.2038, simple_loss=0.2675, pruned_loss=0.06998, over 21686.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3068, pruned_loss=0.08279, over 4260739.71 frames. ], batch size: 282, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:14:54,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1139802.0, ans=0.0 2023-06-22 04:15:15,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1139862.0, ans=0.2 2023-06-22 04:16:01,398 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.353e+02 3.073e+02 3.537e+02 4.508e+02 8.250e+02, threshold=7.073e+02, percent-clipped=4.0 2023-06-22 04:16:08,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=1140042.0, ans=12.0 2023-06-22 04:16:10,905 INFO [train.py:996] (0/4) Epoch 7, batch 7050, loss[loss=0.2205, simple_loss=0.28, pruned_loss=0.08048, over 21559.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3046, pruned_loss=0.08195, over 4249337.02 frames. 
], batch size: 441, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:16:25,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1140102.0, ans=0.0 2023-06-22 04:17:32,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1140342.0, ans=0.0 2023-06-22 04:17:50,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1140342.0, ans=0.0 2023-06-22 04:17:52,895 INFO [train.py:996] (0/4) Epoch 7, batch 7100, loss[loss=0.2441, simple_loss=0.3109, pruned_loss=0.08867, over 21355.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3097, pruned_loss=0.08381, over 4255162.23 frames. ], batch size: 131, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:18:32,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1140522.0, ans=0.2 2023-06-22 04:19:25,207 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.884e+02 3.183e+02 3.903e+02 6.649e+02, threshold=6.367e+02, percent-clipped=0.0 2023-06-22 04:19:30,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1140642.0, ans=0.2 2023-06-22 04:19:34,824 INFO [train.py:996] (0/4) Epoch 7, batch 7150, loss[loss=0.2609, simple_loss=0.3282, pruned_loss=0.09677, over 21609.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3054, pruned_loss=0.08068, over 4250857.96 frames. ], batch size: 263, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:19:49,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1140762.0, ans=0.125 2023-06-22 04:19:58,672 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.60 vs. limit=15.0 2023-06-22 04:20:34,361 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=22.5 2023-06-22 04:21:13,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1141002.0, ans=0.0 2023-06-22 04:21:14,911 INFO [train.py:996] (0/4) Epoch 7, batch 7200, loss[loss=0.2535, simple_loss=0.3083, pruned_loss=0.09939, over 21241.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3085, pruned_loss=0.08346, over 4251416.82 frames. ], batch size: 471, lr: 4.40e-03, grad_scale: 32.0 2023-06-22 04:21:24,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1141002.0, ans=0.05 2023-06-22 04:21:43,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.65 vs. 
limit=15.0 2023-06-22 04:21:47,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1141122.0, ans=0.02 2023-06-22 04:21:48,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1141122.0, ans=0.125 2023-06-22 04:21:55,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1141122.0, ans=0.125 2023-06-22 04:22:19,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1141182.0, ans=0.0 2023-06-22 04:22:43,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1141242.0, ans=0.125 2023-06-22 04:22:44,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 2.920e+02 3.463e+02 4.119e+02 7.524e+02, threshold=6.925e+02, percent-clipped=3.0 2023-06-22 04:22:51,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-22 04:22:53,911 INFO [train.py:996] (0/4) Epoch 7, batch 7250, loss[loss=0.2273, simple_loss=0.2871, pruned_loss=0.08377, over 21182.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.305, pruned_loss=0.08318, over 4250834.90 frames. ], batch size: 176, lr: 4.40e-03, grad_scale: 32.0 2023-06-22 04:23:42,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1141422.0, ans=0.5 2023-06-22 04:23:52,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-22 04:24:03,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1141482.0, ans=0.2 2023-06-22 04:24:12,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1141482.0, ans=0.125 2023-06-22 04:24:17,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1141542.0, ans=0.1 2023-06-22 04:24:32,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1141602.0, ans=0.2 2023-06-22 04:24:33,110 INFO [train.py:996] (0/4) Epoch 7, batch 7300, loss[loss=0.2157, simple_loss=0.2809, pruned_loss=0.07525, over 21814.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.299, pruned_loss=0.08189, over 4258208.81 frames. ], batch size: 352, lr: 4.40e-03, grad_scale: 32.0 2023-06-22 04:24:35,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1141602.0, ans=0.125 2023-06-22 04:24:41,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1141602.0, ans=0.2 2023-06-22 04:24:54,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1141662.0, ans=0.0 2023-06-22 04:24:55,042 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.61 vs. 
limit=15.0 2023-06-22 04:25:05,100 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:26:04,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.235e+02 2.850e+02 3.371e+02 4.063e+02 7.936e+02, threshold=6.743e+02, percent-clipped=2.0 2023-06-22 04:26:13,069 INFO [train.py:996] (0/4) Epoch 7, batch 7350, loss[loss=0.3134, simple_loss=0.3687, pruned_loss=0.129, over 21811.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.2985, pruned_loss=0.0825, over 4263549.63 frames. ], batch size: 118, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:26:51,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1142022.0, ans=0.125 2023-06-22 04:26:53,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1142022.0, ans=0.2 2023-06-22 04:27:22,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1142082.0, ans=0.125 2023-06-22 04:27:49,478 INFO [train.py:996] (0/4) Epoch 7, batch 7400, loss[loss=0.2258, simple_loss=0.3206, pruned_loss=0.0655, over 21848.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3041, pruned_loss=0.08375, over 4263833.31 frames. ], batch size: 317, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:27:59,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1142202.0, ans=0.0 2023-06-22 04:28:01,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1142202.0, ans=0.0 2023-06-22 04:28:05,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1142262.0, ans=0.0 2023-06-22 04:28:21,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1142262.0, ans=0.0 2023-06-22 04:28:30,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1142322.0, ans=0.125 2023-06-22 04:28:53,094 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:29:21,615 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.129e+02 3.591e+02 4.476e+02 8.193e+02, threshold=7.182e+02, percent-clipped=3.0 2023-06-22 04:29:25,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1142442.0, ans=0.0 2023-06-22 04:29:29,616 INFO [train.py:996] (0/4) Epoch 7, batch 7450, loss[loss=0.2293, simple_loss=0.291, pruned_loss=0.08374, over 21880.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3024, pruned_loss=0.08346, over 4260563.84 frames. ], batch size: 373, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:30:09,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1142562.0, ans=0.125 2023-06-22 04:30:25,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1142622.0, ans=0.1 2023-06-22 04:30:45,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.07 vs. 
limit=15.0 2023-06-22 04:30:49,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1142682.0, ans=0.125 2023-06-22 04:30:49,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1142682.0, ans=0.2 2023-06-22 04:31:10,695 INFO [train.py:996] (0/4) Epoch 7, batch 7500, loss[loss=0.3186, simple_loss=0.4101, pruned_loss=0.1135, over 21671.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3079, pruned_loss=0.08599, over 4261611.59 frames. ], batch size: 414, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:31:54,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1142862.0, ans=0.1 2023-06-22 04:31:59,671 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-22 04:32:21,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1142982.0, ans=0.04949747468305833 2023-06-22 04:32:43,209 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 3.393e+02 4.362e+02 5.679e+02 1.317e+03, threshold=8.723e+02, percent-clipped=9.0 2023-06-22 04:32:51,286 INFO [train.py:996] (0/4) Epoch 7, batch 7550, loss[loss=0.2304, simple_loss=0.3191, pruned_loss=0.07089, over 21725.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3146, pruned_loss=0.08376, over 4259097.84 frames. ], batch size: 247, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:33:11,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1143162.0, ans=0.0 2023-06-22 04:33:12,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1143162.0, ans=0.125 2023-06-22 04:33:58,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1143282.0, ans=0.0 2023-06-22 04:34:19,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1143342.0, ans=0.1 2023-06-22 04:34:30,119 INFO [train.py:996] (0/4) Epoch 7, batch 7600, loss[loss=0.196, simple_loss=0.2579, pruned_loss=0.06706, over 16783.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3146, pruned_loss=0.08295, over 4257632.22 frames. ], batch size: 60, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:34:36,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2023-06-22 04:35:07,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1143462.0, ans=0.0 2023-06-22 04:35:10,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1143522.0, ans=0.125 2023-06-22 04:35:56,526 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.783e+02 3.344e+02 4.108e+02 6.348e+02, threshold=6.687e+02, percent-clipped=0.0 2023-06-22 04:36:04,738 INFO [train.py:996] (0/4) Epoch 7, batch 7650, loss[loss=0.2233, simple_loss=0.3296, pruned_loss=0.05856, over 19885.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3136, pruned_loss=0.0847, over 4270778.64 frames. 
], batch size: 703, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:37:08,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-22 04:37:44,858 INFO [train.py:996] (0/4) Epoch 7, batch 7700, loss[loss=0.3242, simple_loss=0.3818, pruned_loss=0.1333, over 21789.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3183, pruned_loss=0.08827, over 4278030.05 frames. ], batch size: 441, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:37:47,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1144002.0, ans=0.125 2023-06-22 04:39:15,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1144242.0, ans=0.1 2023-06-22 04:39:20,950 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.234e+02 3.101e+02 3.626e+02 4.244e+02 7.117e+02, threshold=7.252e+02, percent-clipped=1.0 2023-06-22 04:39:33,828 INFO [train.py:996] (0/4) Epoch 7, batch 7750, loss[loss=0.1586, simple_loss=0.213, pruned_loss=0.05214, over 17188.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3228, pruned_loss=0.08747, over 4266137.24 frames. ], batch size: 62, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:39:48,299 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=22.5 2023-06-22 04:40:33,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1144482.0, ans=0.2 2023-06-22 04:41:20,041 INFO [train.py:996] (0/4) Epoch 7, batch 7800, loss[loss=0.2475, simple_loss=0.3267, pruned_loss=0.08415, over 21619.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3231, pruned_loss=0.08775, over 4258642.26 frames. ], batch size: 389, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:41:30,619 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-22 04:41:31,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1144602.0, ans=0.125 2023-06-22 04:41:54,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1144722.0, ans=0.125 2023-06-22 04:42:37,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.647e+02 3.499e+02 4.174e+02 5.699e+02 9.171e+02, threshold=8.349e+02, percent-clipped=6.0 2023-06-22 04:42:49,108 INFO [train.py:996] (0/4) Epoch 7, batch 7850, loss[loss=0.2194, simple_loss=0.2842, pruned_loss=0.07729, over 21294.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3159, pruned_loss=0.0869, over 4258342.58 frames. ], batch size: 211, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:43:03,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.89 vs. 
limit=10.0 2023-06-22 04:43:23,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1144962.0, ans=0.125 2023-06-22 04:43:46,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1145082.0, ans=0.125 2023-06-22 04:43:49,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1145082.0, ans=0.1 2023-06-22 04:43:54,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1145082.0, ans=0.125 2023-06-22 04:43:58,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1145082.0, ans=0.0 2023-06-22 04:44:23,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1145142.0, ans=0.0 2023-06-22 04:44:41,145 INFO [train.py:996] (0/4) Epoch 7, batch 7900, loss[loss=0.2602, simple_loss=0.3441, pruned_loss=0.08813, over 21698.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3111, pruned_loss=0.08596, over 4252471.68 frames. ], batch size: 298, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:45:30,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1145382.0, ans=0.125 2023-06-22 04:45:34,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1145382.0, ans=0.125 2023-06-22 04:46:17,648 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.334e+02 3.136e+02 3.570e+02 4.500e+02 9.857e+02, threshold=7.139e+02, percent-clipped=1.0 2023-06-22 04:46:23,943 INFO [train.py:996] (0/4) Epoch 7, batch 7950, loss[loss=0.2619, simple_loss=0.3283, pruned_loss=0.09779, over 20670.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3182, pruned_loss=0.08586, over 4260013.37 frames. ], batch size: 607, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:46:24,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-22 04:46:24,956 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-22 04:46:44,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.24 vs. limit=15.0 2023-06-22 04:46:58,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1145622.0, ans=0.125 2023-06-22 04:47:07,007 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-22 04:47:46,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1145682.0, ans=0.125 2023-06-22 04:47:49,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1145742.0, ans=0.0 2023-06-22 04:48:05,928 INFO [train.py:996] (0/4) Epoch 7, batch 8000, loss[loss=0.2084, simple_loss=0.3238, pruned_loss=0.04652, over 20850.00 frames. 
], tot_loss[loss=0.2489, simple_loss=0.3227, pruned_loss=0.08757, over 4263644.27 frames. ], batch size: 609, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:48:20,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1145802.0, ans=0.1 2023-06-22 04:48:59,505 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.13 vs. limit=12.0 2023-06-22 04:49:02,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1145922.0, ans=0.0 2023-06-22 04:49:26,923 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=15.0 2023-06-22 04:49:43,961 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.592e+02 3.352e+02 3.873e+02 5.175e+02 9.395e+02, threshold=7.746e+02, percent-clipped=4.0 2023-06-22 04:49:50,538 INFO [train.py:996] (0/4) Epoch 7, batch 8050, loss[loss=0.2678, simple_loss=0.3507, pruned_loss=0.09244, over 21873.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3283, pruned_loss=0.08882, over 4269601.29 frames. ], batch size: 372, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:50:00,296 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:50:49,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1146222.0, ans=0.125 2023-06-22 04:51:12,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1146282.0, ans=0.0 2023-06-22 04:51:31,153 INFO [train.py:996] (0/4) Epoch 7, batch 8100, loss[loss=0.2545, simple_loss=0.3186, pruned_loss=0.09517, over 21808.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3269, pruned_loss=0.08977, over 4276064.05 frames. ], batch size: 247, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:52:25,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1146522.0, ans=0.125 2023-06-22 04:52:30,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1146522.0, ans=0.0 2023-06-22 04:52:38,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1146522.0, ans=0.125 2023-06-22 04:52:54,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1146582.0, ans=0.5 2023-06-22 04:53:20,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.528e+02 3.325e+02 3.912e+02 5.287e+02 8.623e+02, threshold=7.823e+02, percent-clipped=4.0 2023-06-22 04:53:29,636 INFO [train.py:996] (0/4) Epoch 7, batch 8150, loss[loss=0.2678, simple_loss=0.3753, pruned_loss=0.0802, over 21583.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3344, pruned_loss=0.09183, over 4276669.61 frames. 
], batch size: 389, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:54:00,317 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:54:00,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1146762.0, ans=0.125 2023-06-22 04:54:15,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1146822.0, ans=0.0 2023-06-22 04:54:29,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-22 04:55:08,828 INFO [train.py:996] (0/4) Epoch 7, batch 8200, loss[loss=0.2024, simple_loss=0.2647, pruned_loss=0.07004, over 21647.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3267, pruned_loss=0.08853, over 4261666.12 frames. ], batch size: 247, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:55:24,181 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=15.0 2023-06-22 04:55:46,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1147122.0, ans=0.125 2023-06-22 04:56:17,304 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-22 04:56:31,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1147242.0, ans=0.125 2023-06-22 04:56:38,815 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 2.942e+02 3.673e+02 4.818e+02 8.671e+02, threshold=7.346e+02, percent-clipped=2.0 2023-06-22 04:56:48,466 INFO [train.py:996] (0/4) Epoch 7, batch 8250, loss[loss=0.2419, simple_loss=0.3376, pruned_loss=0.07306, over 21734.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3245, pruned_loss=0.0878, over 4267106.31 frames. ], batch size: 282, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:57:21,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1147362.0, ans=0.125 2023-06-22 04:57:27,078 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-22 04:57:29,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1147422.0, ans=0.125 2023-06-22 04:57:44,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1147482.0, ans=0.0 2023-06-22 04:58:28,876 INFO [train.py:996] (0/4) Epoch 7, batch 8300, loss[loss=0.2024, simple_loss=0.2775, pruned_loss=0.06369, over 21178.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.321, pruned_loss=0.08517, over 4267407.30 frames. ], batch size: 176, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:59:06,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.75 vs. 
limit=15.0 2023-06-22 04:59:30,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1147782.0, ans=0.0 2023-06-22 04:59:39,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1147782.0, ans=0.025 2023-06-22 04:59:39,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-22 05:00:04,378 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.891e+02 3.462e+02 4.321e+02 7.253e+02, threshold=6.923e+02, percent-clipped=0.0 2023-06-22 05:00:14,266 INFO [train.py:996] (0/4) Epoch 7, batch 8350, loss[loss=0.2159, simple_loss=0.2928, pruned_loss=0.06946, over 21870.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3202, pruned_loss=0.08347, over 4265339.91 frames. ], batch size: 373, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 05:00:25,288 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=15.0 2023-06-22 05:01:22,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1148142.0, ans=0.0 2023-06-22 05:01:24,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1148142.0, ans=0.0 2023-06-22 05:01:49,574 INFO [train.py:996] (0/4) Epoch 7, batch 8400, loss[loss=0.2403, simple_loss=0.3728, pruned_loss=0.05385, over 20810.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3164, pruned_loss=0.08016, over 4268110.67 frames. ], batch size: 607, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 05:01:56,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1148202.0, ans=0.1 2023-06-22 05:01:57,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.65 vs. limit=6.0 2023-06-22 05:02:19,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1148322.0, ans=0.125 2023-06-22 05:02:36,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1148382.0, ans=0.125 2023-06-22 05:03:21,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-22 05:03:23,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.841e+02 3.519e+02 4.158e+02 9.923e+02, threshold=7.039e+02, percent-clipped=4.0 2023-06-22 05:03:28,540 INFO [train.py:996] (0/4) Epoch 7, batch 8450, loss[loss=0.2303, simple_loss=0.3489, pruned_loss=0.05585, over 20873.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3156, pruned_loss=0.07975, over 4275262.64 frames. ], batch size: 607, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 05:03:34,342 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-22 05:03:37,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.50 vs. 
limit=15.0 2023-06-22 05:04:08,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=12.0 2023-06-22 05:05:07,100 INFO [train.py:996] (0/4) Epoch 7, batch 8500, loss[loss=0.2729, simple_loss=0.3187, pruned_loss=0.1135, over 21340.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.311, pruned_loss=0.08129, over 4276529.41 frames. ], batch size: 473, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:05:31,936 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-06-22 05:05:48,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1148922.0, ans=0.0 2023-06-22 05:06:43,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.101e+02 3.750e+02 4.685e+02 7.391e+02, threshold=7.500e+02, percent-clipped=2.0 2023-06-22 05:06:47,882 INFO [train.py:996] (0/4) Epoch 7, batch 8550, loss[loss=0.2386, simple_loss=0.3061, pruned_loss=0.0856, over 21577.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3136, pruned_loss=0.08374, over 4273706.13 frames. ], batch size: 263, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:07:14,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1149162.0, ans=0.0 2023-06-22 05:08:16,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1149342.0, ans=0.0 2023-06-22 05:08:29,795 INFO [train.py:996] (0/4) Epoch 7, batch 8600, loss[loss=0.2917, simple_loss=0.3649, pruned_loss=0.1093, over 21384.00 frames. ], tot_loss[loss=0.247, simple_loss=0.321, pruned_loss=0.08646, over 4267776.30 frames. ], batch size: 176, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:09:01,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1149522.0, ans=0.125 2023-06-22 05:09:12,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=15.0 2023-06-22 05:09:23,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0 2023-06-22 05:09:53,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1149582.0, ans=0.125 2023-06-22 05:10:01,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2023-06-22 05:10:06,561 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 3.158e+02 3.954e+02 4.821e+02 7.985e+02, threshold=7.909e+02, percent-clipped=1.0 2023-06-22 05:10:10,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1149702.0, ans=0.125 2023-06-22 05:10:11,802 INFO [train.py:996] (0/4) Epoch 7, batch 8650, loss[loss=0.1901, simple_loss=0.2841, pruned_loss=0.04808, over 21584.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.326, pruned_loss=0.08694, over 4267782.79 frames. 
], batch size: 230, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:10:20,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1149702.0, ans=0.0 2023-06-22 05:11:21,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1149882.0, ans=0.125 2023-06-22 05:11:43,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1149942.0, ans=0.0 2023-06-22 05:11:46,171 INFO [train.py:996] (0/4) Epoch 7, batch 8700, loss[loss=0.2197, simple_loss=0.2843, pruned_loss=0.07757, over 21868.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3174, pruned_loss=0.0832, over 4263766.59 frames. ], batch size: 373, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:11:47,411 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-22 05:12:13,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1150062.0, ans=0.0 2023-06-22 05:13:12,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1150242.0, ans=0.125 2023-06-22 05:13:21,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.826e+02 3.648e+02 4.622e+02 7.671e+02, threshold=7.296e+02, percent-clipped=0.0 2023-06-22 05:13:24,966 INFO [train.py:996] (0/4) Epoch 7, batch 8750, loss[loss=0.2359, simple_loss=0.2935, pruned_loss=0.08915, over 21671.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3148, pruned_loss=0.08363, over 4272787.03 frames. ], batch size: 230, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:13:56,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1150362.0, ans=0.025 2023-06-22 05:14:06,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1150422.0, ans=0.1 2023-06-22 05:14:22,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1150422.0, ans=0.0 2023-06-22 05:14:48,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-22 05:14:52,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1150542.0, ans=0.2 2023-06-22 05:14:53,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1150542.0, ans=0.0 2023-06-22 05:15:02,660 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-22 05:15:06,515 INFO [train.py:996] (0/4) Epoch 7, batch 8800, loss[loss=0.3423, simple_loss=0.3999, pruned_loss=0.1423, over 21835.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3246, pruned_loss=0.08708, over 4277816.17 frames. 
], batch size: 118, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:16:06,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1150722.0, ans=0.125 2023-06-22 05:16:12,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1150722.0, ans=0.125 2023-06-22 05:16:45,292 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 3.569e+02 4.667e+02 6.070e+02 1.023e+03, threshold=9.335e+02, percent-clipped=11.0 2023-06-22 05:16:47,139 INFO [train.py:996] (0/4) Epoch 7, batch 8850, loss[loss=0.253, simple_loss=0.3271, pruned_loss=0.08945, over 21710.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3307, pruned_loss=0.08946, over 4273606.02 frames. ], batch size: 124, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:16:47,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1150902.0, ans=0.1 2023-06-22 05:17:18,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.84 vs. limit=10.0 2023-06-22 05:17:35,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-22 05:18:33,256 INFO [train.py:996] (0/4) Epoch 7, batch 8900, loss[loss=0.2488, simple_loss=0.311, pruned_loss=0.09332, over 20012.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.324, pruned_loss=0.08761, over 4274355.07 frames. ], batch size: 702, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:19:59,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1151442.0, ans=0.125 2023-06-22 05:19:59,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1151442.0, ans=0.0 2023-06-22 05:20:00,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1151442.0, ans=0.2 2023-06-22 05:20:19,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.570e+02 3.272e+02 4.158e+02 4.869e+02 9.581e+02, threshold=8.315e+02, percent-clipped=1.0 2023-06-22 05:20:19,308 INFO [train.py:996] (0/4) Epoch 7, batch 8950, loss[loss=0.2154, simple_loss=0.2816, pruned_loss=0.07457, over 21630.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3222, pruned_loss=0.0862, over 4268542.04 frames. 
], batch size: 247, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:20:19,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1151502.0, ans=0.1 2023-06-22 05:20:45,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1151562.0, ans=0.125 2023-06-22 05:21:03,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1151622.0, ans=0.125 2023-06-22 05:21:06,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1151622.0, ans=0.125 2023-06-22 05:21:51,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1151742.0, ans=0.0 2023-06-22 05:21:58,750 INFO [train.py:996] (0/4) Epoch 7, batch 9000, loss[loss=0.2056, simple_loss=0.2552, pruned_loss=0.07795, over 21225.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3187, pruned_loss=0.08702, over 4273066.05 frames. ], batch size: 159, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:21:58,751 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 05:22:20,474 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2667, simple_loss=0.3612, pruned_loss=0.08614, over 1796401.00 frames. 2023-06-22 05:22:20,475 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24366MB 2023-06-22 05:22:41,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1151862.0, ans=0.125 2023-06-22 05:23:17,370 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-192000.pt 2023-06-22 05:23:37,280 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0 2023-06-22 05:23:57,241 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 2.838e+02 3.460e+02 4.383e+02 1.064e+03, threshold=6.920e+02, percent-clipped=1.0 2023-06-22 05:23:57,267 INFO [train.py:996] (0/4) Epoch 7, batch 9050, loss[loss=0.1892, simple_loss=0.2698, pruned_loss=0.05432, over 21286.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3148, pruned_loss=0.08419, over 4262409.79 frames. ], batch size: 551, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:23:59,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1152102.0, ans=0.035 2023-06-22 05:24:56,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1152282.0, ans=0.0 2023-06-22 05:25:27,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1152342.0, ans=0.0 2023-06-22 05:25:32,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1152342.0, ans=0.0 2023-06-22 05:25:38,188 INFO [train.py:996] (0/4) Epoch 7, batch 9100, loss[loss=0.217, simple_loss=0.3181, pruned_loss=0.05794, over 21741.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3201, pruned_loss=0.0865, over 4266714.32 frames. 
], batch size: 351, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:25:40,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1152402.0, ans=0.0 2023-06-22 05:25:50,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1152402.0, ans=0.2 2023-06-22 05:25:53,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1152462.0, ans=0.04949747468305833 2023-06-22 05:26:30,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1152522.0, ans=0.125 2023-06-22 05:27:05,189 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.60 vs. limit=10.0 2023-06-22 05:27:06,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1152642.0, ans=0.2 2023-06-22 05:27:10,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.90 vs. limit=22.5 2023-06-22 05:27:15,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1152642.0, ans=0.1 2023-06-22 05:27:18,312 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.226e+02 3.929e+02 4.790e+02 9.193e+02, threshold=7.858e+02, percent-clipped=7.0 2023-06-22 05:27:18,338 INFO [train.py:996] (0/4) Epoch 7, batch 9150, loss[loss=0.2438, simple_loss=0.3339, pruned_loss=0.07688, over 21767.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3244, pruned_loss=0.08429, over 4274737.86 frames. ], batch size: 298, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:27:19,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=22.5 2023-06-22 05:27:38,918 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=22.5 2023-06-22 05:27:46,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1152762.0, ans=0.125 2023-06-22 05:27:48,365 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:27:56,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1152822.0, ans=0.04949747468305833 2023-06-22 05:28:36,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1152882.0, ans=0.0 2023-06-22 05:28:45,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1152942.0, ans=0.2 2023-06-22 05:28:57,917 INFO [train.py:996] (0/4) Epoch 7, batch 9200, loss[loss=0.286, simple_loss=0.3561, pruned_loss=0.108, over 21773.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3247, pruned_loss=0.08312, over 4276659.92 frames. 
], batch size: 124, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:29:42,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1153122.0, ans=0.125 2023-06-22 05:30:21,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1153242.0, ans=0.1 2023-06-22 05:30:27,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=22.5 2023-06-22 05:30:38,533 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 3.168e+02 3.950e+02 4.666e+02 8.453e+02, threshold=7.900e+02, percent-clipped=2.0 2023-06-22 05:30:38,565 INFO [train.py:996] (0/4) Epoch 7, batch 9250, loss[loss=0.2922, simple_loss=0.3511, pruned_loss=0.1167, over 21706.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3264, pruned_loss=0.08554, over 4282259.76 frames. ], batch size: 298, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:30:39,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1153302.0, ans=0.2 2023-06-22 05:30:40,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1153302.0, ans=0.1 2023-06-22 05:31:05,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1153362.0, ans=0.0 2023-06-22 05:31:10,626 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-22 05:31:11,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1153362.0, ans=0.0 2023-06-22 05:31:13,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1153362.0, ans=0.5 2023-06-22 05:31:17,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1153362.0, ans=0.2 2023-06-22 05:32:24,627 INFO [train.py:996] (0/4) Epoch 7, batch 9300, loss[loss=0.2314, simple_loss=0.3259, pruned_loss=0.06843, over 20706.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3193, pruned_loss=0.08532, over 4276136.27 frames. ], batch size: 607, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:32:43,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1153602.0, ans=0.125 2023-06-22 05:32:45,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1153662.0, ans=0.125 2023-06-22 05:32:58,451 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:33:42,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1153782.0, ans=0.125 2023-06-22 05:34:11,594 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.246e+02 3.753e+02 4.890e+02 7.964e+02, threshold=7.506e+02, percent-clipped=1.0 2023-06-22 05:34:11,619 INFO [train.py:996] (0/4) Epoch 7, batch 9350, loss[loss=0.3199, simple_loss=0.3779, pruned_loss=0.131, over 21324.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3271, pruned_loss=0.08638, over 4278484.62 frames. 
], batch size: 507, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:34:33,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1153962.0, ans=0.125 2023-06-22 05:34:50,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1153962.0, ans=0.0 2023-06-22 05:34:50,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1153962.0, ans=0.125 2023-06-22 05:34:55,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1154022.0, ans=15.0 2023-06-22 05:35:31,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1154142.0, ans=0.125 2023-06-22 05:35:31,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1154142.0, ans=0.0 2023-06-22 05:35:42,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2023-06-22 05:35:46,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1154142.0, ans=0.125 2023-06-22 05:35:47,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-22 05:35:52,128 INFO [train.py:996] (0/4) Epoch 7, batch 9400, loss[loss=0.2452, simple_loss=0.3138, pruned_loss=0.0883, over 21478.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3314, pruned_loss=0.08703, over 4265143.87 frames. ], batch size: 389, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:36:33,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1154322.0, ans=0.07 2023-06-22 05:36:37,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1154322.0, ans=0.05 2023-06-22 05:36:52,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1154382.0, ans=0.125 2023-06-22 05:37:32,765 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.206e+02 3.638e+02 4.538e+02 8.694e+02, threshold=7.276e+02, percent-clipped=2.0 2023-06-22 05:37:32,790 INFO [train.py:996] (0/4) Epoch 7, batch 9450, loss[loss=0.2039, simple_loss=0.2635, pruned_loss=0.07215, over 21558.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3228, pruned_loss=0.08569, over 4265332.21 frames. 
], batch size: 263, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:37:44,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1154502.0, ans=0.1 2023-06-22 05:37:54,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1154562.0, ans=0.125 2023-06-22 05:37:57,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1154562.0, ans=0.125 2023-06-22 05:38:24,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1154622.0, ans=0.125 2023-06-22 05:38:30,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1154682.0, ans=0.125 2023-06-22 05:38:51,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1154742.0, ans=0.125 2023-06-22 05:39:11,644 INFO [train.py:996] (0/4) Epoch 7, batch 9500, loss[loss=0.2322, simple_loss=0.29, pruned_loss=0.08714, over 21493.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3131, pruned_loss=0.08397, over 4262935.46 frames. ], batch size: 391, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:39:28,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1154862.0, ans=0.125 2023-06-22 05:40:48,503 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-22 05:40:52,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.257e+02 3.921e+02 4.943e+02 1.018e+03, threshold=7.842e+02, percent-clipped=7.0 2023-06-22 05:40:52,044 INFO [train.py:996] (0/4) Epoch 7, batch 9550, loss[loss=0.2888, simple_loss=0.3532, pruned_loss=0.1122, over 21600.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3206, pruned_loss=0.08732, over 4263724.94 frames. ], batch size: 389, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:41:33,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1155222.0, ans=0.0 2023-06-22 05:41:41,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1155222.0, ans=0.1 2023-06-22 05:42:25,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1155402.0, ans=0.125 2023-06-22 05:42:26,566 INFO [train.py:996] (0/4) Epoch 7, batch 9600, loss[loss=0.2231, simple_loss=0.2923, pruned_loss=0.077, over 21519.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3222, pruned_loss=0.08851, over 4270585.86 frames. 
], batch size: 548, lr: 4.37e-03, grad_scale: 32.0 2023-06-22 05:42:28,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1155402.0, ans=0.025 2023-06-22 05:43:03,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1155462.0, ans=0.0 2023-06-22 05:43:37,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1155582.0, ans=0.02 2023-06-22 05:43:49,524 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-22 05:44:06,675 INFO [train.py:996] (0/4) Epoch 7, batch 9650, loss[loss=0.2842, simple_loss=0.3974, pruned_loss=0.0855, over 20837.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3208, pruned_loss=0.08793, over 4276084.02 frames. ], batch size: 608, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:44:08,115 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 3.176e+02 3.740e+02 4.596e+02 7.915e+02, threshold=7.479e+02, percent-clipped=1.0 2023-06-22 05:44:33,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1155762.0, ans=0.2 2023-06-22 05:44:59,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1155822.0, ans=0.2 2023-06-22 05:45:27,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1155882.0, ans=6.0 2023-06-22 05:45:51,593 INFO [train.py:996] (0/4) Epoch 7, batch 9700, loss[loss=0.2874, simple_loss=0.3471, pruned_loss=0.1138, over 21588.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3255, pruned_loss=0.08838, over 4270804.46 frames. ], batch size: 471, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:45:57,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1156002.0, ans=0.04949747468305833 2023-06-22 05:46:29,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1156122.0, ans=0.125 2023-06-22 05:47:07,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1156182.0, ans=0.1 2023-06-22 05:47:10,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1156242.0, ans=0.125 2023-06-22 05:47:14,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1156242.0, ans=0.0 2023-06-22 05:47:35,171 INFO [train.py:996] (0/4) Epoch 7, batch 9750, loss[loss=0.3054, simple_loss=0.3757, pruned_loss=0.1175, over 21888.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3193, pruned_loss=0.08737, over 4275330.00 frames. 
], batch size: 107, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:47:36,499 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.485e+02 3.111e+02 3.618e+02 4.143e+02 7.836e+02, threshold=7.236e+02, percent-clipped=1.0 2023-06-22 05:48:14,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1156422.0, ans=0.0 2023-06-22 05:48:38,077 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-22 05:48:47,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1156542.0, ans=0.1 2023-06-22 05:49:01,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1156542.0, ans=0.0 2023-06-22 05:49:08,238 INFO [train.py:996] (0/4) Epoch 7, batch 9800, loss[loss=0.2255, simple_loss=0.2994, pruned_loss=0.07577, over 21943.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3196, pruned_loss=0.08787, over 4258682.64 frames. ], batch size: 316, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:50:26,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.26 vs. limit=15.0 2023-06-22 05:50:42,014 INFO [train.py:996] (0/4) Epoch 7, batch 9850, loss[loss=0.2255, simple_loss=0.2838, pruned_loss=0.08355, over 21686.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3156, pruned_loss=0.08813, over 4260270.50 frames. ], batch size: 282, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:50:43,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 3.145e+02 3.713e+02 4.993e+02 9.640e+02, threshold=7.425e+02, percent-clipped=7.0 2023-06-22 05:51:24,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1157022.0, ans=0.2 2023-06-22 05:52:21,265 INFO [train.py:996] (0/4) Epoch 7, batch 9900, loss[loss=0.2717, simple_loss=0.3412, pruned_loss=0.1011, over 21464.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3118, pruned_loss=0.08768, over 4243879.76 frames. ], batch size: 131, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:52:29,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1157202.0, ans=0.125 2023-06-22 05:53:24,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1157382.0, ans=0.125 2023-06-22 05:53:49,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1157442.0, ans=0.125 2023-06-22 05:53:49,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1157442.0, ans=0.2 2023-06-22 05:53:54,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1157442.0, ans=0.125 2023-06-22 05:54:06,756 INFO [train.py:996] (0/4) Epoch 7, batch 9950, loss[loss=0.2582, simple_loss=0.3288, pruned_loss=0.09382, over 21561.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3135, pruned_loss=0.08942, over 4251937.02 frames. 
], batch size: 389, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:54:08,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.677e+02 3.173e+02 3.693e+02 4.396e+02 6.940e+02, threshold=7.386e+02, percent-clipped=0.0 2023-06-22 05:55:54,217 INFO [train.py:996] (0/4) Epoch 7, batch 10000, loss[loss=0.2402, simple_loss=0.2878, pruned_loss=0.09627, over 21556.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3093, pruned_loss=0.08758, over 4248339.97 frames. ], batch size: 263, lr: 4.37e-03, grad_scale: 32.0 2023-06-22 05:56:09,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1157862.0, ans=0.0 2023-06-22 05:56:12,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1157862.0, ans=0.125 2023-06-22 05:56:12,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1157862.0, ans=0.2 2023-06-22 05:56:51,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1157982.0, ans=0.0 2023-06-22 05:57:35,241 INFO [train.py:996] (0/4) Epoch 7, batch 10050, loss[loss=0.2558, simple_loss=0.327, pruned_loss=0.09226, over 20691.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3095, pruned_loss=0.08675, over 4258217.08 frames. ], batch size: 607, lr: 4.37e-03, grad_scale: 32.0 2023-06-22 05:57:36,648 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.289e+02 3.023e+02 3.455e+02 4.258e+02 6.801e+02, threshold=6.910e+02, percent-clipped=0.0 2023-06-22 05:57:43,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1158102.0, ans=0.0 2023-06-22 05:58:09,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1158162.0, ans=0.125 2023-06-22 05:58:19,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1158222.0, ans=0.0 2023-06-22 05:59:03,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1158342.0, ans=0.2 2023-06-22 05:59:09,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1158342.0, ans=0.125 2023-06-22 05:59:15,543 INFO [train.py:996] (0/4) Epoch 7, batch 10100, loss[loss=0.2495, simple_loss=0.3295, pruned_loss=0.08474, over 21685.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3067, pruned_loss=0.08459, over 4257693.08 frames. ], batch size: 389, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:59:27,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1158402.0, ans=0.125 2023-06-22 06:00:10,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1158522.0, ans=0.2 2023-06-22 06:00:55,660 INFO [train.py:996] (0/4) Epoch 7, batch 10150, loss[loss=0.241, simple_loss=0.3082, pruned_loss=0.08689, over 21784.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3127, pruned_loss=0.08656, over 4264153.69 frames. 
], batch size: 118, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 06:00:58,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 3.244e+02 3.861e+02 4.882e+02 7.298e+02, threshold=7.722e+02, percent-clipped=2.0 2023-06-22 06:01:01,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1158702.0, ans=0.0 2023-06-22 06:02:23,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1158942.0, ans=0.02 2023-06-22 06:02:35,519 INFO [train.py:996] (0/4) Epoch 7, batch 10200, loss[loss=0.2708, simple_loss=0.3471, pruned_loss=0.09724, over 21509.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3113, pruned_loss=0.08419, over 4266874.56 frames. ], batch size: 441, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 06:03:47,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1159182.0, ans=0.1 2023-06-22 06:04:02,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1159242.0, ans=0.025 2023-06-22 06:04:14,106 INFO [train.py:996] (0/4) Epoch 7, batch 10250, loss[loss=0.1756, simple_loss=0.2618, pruned_loss=0.04469, over 21464.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3054, pruned_loss=0.07809, over 4271936.86 frames. ], batch size: 212, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:04:17,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.655e+02 3.086e+02 4.104e+02 7.872e+02, threshold=6.172e+02, percent-clipped=2.0 2023-06-22 06:04:36,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1159362.0, ans=0.2 2023-06-22 06:04:59,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1159422.0, ans=0.0 2023-06-22 06:05:06,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.25 vs. limit=15.0 2023-06-22 06:05:38,133 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.37 vs. limit=10.0 2023-06-22 06:05:47,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1159542.0, ans=0.125 2023-06-22 06:06:01,476 INFO [train.py:996] (0/4) Epoch 7, batch 10300, loss[loss=0.2543, simple_loss=0.3329, pruned_loss=0.08788, over 21237.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3084, pruned_loss=0.07915, over 4266953.10 frames. 
], batch size: 159, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:06:07,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1159602.0, ans=0.125 2023-06-22 06:06:44,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1159722.0, ans=0.125 2023-06-22 06:06:49,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1159722.0, ans=0.2 2023-06-22 06:07:01,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1159722.0, ans=0.125 2023-06-22 06:07:12,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1159782.0, ans=0.0 2023-06-22 06:07:31,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1159842.0, ans=0.0 2023-06-22 06:07:44,249 INFO [train.py:996] (0/4) Epoch 7, batch 10350, loss[loss=0.2266, simple_loss=0.3109, pruned_loss=0.07111, over 19973.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3105, pruned_loss=0.07994, over 4261214.32 frames. ], batch size: 703, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:07:47,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.368e+02 3.957e+02 4.921e+02 8.307e+02, threshold=7.914e+02, percent-clipped=7.0 2023-06-22 06:08:40,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1160022.0, ans=0.125 2023-06-22 06:08:51,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1160082.0, ans=0.0 2023-06-22 06:09:01,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1160082.0, ans=0.0 2023-06-22 06:09:03,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1160082.0, ans=0.125 2023-06-22 06:09:31,098 INFO [train.py:996] (0/4) Epoch 7, batch 10400, loss[loss=0.2943, simple_loss=0.3587, pruned_loss=0.1149, over 21517.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3062, pruned_loss=0.07975, over 4261725.00 frames. ], batch size: 509, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:09:44,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1160202.0, ans=0.125 2023-06-22 06:11:13,839 INFO [train.py:996] (0/4) Epoch 7, batch 10450, loss[loss=0.305, simple_loss=0.364, pruned_loss=0.123, over 21334.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3098, pruned_loss=0.08138, over 4259134.35 frames. ], batch size: 549, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:11:16,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.616e+02 3.263e+02 3.725e+02 4.769e+02 8.321e+02, threshold=7.450e+02, percent-clipped=2.0 2023-06-22 06:12:10,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1160622.0, ans=0.2 2023-06-22 06:12:58,247 INFO [train.py:996] (0/4) Epoch 7, batch 10500, loss[loss=0.2048, simple_loss=0.2872, pruned_loss=0.0612, over 21764.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.312, pruned_loss=0.08126, over 4255743.91 frames. 
], batch size: 351, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:13:29,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1160862.0, ans=0.125 2023-06-22 06:13:32,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=10.0 2023-06-22 06:14:37,830 INFO [train.py:996] (0/4) Epoch 7, batch 10550, loss[loss=0.2024, simple_loss=0.2699, pruned_loss=0.06743, over 21650.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3061, pruned_loss=0.08124, over 4259199.55 frames. ], batch size: 333, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:14:40,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.300e+02 2.934e+02 3.554e+02 4.294e+02 7.411e+02, threshold=7.109e+02, percent-clipped=0.0 2023-06-22 06:15:11,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1161162.0, ans=0.2 2023-06-22 06:15:18,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1161222.0, ans=0.0 2023-06-22 06:15:42,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1161282.0, ans=0.0 2023-06-22 06:15:50,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1161282.0, ans=0.0 2023-06-22 06:16:14,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1161342.0, ans=0.125 2023-06-22 06:16:19,500 INFO [train.py:996] (0/4) Epoch 7, batch 10600, loss[loss=0.2004, simple_loss=0.2738, pruned_loss=0.06344, over 21784.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3022, pruned_loss=0.08008, over 4255412.78 frames. ], batch size: 351, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:16:53,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1161462.0, ans=0.125 2023-06-22 06:17:08,831 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-06-22 06:17:55,848 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2023-06-22 06:18:06,272 INFO [train.py:996] (0/4) Epoch 7, batch 10650, loss[loss=0.2757, simple_loss=0.362, pruned_loss=0.09473, over 21548.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3051, pruned_loss=0.07906, over 4256933.06 frames. ], batch size: 471, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:18:11,063 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.321e+02 3.046e+02 3.763e+02 4.720e+02 8.386e+02, threshold=7.526e+02, percent-clipped=4.0 2023-06-22 06:18:20,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.51 vs. 
limit=10.0 2023-06-22 06:18:39,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1161762.0, ans=0.1 2023-06-22 06:19:07,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-22 06:19:25,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1161942.0, ans=0.0 2023-06-22 06:19:47,777 INFO [train.py:996] (0/4) Epoch 7, batch 10700, loss[loss=0.3436, simple_loss=0.3875, pruned_loss=0.1499, over 21327.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3063, pruned_loss=0.07977, over 4251995.79 frames. ], batch size: 507, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:20:18,590 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-22 06:20:22,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1162062.0, ans=0.125 2023-06-22 06:20:25,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2023-06-22 06:20:26,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1162122.0, ans=0.125 2023-06-22 06:20:41,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1162122.0, ans=0.125 2023-06-22 06:21:16,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1162242.0, ans=0.125 2023-06-22 06:21:30,737 INFO [train.py:996] (0/4) Epoch 7, batch 10750, loss[loss=0.2589, simple_loss=0.3348, pruned_loss=0.09146, over 21320.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3178, pruned_loss=0.08492, over 4252377.51 frames. ], batch size: 176, lr: 4.36e-03, grad_scale: 8.0 2023-06-22 06:21:31,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.82 vs. limit=15.0 2023-06-22 06:21:42,681 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 3.648e+02 4.416e+02 6.142e+02 1.061e+03, threshold=8.833e+02, percent-clipped=11.0 2023-06-22 06:21:48,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-22 06:21:52,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1162362.0, ans=0.125 2023-06-22 06:22:05,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1162362.0, ans=0.125 2023-06-22 06:22:10,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1162362.0, ans=0.125 2023-06-22 06:22:35,525 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.59 vs. 
limit=15.0 2023-06-22 06:22:59,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1162542.0, ans=0.125 2023-06-22 06:23:01,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.62 vs. limit=10.0 2023-06-22 06:23:17,176 INFO [train.py:996] (0/4) Epoch 7, batch 10800, loss[loss=0.257, simple_loss=0.3342, pruned_loss=0.08988, over 21811.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3201, pruned_loss=0.08406, over 4258396.34 frames. ], batch size: 282, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:23:23,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-22 06:23:57,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1162722.0, ans=0.125 2023-06-22 06:24:18,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1162782.0, ans=0.125 2023-06-22 06:24:23,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1162782.0, ans=0.0 2023-06-22 06:24:35,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1162842.0, ans=0.02 2023-06-22 06:24:36,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-22 06:24:44,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-06-22 06:24:56,616 INFO [train.py:996] (0/4) Epoch 7, batch 10850, loss[loss=0.1935, simple_loss=0.2548, pruned_loss=0.06607, over 20776.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3204, pruned_loss=0.08431, over 4262006.28 frames. ], batch size: 609, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:25:07,915 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.182e+02 4.104e+02 5.003e+02 8.249e+02, threshold=8.208e+02, percent-clipped=0.0 2023-06-22 06:25:40,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1163022.0, ans=0.125 2023-06-22 06:25:41,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1163022.0, ans=0.125 2023-06-22 06:25:52,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.95 vs. limit=10.0 2023-06-22 06:26:00,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. 
limit=22.5 2023-06-22 06:26:03,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1163082.0, ans=0.1 2023-06-22 06:26:08,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1163082.0, ans=0.125 2023-06-22 06:26:11,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1163082.0, ans=0.125 2023-06-22 06:26:41,645 INFO [train.py:996] (0/4) Epoch 7, batch 10900, loss[loss=0.2238, simple_loss=0.3003, pruned_loss=0.07363, over 21222.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3154, pruned_loss=0.08285, over 4247320.25 frames. ], batch size: 176, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:27:27,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1163322.0, ans=0.0 2023-06-22 06:27:27,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0 2023-06-22 06:27:38,584 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0 2023-06-22 06:27:42,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1163382.0, ans=0.0 2023-06-22 06:28:16,279 INFO [train.py:996] (0/4) Epoch 7, batch 10950, loss[loss=0.2086, simple_loss=0.2738, pruned_loss=0.0717, over 21708.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.312, pruned_loss=0.08111, over 4237348.13 frames. ], batch size: 300, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:28:18,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1163502.0, ans=0.0 2023-06-22 06:28:19,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1163502.0, ans=0.2 2023-06-22 06:28:27,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.461e+02 3.282e+02 3.918e+02 4.735e+02 6.803e+02, threshold=7.835e+02, percent-clipped=0.0 2023-06-22 06:29:14,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-22 06:29:17,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1163682.0, ans=0.04949747468305833 2023-06-22 06:29:24,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-06-22 06:29:41,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1163742.0, ans=0.125 2023-06-22 06:29:50,135 INFO [train.py:996] (0/4) Epoch 7, batch 11000, loss[loss=0.2617, simple_loss=0.32, pruned_loss=0.1017, over 21951.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3106, pruned_loss=0.08149, over 4246436.56 frames. ], batch size: 316, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:29:51,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.09 vs. 
limit=15.0 2023-06-22 06:30:08,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-22 06:30:48,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1163922.0, ans=0.125 2023-06-22 06:31:15,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1164042.0, ans=0.125 2023-06-22 06:31:18,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1164042.0, ans=0.1 2023-06-22 06:31:30,333 INFO [train.py:996] (0/4) Epoch 7, batch 11050, loss[loss=0.2342, simple_loss=0.3066, pruned_loss=0.08093, over 20650.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3085, pruned_loss=0.08342, over 4248515.13 frames. ], batch size: 607, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:31:40,920 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.144e+02 3.658e+02 4.347e+02 7.948e+02, threshold=7.316e+02, percent-clipped=1.0 2023-06-22 06:31:47,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1164102.0, ans=0.0 2023-06-22 06:31:55,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-22 06:31:56,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1164162.0, ans=0.0 2023-06-22 06:32:06,804 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:32:11,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-22 06:32:17,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1164222.0, ans=0.0 2023-06-22 06:32:26,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1164282.0, ans=0.0 2023-06-22 06:32:39,698 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-22 06:32:49,575 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-22 06:33:00,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1164342.0, ans=0.035 2023-06-22 06:33:02,563 INFO [train.py:996] (0/4) Epoch 7, batch 11100, loss[loss=0.2007, simple_loss=0.2662, pruned_loss=0.06759, over 15366.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3066, pruned_loss=0.08363, over 4248184.15 frames. 
], batch size: 60, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:33:04,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1164402.0, ans=0.0 2023-06-22 06:34:35,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1164642.0, ans=0.0 2023-06-22 06:34:42,431 INFO [train.py:996] (0/4) Epoch 7, batch 11150, loss[loss=0.2225, simple_loss=0.3191, pruned_loss=0.06294, over 21778.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3053, pruned_loss=0.08233, over 4241366.94 frames. ], batch size: 282, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:34:48,752 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.401e+02 2.924e+02 3.298e+02 3.958e+02 6.309e+02, threshold=6.596e+02, percent-clipped=0.0 2023-06-22 06:35:07,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1164762.0, ans=0.1 2023-06-22 06:35:09,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1164762.0, ans=0.04949747468305833 2023-06-22 06:36:03,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1164942.0, ans=0.125 2023-06-22 06:36:10,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-22 06:36:16,939 INFO [train.py:996] (0/4) Epoch 7, batch 11200, loss[loss=0.2276, simple_loss=0.2784, pruned_loss=0.08845, over 21615.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3034, pruned_loss=0.08216, over 4247041.28 frames. ], batch size: 247, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:36:55,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1165062.0, ans=0.125 2023-06-22 06:36:58,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.25 vs. limit=12.0 2023-06-22 06:37:16,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1165182.0, ans=0.2 2023-06-22 06:37:35,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1165242.0, ans=0.0 2023-06-22 06:37:52,333 INFO [train.py:996] (0/4) Epoch 7, batch 11250, loss[loss=0.2371, simple_loss=0.3228, pruned_loss=0.07573, over 21626.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3025, pruned_loss=0.08207, over 4247089.24 frames. ], batch size: 230, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:37:58,489 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 2.901e+02 3.332e+02 3.824e+02 5.999e+02, threshold=6.664e+02, percent-clipped=0.0 2023-06-22 06:38:48,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1165422.0, ans=0.125 2023-06-22 06:39:17,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1165542.0, ans=0.07 2023-06-22 06:39:22,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. 
limit=15.0 2023-06-22 06:39:31,712 INFO [train.py:996] (0/4) Epoch 7, batch 11300, loss[loss=0.231, simple_loss=0.2936, pruned_loss=0.08415, over 21786.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3041, pruned_loss=0.08291, over 4251957.77 frames. ], batch size: 247, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:39:35,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1165602.0, ans=0.07 2023-06-22 06:40:25,453 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-22 06:40:33,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-22 06:41:03,685 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-22 06:41:11,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1165902.0, ans=0.2 2023-06-22 06:41:12,131 INFO [train.py:996] (0/4) Epoch 7, batch 11350, loss[loss=0.2509, simple_loss=0.3247, pruned_loss=0.08852, over 21427.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3052, pruned_loss=0.08182, over 4258071.95 frames. ], batch size: 194, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:41:23,499 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 2.937e+02 3.595e+02 4.319e+02 9.423e+02, threshold=7.190e+02, percent-clipped=3.0 2023-06-22 06:41:32,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1165962.0, ans=0.0 2023-06-22 06:41:34,399 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.39 vs. limit=15.0 2023-06-22 06:42:39,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1166142.0, ans=0.125 2023-06-22 06:42:43,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1166142.0, ans=15.0 2023-06-22 06:42:59,026 INFO [train.py:996] (0/4) Epoch 7, batch 11400, loss[loss=0.24, simple_loss=0.3185, pruned_loss=0.08072, over 21610.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3115, pruned_loss=0.08474, over 4264399.37 frames. ], batch size: 230, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:44:09,543 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:44:21,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1166442.0, ans=0.0 2023-06-22 06:44:37,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1166442.0, ans=0.025 2023-06-22 06:44:40,315 INFO [train.py:996] (0/4) Epoch 7, batch 11450, loss[loss=0.2172, simple_loss=0.2965, pruned_loss=0.06898, over 21865.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.31, pruned_loss=0.08248, over 4265894.06 frames. 
], batch size: 316, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:44:52,088 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.083e+02 3.885e+02 5.108e+02 7.985e+02, threshold=7.771e+02, percent-clipped=2.0 2023-06-22 06:45:02,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1166502.0, ans=0.125 2023-06-22 06:45:07,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.05 vs. limit=6.0 2023-06-22 06:45:36,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1166622.0, ans=0.125 2023-06-22 06:46:21,797 INFO [train.py:996] (0/4) Epoch 7, batch 11500, loss[loss=0.1944, simple_loss=0.2927, pruned_loss=0.04812, over 21780.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3149, pruned_loss=0.08417, over 4269027.33 frames. ], batch size: 282, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:46:30,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1166802.0, ans=0.125 2023-06-22 06:46:46,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1166862.0, ans=0.0 2023-06-22 06:47:04,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=15.0 2023-06-22 06:48:14,277 INFO [train.py:996] (0/4) Epoch 7, batch 11550, loss[loss=0.2522, simple_loss=0.3481, pruned_loss=0.07814, over 21884.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3202, pruned_loss=0.08364, over 4267384.94 frames. ], batch size: 316, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:48:21,204 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.086e+02 3.744e+02 4.289e+02 8.491e+02, threshold=7.488e+02, percent-clipped=1.0 2023-06-22 06:48:31,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1167162.0, ans=0.0 2023-06-22 06:48:52,776 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=22.5 2023-06-22 06:49:05,567 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:49:11,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1167222.0, ans=0.035 2023-06-22 06:49:56,245 INFO [train.py:996] (0/4) Epoch 7, batch 11600, loss[loss=0.264, simple_loss=0.3626, pruned_loss=0.08272, over 21798.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3354, pruned_loss=0.08597, over 4273612.89 frames. ], batch size: 282, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:50:05,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1167402.0, ans=0.0 2023-06-22 06:50:58,393 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=12.0 2023-06-22 06:51:37,748 INFO [train.py:996] (0/4) Epoch 7, batch 11650, loss[loss=0.2572, simple_loss=0.3289, pruned_loss=0.09277, over 21374.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3421, pruned_loss=0.08675, over 4269501.63 frames. 
], batch size: 194, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:51:52,653 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.531e+02 4.483e+02 5.705e+02 9.764e+02, threshold=8.966e+02, percent-clipped=9.0 2023-06-22 06:51:58,452 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=12.0 2023-06-22 06:52:30,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1167822.0, ans=0.125 2023-06-22 06:52:56,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1167882.0, ans=0.2 2023-06-22 06:53:18,240 INFO [train.py:996] (0/4) Epoch 7, batch 11700, loss[loss=0.2163, simple_loss=0.2789, pruned_loss=0.0769, over 21605.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3342, pruned_loss=0.08707, over 4265211.31 frames. ], batch size: 298, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:53:45,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=12.0 2023-06-22 06:54:08,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1168122.0, ans=0.125 2023-06-22 06:54:27,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1168182.0, ans=0.2 2023-06-22 06:54:51,706 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-22 06:54:56,775 INFO [train.py:996] (0/4) Epoch 7, batch 11750, loss[loss=0.2496, simple_loss=0.3229, pruned_loss=0.08814, over 21811.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3249, pruned_loss=0.08673, over 4263165.92 frames. ], batch size: 372, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:55:11,708 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.565e+02 3.104e+02 3.664e+02 4.523e+02 8.929e+02, threshold=7.328e+02, percent-clipped=0.0 2023-06-22 06:55:13,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1168302.0, ans=0.1 2023-06-22 06:55:21,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1168362.0, ans=0.125 2023-06-22 06:55:23,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1168362.0, ans=0.05 2023-06-22 06:55:27,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-22 06:55:44,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1168422.0, ans=0.125 2023-06-22 06:56:01,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1168482.0, ans=0.05 2023-06-22 06:56:23,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1168542.0, ans=0.0 2023-06-22 06:56:44,783 INFO [train.py:996] (0/4) Epoch 7, batch 11800, loss[loss=0.2751, simple_loss=0.3575, pruned_loss=0.09637, over 21421.00 frames. 
], tot_loss[loss=0.2518, simple_loss=0.3258, pruned_loss=0.08889, over 4258817.47 frames. ], batch size: 471, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:57:42,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1168722.0, ans=0.0 2023-06-22 06:58:04,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1168842.0, ans=0.125 2023-06-22 06:58:25,539 INFO [train.py:996] (0/4) Epoch 7, batch 11850, loss[loss=0.2956, simple_loss=0.4079, pruned_loss=0.09168, over 19785.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3263, pruned_loss=0.08735, over 4259252.48 frames. ], batch size: 702, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:58:39,976 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.224e+02 3.745e+02 4.482e+02 9.714e+02, threshold=7.491e+02, percent-clipped=2.0 2023-06-22 06:58:59,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1168962.0, ans=0.0 2023-06-22 07:00:12,057 INFO [train.py:996] (0/4) Epoch 7, batch 11900, loss[loss=0.2041, simple_loss=0.2941, pruned_loss=0.05706, over 21836.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3274, pruned_loss=0.08525, over 4258778.80 frames. ], batch size: 316, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 07:00:18,327 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=12.0 2023-06-22 07:00:37,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1169262.0, ans=0.2 2023-06-22 07:00:58,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1169322.0, ans=0.2 2023-06-22 07:01:01,020 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=22.5 2023-06-22 07:01:07,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1169382.0, ans=0.0 2023-06-22 07:01:44,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1169442.0, ans=0.2 2023-06-22 07:01:48,758 INFO [train.py:996] (0/4) Epoch 7, batch 11950, loss[loss=0.2798, simple_loss=0.3659, pruned_loss=0.09687, over 21448.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3278, pruned_loss=0.08195, over 4264703.80 frames. ], batch size: 507, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 07:01:58,130 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 3.043e+02 3.599e+02 4.818e+02 9.282e+02, threshold=7.198e+02, percent-clipped=3.0 2023-06-22 07:02:12,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1169562.0, ans=0.1 2023-06-22 07:02:28,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1169622.0, ans=0.5 2023-06-22 07:03:03,180 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-22 07:03:27,379 INFO [train.py:996] (0/4) Epoch 7, batch 12000, loss[loss=0.2355, simple_loss=0.2871, pruned_loss=0.09191, over 21202.00 frames. 
], tot_loss[loss=0.2403, simple_loss=0.321, pruned_loss=0.07983, over 4243147.78 frames. ], batch size: 144, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:03:27,380 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 07:03:43,843 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2652, simple_loss=0.3601, pruned_loss=0.08515, over 1796401.00 frames. 2023-06-22 07:03:43,844 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-22 07:04:31,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1169922.0, ans=0.1 2023-06-22 07:04:35,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.70 vs. limit=12.0 2023-06-22 07:04:39,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1169982.0, ans=0.125 2023-06-22 07:05:23,296 INFO [train.py:996] (0/4) Epoch 7, batch 12050, loss[loss=0.3024, simple_loss=0.3582, pruned_loss=0.1233, over 21892.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3156, pruned_loss=0.08127, over 4254501.89 frames. ], batch size: 118, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:05:37,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.333e+02 3.086e+02 3.580e+02 4.845e+02 1.189e+03, threshold=7.160e+02, percent-clipped=3.0 2023-06-22 07:05:39,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1170102.0, ans=0.125 2023-06-22 07:05:58,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1170162.0, ans=0.5 2023-06-22 07:05:58,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1170162.0, ans=0.2 2023-06-22 07:06:39,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1170282.0, ans=0.0 2023-06-22 07:07:09,360 INFO [train.py:996] (0/4) Epoch 7, batch 12100, loss[loss=0.2692, simple_loss=0.3352, pruned_loss=0.1016, over 21756.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3211, pruned_loss=0.08646, over 4262091.79 frames. ], batch size: 351, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:07:28,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1170402.0, ans=0.125 2023-06-22 07:07:38,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1170462.0, ans=0.0 2023-06-22 07:07:43,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1170462.0, ans=0.2 2023-06-22 07:08:22,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-22 07:08:57,260 INFO [train.py:996] (0/4) Epoch 7, batch 12150, loss[loss=0.2193, simple_loss=0.3076, pruned_loss=0.06548, over 21729.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3241, pruned_loss=0.08531, over 4267723.44 frames. 
], batch size: 247, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:09:07,116 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.401e+02 4.092e+02 5.164e+02 8.690e+02, threshold=8.185e+02, percent-clipped=4.0 2023-06-22 07:10:20,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1170942.0, ans=0.125 2023-06-22 07:10:35,922 INFO [train.py:996] (0/4) Epoch 7, batch 12200, loss[loss=0.2433, simple_loss=0.2963, pruned_loss=0.09517, over 21222.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3195, pruned_loss=0.08468, over 4262813.35 frames. ], batch size: 160, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:10:58,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-22 07:12:00,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1171242.0, ans=0.2 2023-06-22 07:12:13,471 INFO [train.py:996] (0/4) Epoch 7, batch 12250, loss[loss=0.1959, simple_loss=0.283, pruned_loss=0.05443, over 21678.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3124, pruned_loss=0.08177, over 4271083.66 frames. ], batch size: 391, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:12:16,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1171302.0, ans=0.125 2023-06-22 07:12:24,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 3.345e+02 4.478e+02 6.132e+02 1.246e+03, threshold=8.957e+02, percent-clipped=10.0 2023-06-22 07:13:08,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1171422.0, ans=0.125 2023-06-22 07:13:43,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1171542.0, ans=0.0 2023-06-22 07:13:52,904 INFO [train.py:996] (0/4) Epoch 7, batch 12300, loss[loss=0.2937, simple_loss=0.3789, pruned_loss=0.1042, over 21695.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3048, pruned_loss=0.07613, over 4276158.99 frames. ], batch size: 414, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:14:37,001 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-22 07:14:50,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1171782.0, ans=0.125 2023-06-22 07:15:26,388 INFO [train.py:996] (0/4) Epoch 7, batch 12350, loss[loss=0.2513, simple_loss=0.3202, pruned_loss=0.09119, over 20822.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.312, pruned_loss=0.07816, over 4277735.97 frames. 
], batch size: 608, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:15:34,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1171902.0, ans=0.0 2023-06-22 07:15:37,083 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.594e+02 3.277e+02 4.549e+02 8.356e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-22 07:15:37,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1171902.0, ans=0.125 2023-06-22 07:16:03,872 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.62 vs. limit=15.0 2023-06-22 07:16:22,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1172082.0, ans=0.125 2023-06-22 07:17:05,126 INFO [train.py:996] (0/4) Epoch 7, batch 12400, loss[loss=0.245, simple_loss=0.3053, pruned_loss=0.09233, over 21804.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3153, pruned_loss=0.08182, over 4282955.01 frames. ], batch size: 247, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:17:19,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1172262.0, ans=0.0 2023-06-22 07:17:48,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1172322.0, ans=0.125 2023-06-22 07:18:35,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-22 07:18:44,326 INFO [train.py:996] (0/4) Epoch 7, batch 12450, loss[loss=0.2922, simple_loss=0.354, pruned_loss=0.1152, over 21256.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3188, pruned_loss=0.08511, over 4284881.51 frames. ], batch size: 143, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:18:49,023 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=22.5 2023-06-22 07:18:50,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1172502.0, ans=0.125 2023-06-22 07:19:01,079 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.229e+02 3.781e+02 4.439e+02 8.175e+02, threshold=7.562e+02, percent-clipped=5.0 2023-06-22 07:19:41,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1172622.0, ans=0.125 2023-06-22 07:20:25,297 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.17 vs. limit=12.0 2023-06-22 07:20:32,536 INFO [train.py:996] (0/4) Epoch 7, batch 12500, loss[loss=0.3034, simple_loss=0.3794, pruned_loss=0.1137, over 21231.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3293, pruned_loss=0.08833, over 4285954.68 frames. 
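The frequent ScheduledFloat lines above report hyper-parameters (dropout probabilities, skip rates, balancer limits) whose current value, ans, is a function of the global batch_count. A minimal way to reproduce that behaviour is a piecewise-linear schedule keyed on batch count; the class name, breakpoints, and example values below are illustrative assumptions, not taken from the code that wrote this log.

```python
import bisect

class PiecewiseLinearSchedule:
    """Sketch of a batch-count-dependent float, in the spirit of the
    'ScheduledFloat ... batch_count=..., ans=...' lines above."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count
        self.xs = [float(x) for x, _ in points]
        self.ys = [float(y) for _, y in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# Hypothetical dropout schedule: larger early in training, annealed later.
dropout_p = PiecewiseLinearSchedule((0, 0.3), (20_000, 0.1))
print(dropout_p(1_170_000))  # -> 0.1, the fully annealed value seen at this batch count
```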
], batch size: 143, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:21:44,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1172982.0, ans=0.0 2023-06-22 07:21:54,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1173042.0, ans=0.2 2023-06-22 07:22:10,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1173042.0, ans=0.0 2023-06-22 07:22:14,672 INFO [train.py:996] (0/4) Epoch 7, batch 12550, loss[loss=0.2434, simple_loss=0.3238, pruned_loss=0.0815, over 21609.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3328, pruned_loss=0.09028, over 4282009.97 frames. ], batch size: 389, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:22:31,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1173102.0, ans=0.0 2023-06-22 07:22:31,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1173102.0, ans=0.1 2023-06-22 07:22:32,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.611e+02 3.175e+02 3.622e+02 4.685e+02 7.876e+02, threshold=7.244e+02, percent-clipped=1.0 2023-06-22 07:24:00,213 INFO [train.py:996] (0/4) Epoch 7, batch 12600, loss[loss=0.1935, simple_loss=0.2831, pruned_loss=0.05201, over 21613.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3308, pruned_loss=0.08814, over 4267295.63 frames. ], batch size: 230, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:24:25,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1173462.0, ans=0.1 2023-06-22 07:24:43,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1173522.0, ans=0.0 2023-06-22 07:24:46,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1173522.0, ans=0.125 2023-06-22 07:25:38,779 INFO [train.py:996] (0/4) Epoch 7, batch 12650, loss[loss=0.2511, simple_loss=0.3131, pruned_loss=0.09452, over 21335.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3219, pruned_loss=0.08348, over 4269598.23 frames. ], batch size: 159, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:25:51,198 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.154e+02 3.639e+02 4.446e+02 1.064e+03, threshold=7.278e+02, percent-clipped=5.0 2023-06-22 07:26:00,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1173762.0, ans=0.2 2023-06-22 07:26:04,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1173762.0, ans=0.0 2023-06-22 07:26:08,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1173762.0, ans=0.1 2023-06-22 07:26:13,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-22 07:26:27,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.61 vs. 
limit=15.0 2023-06-22 07:26:49,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1173882.0, ans=0.2 2023-06-22 07:26:53,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=12.0 2023-06-22 07:27:04,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1173942.0, ans=0.125 2023-06-22 07:27:13,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1173942.0, ans=0.0 2023-06-22 07:27:19,819 INFO [train.py:996] (0/4) Epoch 7, batch 12700, loss[loss=0.2392, simple_loss=0.3077, pruned_loss=0.08534, over 21124.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3207, pruned_loss=0.08616, over 4271369.90 frames. ], batch size: 608, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:27:32,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-22 07:28:22,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1174182.0, ans=0.125 2023-06-22 07:28:49,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1174242.0, ans=0.125 2023-06-22 07:29:00,203 INFO [train.py:996] (0/4) Epoch 7, batch 12750, loss[loss=0.2242, simple_loss=0.3132, pruned_loss=0.0676, over 21684.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3216, pruned_loss=0.08633, over 4267972.58 frames. ], batch size: 389, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:29:17,970 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.566e+02 3.268e+02 3.639e+02 4.556e+02 7.416e+02, threshold=7.278e+02, percent-clipped=1.0 2023-06-22 07:29:21,322 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:29:27,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1174362.0, ans=0.125 2023-06-22 07:29:55,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=15.0 2023-06-22 07:30:03,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1174482.0, ans=0.2 2023-06-22 07:30:30,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1174542.0, ans=0.0 2023-06-22 07:30:31,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1174542.0, ans=0.2 2023-06-22 07:30:44,412 INFO [train.py:996] (0/4) Epoch 7, batch 12800, loss[loss=0.2319, simple_loss=0.306, pruned_loss=0.07892, over 21230.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3209, pruned_loss=0.08642, over 4267043.14 frames. 
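The Whitening lines above compare a metric of how far a layer's activations are from having a white (isotropic) covariance against a scheduled limit, and only fire a log message when the metric exceeds that limit. One plausible, scale-invariant formulation is shown below: it equals 1.0 for perfectly white features and grows as the covariance becomes ill-conditioned. This is an assumed definition for illustration, not necessarily the exact statistic used by the code behind these lines.

```python
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """Assumed metric: mean squared eigenvalue of the feature covariance
    divided by the squared mean eigenvalue.  Equals 1.0 when the covariance
    is a multiple of the identity ('white'); grows as energy concentrates
    in a few directions."""
    x = x.reshape(-1, x.shape[-1])              # (frames, channels)
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]              # (channels, channels)
    num_channels = cov.shape[0]
    # d * trace(cov^2) / trace(cov)^2 == mean(eig^2) / mean(eig)^2
    return (cov * cov).sum() * num_channels / cov.trace() ** 2

feats = torch.randn(1000, 256) @ torch.randn(256, 256)  # deliberately non-white
print(whitening_metric(feats))                           # noticeably above 1.0
```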
], batch size: 176, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:30:58,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1174602.0, ans=0.025 2023-06-22 07:30:58,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-22 07:31:24,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1174722.0, ans=0.125 2023-06-22 07:32:25,157 INFO [train.py:996] (0/4) Epoch 7, batch 12850, loss[loss=0.2271, simple_loss=0.3234, pruned_loss=0.06539, over 21746.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3238, pruned_loss=0.08874, over 4269233.42 frames. ], batch size: 351, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:32:39,820 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.112e+02 3.554e+02 4.407e+02 7.373e+02, threshold=7.108e+02, percent-clipped=1.0 2023-06-22 07:34:06,265 INFO [train.py:996] (0/4) Epoch 7, batch 12900, loss[loss=0.21, simple_loss=0.2849, pruned_loss=0.06751, over 21066.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3209, pruned_loss=0.08505, over 4268652.10 frames. ], batch size: 608, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:34:28,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1175262.0, ans=0.125 2023-06-22 07:34:30,356 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:34:39,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1175262.0, ans=0.125 2023-06-22 07:35:53,275 INFO [train.py:996] (0/4) Epoch 7, batch 12950, loss[loss=0.2374, simple_loss=0.3068, pruned_loss=0.08398, over 21509.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.319, pruned_loss=0.08319, over 4265102.06 frames. ], batch size: 131, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:35:59,005 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.69 vs. limit=22.5 2023-06-22 07:36:11,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1175502.0, ans=0.125 2023-06-22 07:36:12,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 2.893e+02 3.599e+02 4.715e+02 8.391e+02, threshold=7.198e+02, percent-clipped=5.0 2023-06-22 07:36:14,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1175562.0, ans=0.0 2023-06-22 07:36:45,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1175622.0, ans=0.1 2023-06-22 07:37:15,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1175742.0, ans=0.125 2023-06-22 07:37:33,427 INFO [train.py:996] (0/4) Epoch 7, batch 13000, loss[loss=0.192, simple_loss=0.2719, pruned_loss=0.05609, over 21770.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3193, pruned_loss=0.08373, over 4274611.55 frames. 
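The grad_scale field in the per-batch records above flips between values such as 32.0 and 16.0, which is the signature of dynamic loss scaling in mixed-precision training: the scale is halved when an overflow is detected and grows back after a run of successful steps. Below is a minimal sketch using PyTorch's automatic mixed precision; the model, optimizer, loss, and growth settings are placeholders, not the training setup that produced this log.

```python
import torch

model = torch.nn.Linear(80, 512).cuda()            # placeholder model
optimizer = torch.optim.AdamW(model.parameters())  # placeholder optimizer
scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)

def train_step(features: torch.Tensor, targets: torch.Tensor) -> None:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = (model(features) - targets).pow(2).mean()  # stand-in loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    # scaler.get_scale() is the kind of number logged as 'grad_scale' above;
    # it halves after an inf/nan gradient and doubles after growth_interval
    # consecutive successful steps.
    print("grad_scale:", scaler.get_scale())
```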
], batch size: 282, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:37:47,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1175802.0, ans=0.0 2023-06-22 07:37:57,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1175862.0, ans=0.125 2023-06-22 07:38:07,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1175862.0, ans=0.1 2023-06-22 07:38:37,756 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-196000.pt 2023-06-22 07:38:45,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1175982.0, ans=0.125 2023-06-22 07:38:53,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1176042.0, ans=0.0 2023-06-22 07:38:59,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1176042.0, ans=0.0 2023-06-22 07:39:07,141 INFO [train.py:996] (0/4) Epoch 7, batch 13050, loss[loss=0.2637, simple_loss=0.3257, pruned_loss=0.1008, over 21769.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3156, pruned_loss=0.08033, over 4266331.81 frames. ], batch size: 441, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:39:30,667 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.860e+02 3.531e+02 4.680e+02 1.133e+03, threshold=7.061e+02, percent-clipped=2.0 2023-06-22 07:40:00,551 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-22 07:40:15,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1176282.0, ans=0.125 2023-06-22 07:40:36,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1176342.0, ans=0.125 2023-06-22 07:40:56,527 INFO [train.py:996] (0/4) Epoch 7, batch 13100, loss[loss=0.279, simple_loss=0.3543, pruned_loss=0.1018, over 21785.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3188, pruned_loss=0.08115, over 4269478.37 frames. ], batch size: 124, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:40:57,892 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.92 vs. limit=5.0 2023-06-22 07:41:41,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1176522.0, ans=0.0 2023-06-22 07:42:42,767 INFO [train.py:996] (0/4) Epoch 7, batch 13150, loss[loss=0.2182, simple_loss=0.2831, pruned_loss=0.07668, over 21184.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.322, pruned_loss=0.08437, over 4272250.37 frames. ], batch size: 143, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:42:51,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1176702.0, ans=0.2 2023-06-22 07:42:53,402 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.48 vs. 
limit=10.0 2023-06-22 07:43:01,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1176702.0, ans=0.2 2023-06-22 07:43:01,882 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.295e+02 3.619e+02 4.512e+02 5.792e+02 9.632e+02, threshold=9.025e+02, percent-clipped=11.0 2023-06-22 07:43:24,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1176822.0, ans=0.125 2023-06-22 07:43:24,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1176822.0, ans=0.1 2023-06-22 07:43:32,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1176822.0, ans=0.0 2023-06-22 07:43:49,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1176882.0, ans=0.125 2023-06-22 07:44:03,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1176942.0, ans=0.1 2023-06-22 07:44:14,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1176942.0, ans=0.125 2023-06-22 07:44:14,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1176942.0, ans=0.125 2023-06-22 07:44:18,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1176942.0, ans=0.015 2023-06-22 07:44:24,044 INFO [train.py:996] (0/4) Epoch 7, batch 13200, loss[loss=0.2418, simple_loss=0.3117, pruned_loss=0.08595, over 21848.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3189, pruned_loss=0.08399, over 4265950.34 frames. ], batch size: 282, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:44:45,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1177062.0, ans=0.125 2023-06-22 07:44:45,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1177062.0, ans=0.0 2023-06-22 07:44:55,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1177062.0, ans=0.125 2023-06-22 07:45:08,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1177122.0, ans=0.125 2023-06-22 07:45:13,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1177122.0, ans=0.0 2023-06-22 07:45:24,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1177182.0, ans=0.0 2023-06-22 07:46:09,412 INFO [train.py:996] (0/4) Epoch 7, batch 13250, loss[loss=0.2667, simple_loss=0.3509, pruned_loss=0.09128, over 20680.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3193, pruned_loss=0.08571, over 4269558.35 frames. 
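The checkpoint.py line a little above saves zipformer/exp_L_small_causal/checkpoint-196000.pt, i.e. a checkpoint named after the global batch index rather than the epoch. A sketch of that pattern is below; the save interval is a hypothetical value (196000 happens to be a multiple of it), and the state actually saved by the training script may include more than what is shown.

```python
from pathlib import Path
import torch

def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                          exp_dir: Path, save_every_n: int = 4000) -> None:
    """Sketch of batch-count-based checkpointing in the style of the
    'Saving checkpoint to .../checkpoint-196000.pt' line above.
    save_every_n is an assumed interval, not read from this log."""
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    exp_dir.mkdir(parents=True, exist_ok=True)
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        exp_dir / f"checkpoint-{batch_idx_train}.pt",
    )
```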
], batch size: 607, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:46:24,187 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.563e+02 3.281e+02 4.048e+02 5.234e+02 8.486e+02, threshold=8.096e+02, percent-clipped=0.0 2023-06-22 07:46:34,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0 2023-06-22 07:46:41,519 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-22 07:46:44,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-22 07:47:46,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1177542.0, ans=0.1 2023-06-22 07:47:50,917 INFO [train.py:996] (0/4) Epoch 7, batch 13300, loss[loss=0.2684, simple_loss=0.3471, pruned_loss=0.09482, over 21685.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3201, pruned_loss=0.08398, over 4269933.61 frames. ], batch size: 298, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:48:27,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1177662.0, ans=0.1 2023-06-22 07:48:36,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1177722.0, ans=0.2 2023-06-22 07:48:56,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1177782.0, ans=0.0 2023-06-22 07:49:28,840 INFO [train.py:996] (0/4) Epoch 7, batch 13350, loss[loss=0.2984, simple_loss=0.3628, pruned_loss=0.1171, over 21745.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3251, pruned_loss=0.08638, over 4267858.89 frames. ], batch size: 124, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:49:34,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1177902.0, ans=0.125 2023-06-22 07:49:43,395 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.412e+02 3.139e+02 3.531e+02 4.158e+02 7.079e+02, threshold=7.062e+02, percent-clipped=0.0 2023-06-22 07:50:14,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1178022.0, ans=0.2 2023-06-22 07:50:43,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-22 07:50:54,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1178142.0, ans=0.0 2023-06-22 07:51:08,307 INFO [train.py:996] (0/4) Epoch 7, batch 13400, loss[loss=0.2732, simple_loss=0.3505, pruned_loss=0.09797, over 21320.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3271, pruned_loss=0.08843, over 4271027.13 frames. 
], batch size: 548, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:52:02,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1178322.0, ans=0.1 2023-06-22 07:52:07,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1178322.0, ans=0.1 2023-06-22 07:52:15,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1178382.0, ans=0.0 2023-06-22 07:52:44,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1178442.0, ans=0.125 2023-06-22 07:52:48,441 INFO [train.py:996] (0/4) Epoch 7, batch 13450, loss[loss=0.2338, simple_loss=0.2935, pruned_loss=0.08703, over 21589.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3278, pruned_loss=0.09092, over 4274221.33 frames. ], batch size: 230, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:53:12,727 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.492e+02 3.365e+02 3.946e+02 4.575e+02 8.284e+02, threshold=7.892e+02, percent-clipped=1.0 2023-06-22 07:53:50,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-22 07:54:04,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.17 vs. limit=8.0 2023-06-22 07:54:28,502 INFO [train.py:996] (0/4) Epoch 7, batch 13500, loss[loss=0.267, simple_loss=0.3379, pruned_loss=0.09799, over 21922.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3186, pruned_loss=0.08834, over 4268889.23 frames. ], batch size: 317, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:55:02,275 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:55:34,898 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=22.5 2023-06-22 07:55:42,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1178982.0, ans=0.0 2023-06-22 07:55:53,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1179042.0, ans=0.125 2023-06-22 07:55:53,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1179042.0, ans=0.0 2023-06-22 07:56:15,649 INFO [train.py:996] (0/4) Epoch 7, batch 13550, loss[loss=0.2582, simple_loss=0.3543, pruned_loss=0.08102, over 20703.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.322, pruned_loss=0.08727, over 4268101.58 frames. 
], batch size: 607, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:56:34,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1179162.0, ans=0.125 2023-06-22 07:56:36,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.476e+02 3.442e+02 4.149e+02 5.236e+02 8.278e+02, threshold=8.298e+02, percent-clipped=4.0 2023-06-22 07:56:54,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1179162.0, ans=10.0 2023-06-22 07:57:17,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1179282.0, ans=0.125 2023-06-22 07:57:32,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1179342.0, ans=0.0 2023-06-22 07:57:54,940 INFO [train.py:996] (0/4) Epoch 7, batch 13600, loss[loss=0.2323, simple_loss=0.3014, pruned_loss=0.08159, over 21308.00 frames. ], tot_loss[loss=0.25, simple_loss=0.324, pruned_loss=0.08797, over 4275156.27 frames. ], batch size: 159, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:58:29,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1179462.0, ans=0.125 2023-06-22 07:58:32,003 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-22 07:58:40,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1179522.0, ans=0.125 2023-06-22 07:59:24,189 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-22 07:59:34,122 INFO [train.py:996] (0/4) Epoch 7, batch 13650, loss[loss=0.2085, simple_loss=0.2779, pruned_loss=0.06954, over 21637.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3202, pruned_loss=0.08496, over 4272323.07 frames. ], batch size: 332, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:59:42,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1179702.0, ans=0.1 2023-06-22 07:59:44,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1179702.0, ans=0.0 2023-06-22 07:59:54,407 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.926e+02 3.440e+02 4.459e+02 9.365e+02, threshold=6.879e+02, percent-clipped=1.0 2023-06-22 08:00:19,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1179822.0, ans=0.125 2023-06-22 08:00:27,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1179822.0, ans=0.1 2023-06-22 08:00:43,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.93 vs. 
limit=22.5 2023-06-22 08:00:54,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1179942.0, ans=0.0 2023-06-22 08:00:56,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1179942.0, ans=0.125 2023-06-22 08:01:13,451 INFO [train.py:996] (0/4) Epoch 7, batch 13700, loss[loss=0.2302, simple_loss=0.3121, pruned_loss=0.07413, over 21743.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3152, pruned_loss=0.08488, over 4274228.13 frames. ], batch size: 351, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 08:02:22,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1180182.0, ans=0.125 2023-06-22 08:02:59,603 INFO [train.py:996] (0/4) Epoch 7, batch 13750, loss[loss=0.2994, simple_loss=0.3713, pruned_loss=0.1138, over 21399.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3136, pruned_loss=0.0844, over 4265531.08 frames. ], batch size: 471, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 08:03:01,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1180302.0, ans=0.05 2023-06-22 08:03:15,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1180302.0, ans=0.125 2023-06-22 08:03:23,021 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.446e+02 3.304e+02 4.106e+02 4.985e+02 1.123e+03, threshold=8.212e+02, percent-clipped=9.0 2023-06-22 08:03:23,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1180362.0, ans=0.1 2023-06-22 08:04:36,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1180542.0, ans=0.125 2023-06-22 08:04:46,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1180542.0, ans=0.125 2023-06-22 08:04:48,543 INFO [train.py:996] (0/4) Epoch 7, batch 13800, loss[loss=0.346, simple_loss=0.4378, pruned_loss=0.1271, over 21523.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3215, pruned_loss=0.08435, over 4270325.98 frames. ], batch size: 471, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 08:05:46,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1180782.0, ans=0.09899494936611666 2023-06-22 08:06:22,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1180842.0, ans=0.2 2023-06-22 08:06:22,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=12.0 2023-06-22 08:06:27,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=22.5 2023-06-22 08:06:29,568 INFO [train.py:996] (0/4) Epoch 7, batch 13850, loss[loss=0.1964, simple_loss=0.2631, pruned_loss=0.06483, over 20788.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3257, pruned_loss=0.08498, over 4262677.55 frames. 
], batch size: 608, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:06:33,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1180902.0, ans=0.2 2023-06-22 08:06:51,706 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 3.586e+02 4.613e+02 6.020e+02 1.189e+03, threshold=9.227e+02, percent-clipped=5.0 2023-06-22 08:07:19,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-22 08:07:53,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1181142.0, ans=0.2 2023-06-22 08:08:08,740 INFO [train.py:996] (0/4) Epoch 7, batch 13900, loss[loss=0.3287, simple_loss=0.3638, pruned_loss=0.1469, over 21723.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3299, pruned_loss=0.08901, over 4264908.68 frames. ], batch size: 508, lr: 4.32e-03, grad_scale: 8.0 2023-06-22 08:09:49,812 INFO [train.py:996] (0/4) Epoch 7, batch 13950, loss[loss=0.2789, simple_loss=0.3498, pruned_loss=0.104, over 21797.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3309, pruned_loss=0.09082, over 4273621.91 frames. ], batch size: 351, lr: 4.32e-03, grad_scale: 8.0 2023-06-22 08:10:18,733 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.662e+02 3.399e+02 3.923e+02 4.848e+02 6.986e+02, threshold=7.845e+02, percent-clipped=0.0 2023-06-22 08:10:44,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1181622.0, ans=0.0 2023-06-22 08:11:28,477 INFO [train.py:996] (0/4) Epoch 7, batch 14000, loss[loss=0.2648, simple_loss=0.3521, pruned_loss=0.0887, over 21576.00 frames. ], tot_loss[loss=0.252, simple_loss=0.328, pruned_loss=0.088, over 4274008.19 frames. ], batch size: 471, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:11:32,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1181802.0, ans=0.0 2023-06-22 08:12:24,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=22.5 2023-06-22 08:12:40,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1181982.0, ans=0.125 2023-06-22 08:13:01,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1182042.0, ans=0.0 2023-06-22 08:13:10,925 INFO [train.py:996] (0/4) Epoch 7, batch 14050, loss[loss=0.1892, simple_loss=0.2604, pruned_loss=0.05902, over 15360.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3221, pruned_loss=0.08398, over 4267086.99 frames. ], batch size: 60, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:13:32,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1182162.0, ans=0.125 2023-06-22 08:13:34,945 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.004e+02 3.495e+02 4.384e+02 1.047e+03, threshold=6.990e+02, percent-clipped=3.0 2023-06-22 08:14:49,757 INFO [train.py:996] (0/4) Epoch 7, batch 14100, loss[loss=0.2493, simple_loss=0.3662, pruned_loss=0.06619, over 19773.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3154, pruned_loss=0.08304, over 4265591.37 frames. 
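Across these batches the learning rate in the per-batch records drifts slowly downward (4.34e-03 → 4.33e-03 → 4.32e-03) rather than dropping in discrete steps, which is consistent with a schedule that decays smoothly in both batch count and epoch. The function below sketches one such schedule in the style of icefall's Eden scheduler; the exponents and the constants lr_batches and lr_epochs are assumptions here, and no claim is made that this reproduces the exact values in the log.

```python
def eden_like_lr(base_lr: float, batch: int, epoch: float,
                 lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    """Assumed functional form: smooth power-law decay in batch and epoch,
    so late in training the lr only changes in its third significant digit
    between logging intervals, as seen above."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# With any base_lr, the relative change over ~20k batches at this stage is
# under 1%, which is why the logged lr drifts so slowly.
ratio = eden_like_lr(1.0, 1_190_000, 7.0) / eden_like_lr(1.0, 1_170_000, 7.0)
print(ratio)  # ~0.992
```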
], batch size: 702, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:14:58,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1182402.0, ans=0.125 2023-06-22 08:15:22,985 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=22.5 2023-06-22 08:15:23,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1182462.0, ans=0.0 2023-06-22 08:15:40,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=15.0 2023-06-22 08:16:21,816 INFO [train.py:996] (0/4) Epoch 7, batch 14150, loss[loss=0.2525, simple_loss=0.3448, pruned_loss=0.08012, over 21609.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3176, pruned_loss=0.08382, over 4272812.07 frames. ], batch size: 389, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:16:31,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1182702.0, ans=0.05 2023-06-22 08:16:32,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1182702.0, ans=0.125 2023-06-22 08:16:44,673 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 2.878e+02 3.254e+02 3.924e+02 9.436e+02, threshold=6.508e+02, percent-clipped=4.0 2023-06-22 08:16:52,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1182762.0, ans=0.125 2023-06-22 08:17:36,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1182942.0, ans=0.035 2023-06-22 08:17:51,150 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.49 vs. limit=15.0 2023-06-22 08:17:53,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1182942.0, ans=10.0 2023-06-22 08:17:53,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1182942.0, ans=0.125 2023-06-22 08:17:57,565 INFO [train.py:996] (0/4) Epoch 7, batch 14200, loss[loss=0.2269, simple_loss=0.2973, pruned_loss=0.07829, over 21803.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3153, pruned_loss=0.08174, over 4267371.88 frames. ], batch size: 371, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:18:03,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-22 08:19:36,415 INFO [train.py:996] (0/4) Epoch 7, batch 14250, loss[loss=0.2565, simple_loss=0.3131, pruned_loss=0.0999, over 21188.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.309, pruned_loss=0.08129, over 4262819.72 frames. ], batch size: 143, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:19:39,123 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.36 vs. 
limit=22.5 2023-06-22 08:19:55,871 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 2.870e+02 3.314e+02 3.996e+02 6.865e+02, threshold=6.627e+02, percent-clipped=2.0 2023-06-22 08:20:19,704 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=22.5 2023-06-22 08:20:39,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1183482.0, ans=0.07 2023-06-22 08:21:16,225 INFO [train.py:996] (0/4) Epoch 7, batch 14300, loss[loss=0.2167, simple_loss=0.2858, pruned_loss=0.07374, over 21834.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3132, pruned_loss=0.08206, over 4256956.16 frames. ], batch size: 118, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:21:21,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1183602.0, ans=0.1 2023-06-22 08:21:24,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1183602.0, ans=0.125 2023-06-22 08:22:11,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1183722.0, ans=0.125 2023-06-22 08:22:39,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1183842.0, ans=0.125 2023-06-22 08:22:47,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1183842.0, ans=0.1 2023-06-22 08:22:56,613 INFO [train.py:996] (0/4) Epoch 7, batch 14350, loss[loss=0.2139, simple_loss=0.2757, pruned_loss=0.07603, over 21357.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3184, pruned_loss=0.08326, over 4249074.17 frames. ], batch size: 131, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:23:10,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1183962.0, ans=0.1 2023-06-22 08:23:15,129 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 3.408e+02 4.555e+02 6.047e+02 1.523e+03, threshold=9.110e+02, percent-clipped=21.0 2023-06-22 08:23:17,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-22 08:24:24,732 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:24:34,930 INFO [train.py:996] (0/4) Epoch 7, batch 14400, loss[loss=0.1992, simple_loss=0.2685, pruned_loss=0.06499, over 21471.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3161, pruned_loss=0.08407, over 4260237.42 frames. 
], batch size: 212, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:24:37,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1184202.0, ans=0.125 2023-06-22 08:24:47,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1184202.0, ans=0.2 2023-06-22 08:24:58,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1184262.0, ans=0.125 2023-06-22 08:25:00,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=12.0 2023-06-22 08:25:31,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1184322.0, ans=0.125 2023-06-22 08:25:40,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1184382.0, ans=0.0 2023-06-22 08:25:51,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1184442.0, ans=0.125 2023-06-22 08:26:11,503 INFO [train.py:996] (0/4) Epoch 7, batch 14450, loss[loss=0.223, simple_loss=0.2866, pruned_loss=0.07972, over 21335.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3106, pruned_loss=0.08425, over 4264069.80 frames. ], batch size: 144, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:26:29,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1184562.0, ans=0.125 2023-06-22 08:26:30,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 2.987e+02 3.327e+02 4.057e+02 7.605e+02, threshold=6.653e+02, percent-clipped=0.0 2023-06-22 08:26:44,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1184562.0, ans=0.125 2023-06-22 08:27:52,130 INFO [train.py:996] (0/4) Epoch 7, batch 14500, loss[loss=0.2078, simple_loss=0.2939, pruned_loss=0.06083, over 21607.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3068, pruned_loss=0.0837, over 4263554.03 frames. ], batch size: 263, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:29:13,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1185042.0, ans=0.0 2023-06-22 08:29:28,330 INFO [train.py:996] (0/4) Epoch 7, batch 14550, loss[loss=0.2601, simple_loss=0.3337, pruned_loss=0.09327, over 21673.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3124, pruned_loss=0.08562, over 4265016.45 frames. 
], batch size: 351, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:29:30,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1185102.0, ans=0.125 2023-06-22 08:29:33,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1185102.0, ans=0.5 2023-06-22 08:29:57,837 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 3.217e+02 4.103e+02 5.336e+02 9.308e+02, threshold=8.206e+02, percent-clipped=6.0 2023-06-22 08:30:54,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1185342.0, ans=0.125 2023-06-22 08:31:09,744 INFO [train.py:996] (0/4) Epoch 7, batch 14600, loss[loss=0.1816, simple_loss=0.2354, pruned_loss=0.06395, over 20844.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3204, pruned_loss=0.08959, over 4265298.81 frames. ], batch size: 608, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:32:06,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1185522.0, ans=0.2 2023-06-22 08:32:10,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1185582.0, ans=0.0 2023-06-22 08:32:21,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1185582.0, ans=0.2 2023-06-22 08:32:26,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1185582.0, ans=0.2 2023-06-22 08:32:29,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1185582.0, ans=0.125 2023-06-22 08:32:48,050 INFO [train.py:996] (0/4) Epoch 7, batch 14650, loss[loss=0.2641, simple_loss=0.3564, pruned_loss=0.08595, over 21237.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3241, pruned_loss=0.08969, over 4268572.05 frames. ], batch size: 548, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:32:48,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1185702.0, ans=0.125 2023-06-22 08:33:22,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.921e+02 3.378e+02 4.532e+02 7.463e+02, threshold=6.756e+02, percent-clipped=1.0 2023-06-22 08:33:36,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1185822.0, ans=0.125 2023-06-22 08:34:28,144 INFO [train.py:996] (0/4) Epoch 7, batch 14700, loss[loss=0.323, simple_loss=0.4072, pruned_loss=0.1193, over 21541.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3175, pruned_loss=0.08339, over 4272931.08 frames. 
], batch size: 508, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:34:48,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1186002.0, ans=0.1 2023-06-22 08:34:52,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1186002.0, ans=0.2 2023-06-22 08:35:15,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1186122.0, ans=0.0 2023-06-22 08:35:19,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1186122.0, ans=0.0 2023-06-22 08:35:54,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1186242.0, ans=0.04949747468305833 2023-06-22 08:36:19,348 INFO [train.py:996] (0/4) Epoch 7, batch 14750, loss[loss=0.3033, simple_loss=0.3695, pruned_loss=0.1186, over 21749.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3224, pruned_loss=0.08592, over 4267424.23 frames. ], batch size: 441, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:36:36,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1186302.0, ans=0.125 2023-06-22 08:36:45,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 3.126e+02 3.786e+02 4.508e+02 7.747e+02, threshold=7.572e+02, percent-clipped=1.0 2023-06-22 08:36:57,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1186362.0, ans=0.0 2023-06-22 08:36:58,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=22.5 2023-06-22 08:37:07,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1186422.0, ans=0.125 2023-06-22 08:37:14,821 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.657e-03 2023-06-22 08:38:03,848 INFO [train.py:996] (0/4) Epoch 7, batch 14800, loss[loss=0.2896, simple_loss=0.3475, pruned_loss=0.1158, over 21585.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.334, pruned_loss=0.09081, over 4270418.60 frames. ], batch size: 414, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 08:39:00,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1186782.0, ans=0.1 2023-06-22 08:39:02,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1186782.0, ans=0.05 2023-06-22 08:39:15,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-22 08:39:45,603 INFO [train.py:996] (0/4) Epoch 7, batch 14850, loss[loss=0.2257, simple_loss=0.2831, pruned_loss=0.08418, over 21068.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3275, pruned_loss=0.09064, over 4273975.25 frames. 
], batch size: 143, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 08:40:02,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1186902.0, ans=0.125 2023-06-22 08:40:02,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1186902.0, ans=0.0 2023-06-22 08:40:07,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1186962.0, ans=0.2 2023-06-22 08:40:12,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 3.436e+02 3.807e+02 4.957e+02 1.167e+03, threshold=7.615e+02, percent-clipped=4.0 2023-06-22 08:40:21,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1186962.0, ans=0.0 2023-06-22 08:40:30,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1187022.0, ans=0.1 2023-06-22 08:40:32,211 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:41:32,014 INFO [train.py:996] (0/4) Epoch 7, batch 14900, loss[loss=0.1946, simple_loss=0.2554, pruned_loss=0.06693, over 20726.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3297, pruned_loss=0.09165, over 4275741.62 frames. ], batch size: 607, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 08:41:34,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1187202.0, ans=0.125 2023-06-22 08:41:58,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1187262.0, ans=0.125 2023-06-22 08:42:16,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1187322.0, ans=0.1 2023-06-22 08:42:50,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1187442.0, ans=0.125 2023-06-22 08:42:53,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1187442.0, ans=0.2 2023-06-22 08:43:10,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1187442.0, ans=0.0 2023-06-22 08:43:12,816 INFO [train.py:996] (0/4) Epoch 7, batch 14950, loss[loss=0.2639, simple_loss=0.3466, pruned_loss=0.09055, over 21403.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.33, pruned_loss=0.09026, over 4274343.22 frames. ], batch size: 471, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:43:13,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1187502.0, ans=0.125 2023-06-22 08:43:24,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1187502.0, ans=0.95 2023-06-22 08:43:39,832 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.264e+02 3.667e+02 4.078e+02 7.613e+02, threshold=7.333e+02, percent-clipped=0.0 2023-06-22 08:44:16,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.60 vs. 
limit=15.0 2023-06-22 08:44:36,659 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.77 vs. limit=15.0 2023-06-22 08:44:37,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1187742.0, ans=0.0 2023-06-22 08:44:52,929 INFO [train.py:996] (0/4) Epoch 7, batch 15000, loss[loss=0.247, simple_loss=0.3171, pruned_loss=0.08839, over 21812.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3317, pruned_loss=0.09134, over 4280033.04 frames. ], batch size: 351, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:44:52,930 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 08:45:09,858 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2588, simple_loss=0.3554, pruned_loss=0.08105, over 1796401.00 frames. 2023-06-22 08:45:09,858 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-22 08:45:51,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1187862.0, ans=0.125 2023-06-22 08:45:52,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1187862.0, ans=0.1 2023-06-22 08:45:57,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1187922.0, ans=0.2 2023-06-22 08:46:32,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1187982.0, ans=0.2 2023-06-22 08:46:56,314 INFO [train.py:996] (0/4) Epoch 7, batch 15050, loss[loss=0.2453, simple_loss=0.3291, pruned_loss=0.08076, over 21664.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.331, pruned_loss=0.09193, over 4276022.62 frames. ], batch size: 263, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:47:27,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.680e+02 3.352e+02 4.069e+02 4.839e+02 9.529e+02, threshold=8.138e+02, percent-clipped=2.0 2023-06-22 08:47:36,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1188222.0, ans=0.2 2023-06-22 08:47:57,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1188282.0, ans=0.125 2023-06-22 08:48:02,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1188282.0, ans=0.1 2023-06-22 08:48:19,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1188342.0, ans=0.125 2023-06-22 08:48:32,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1188342.0, ans=0.2 2023-06-22 08:48:39,533 INFO [train.py:996] (0/4) Epoch 7, batch 15100, loss[loss=0.2984, simple_loss=0.3754, pruned_loss=0.1107, over 21565.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3336, pruned_loss=0.09165, over 4278468.41 frames. 
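The validation lines above report a loss over a fixed pool of about 1.8 M frames together with the peak GPU memory. A frames-weighted average over the whole dev set (so long utterances count proportionally more) plus torch.cuda.max_memory_allocated() gives numbers of exactly that shape. The sketch below assumes a hypothetical model.loss helper and a placeholder dataloader; it is not the validation routine from the training script.

```python
import torch

@torch.no_grad()
def compute_validation_loss(model, valid_loader, device="cuda:0") -> None:
    """Sketch: accumulate (per-frame loss * frames) and frames over the dev
    set, then report the frames-weighted average, as in the
    'validation: loss=... over 1796401.00 frames.' lines above."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in valid_loader:                # placeholder iterable of batches
        loss, num_frames = model.loss(batch)  # hypothetical helper returning (scalar, frames)
        tot_loss += loss.item() * num_frames
        tot_frames += num_frames
    model.train()
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames.")
    print(f"Maximum memory allocated so far is {max_mb}MB")
```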
], batch size: 414, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:49:12,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1188462.0, ans=0.0 2023-06-22 08:49:38,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1188582.0, ans=0.0 2023-06-22 08:49:46,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1188582.0, ans=0.2 2023-06-22 08:50:07,648 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-06-22 08:50:18,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1188702.0, ans=0.125 2023-06-22 08:50:19,343 INFO [train.py:996] (0/4) Epoch 7, batch 15150, loss[loss=0.2066, simple_loss=0.2707, pruned_loss=0.0713, over 21559.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3307, pruned_loss=0.09204, over 4273908.00 frames. ], batch size: 263, lr: 4.31e-03, grad_scale: 8.0 2023-06-22 08:50:30,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1188702.0, ans=0.0 2023-06-22 08:50:49,055 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 3.254e+02 3.801e+02 4.686e+02 8.027e+02, threshold=7.602e+02, percent-clipped=0.0 2023-06-22 08:50:51,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1188762.0, ans=0.0 2023-06-22 08:51:09,880 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-22 08:51:13,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1188822.0, ans=0.125 2023-06-22 08:51:42,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1188942.0, ans=0.125 2023-06-22 08:52:04,656 INFO [train.py:996] (0/4) Epoch 7, batch 15200, loss[loss=0.2152, simple_loss=0.3033, pruned_loss=0.0636, over 21229.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3215, pruned_loss=0.08814, over 4273423.18 frames. ], batch size: 549, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:52:19,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1189062.0, ans=0.125 2023-06-22 08:53:18,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1189242.0, ans=0.125 2023-06-22 08:53:44,179 INFO [train.py:996] (0/4) Epoch 7, batch 15250, loss[loss=0.287, simple_loss=0.4101, pruned_loss=0.08194, over 19718.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.317, pruned_loss=0.08656, over 4254935.92 frames. ], batch size: 702, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:53:52,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1189302.0, ans=0.1 2023-06-22 08:53:53,593 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.21 vs. 
limit=12.0 2023-06-22 08:54:02,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1189362.0, ans=0.125 2023-06-22 08:54:13,568 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.041e+02 3.715e+02 4.659e+02 9.808e+02, threshold=7.430e+02, percent-clipped=2.0 2023-06-22 08:54:33,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1189422.0, ans=0.2 2023-06-22 08:54:44,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1189482.0, ans=10.0 2023-06-22 08:55:13,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.05 vs. limit=6.0 2023-06-22 08:55:25,363 INFO [train.py:996] (0/4) Epoch 7, batch 15300, loss[loss=0.1832, simple_loss=0.2394, pruned_loss=0.06353, over 20724.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3183, pruned_loss=0.08932, over 4257696.39 frames. ], batch size: 609, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:55:29,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1189602.0, ans=0.125 2023-06-22 08:55:30,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1189602.0, ans=0.0 2023-06-22 08:55:30,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1189602.0, ans=0.0 2023-06-22 08:55:30,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1189602.0, ans=0.2 2023-06-22 08:56:13,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.07 vs. limit=15.0 2023-06-22 08:56:53,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1189842.0, ans=0.0 2023-06-22 08:57:04,788 INFO [train.py:996] (0/4) Epoch 7, batch 15350, loss[loss=0.2023, simple_loss=0.3235, pruned_loss=0.04057, over 19863.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3224, pruned_loss=0.09142, over 4267027.25 frames. ], batch size: 703, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:57:28,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1189962.0, ans=0.125 2023-06-22 08:57:33,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.624e+02 3.368e+02 3.940e+02 5.271e+02 1.051e+03, threshold=7.879e+02, percent-clipped=5.0 2023-06-22 08:57:51,881 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-22 08:58:43,335 INFO [train.py:996] (0/4) Epoch 7, batch 15400, loss[loss=0.2615, simple_loss=0.3259, pruned_loss=0.0986, over 21812.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3219, pruned_loss=0.08993, over 4267030.46 frames. 
], batch size: 441, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:58:54,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1190202.0, ans=0.0 2023-06-22 08:59:11,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1190262.0, ans=0.2 2023-06-22 08:59:34,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1190382.0, ans=0.0 2023-06-22 08:59:57,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1190442.0, ans=0.1 2023-06-22 09:00:22,619 INFO [train.py:996] (0/4) Epoch 7, batch 15450, loss[loss=0.2228, simple_loss=0.2896, pruned_loss=0.078, over 21152.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3197, pruned_loss=0.0891, over 4267518.33 frames. ], batch size: 608, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 09:00:34,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1190502.0, ans=0.125 2023-06-22 09:00:51,253 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 2.924e+02 3.383e+02 4.121e+02 7.553e+02, threshold=6.767e+02, percent-clipped=0.0 2023-06-22 09:00:51,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1190562.0, ans=0.0 2023-06-22 09:01:06,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1190622.0, ans=0.0 2023-06-22 09:01:49,017 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:02:02,988 INFO [train.py:996] (0/4) Epoch 7, batch 15500, loss[loss=0.2544, simple_loss=0.3282, pruned_loss=0.09033, over 21693.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3232, pruned_loss=0.08938, over 4268765.58 frames. ], batch size: 351, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 09:02:29,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1190862.0, ans=10.0 2023-06-22 09:02:36,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1190862.0, ans=0.04949747468305833 2023-06-22 09:03:48,165 INFO [train.py:996] (0/4) Epoch 7, batch 15550, loss[loss=0.2045, simple_loss=0.2672, pruned_loss=0.0709, over 21906.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3228, pruned_loss=0.08679, over 4272496.87 frames. 
], batch size: 98, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 09:04:06,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1191162.0, ans=0.1 2023-06-22 09:04:12,442 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 3.104e+02 3.542e+02 4.427e+02 7.965e+02, threshold=7.084e+02, percent-clipped=2.0 2023-06-22 09:04:39,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1191282.0, ans=0.125 2023-06-22 09:05:00,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1191282.0, ans=0.125 2023-06-22 09:05:21,979 INFO [train.py:996] (0/4) Epoch 7, batch 15600, loss[loss=0.2396, simple_loss=0.3103, pruned_loss=0.08447, over 21769.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3163, pruned_loss=0.08543, over 4267015.29 frames. ], batch size: 351, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 09:05:35,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1191402.0, ans=0.2 2023-06-22 09:06:07,138 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-06-22 09:07:07,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1191702.0, ans=0.0 2023-06-22 09:07:08,496 INFO [train.py:996] (0/4) Epoch 7, batch 15650, loss[loss=0.2251, simple_loss=0.2823, pruned_loss=0.08393, over 21348.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3147, pruned_loss=0.08449, over 4266971.11 frames. ], batch size: 160, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:07:38,623 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.422e+02 3.201e+02 3.774e+02 4.746e+02 8.455e+02, threshold=7.547e+02, percent-clipped=5.0 2023-06-22 09:08:46,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-22 09:08:47,635 INFO [train.py:996] (0/4) Epoch 7, batch 15700, loss[loss=0.2063, simple_loss=0.2723, pruned_loss=0.07018, over 21508.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3113, pruned_loss=0.08348, over 4266217.39 frames. ], batch size: 230, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:08:52,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0 2023-06-22 09:09:18,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1192062.0, ans=0.2 2023-06-22 09:10:27,347 INFO [train.py:996] (0/4) Epoch 7, batch 15750, loss[loss=0.241, simple_loss=0.3021, pruned_loss=0.08998, over 21273.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3068, pruned_loss=0.08338, over 4261758.50 frames. 
], batch size: 471, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:10:45,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1192362.0, ans=0.125 2023-06-22 09:10:56,921 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.451e+02 3.176e+02 3.735e+02 4.754e+02 7.774e+02, threshold=7.471e+02, percent-clipped=1.0 2023-06-22 09:11:19,996 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=22.5 2023-06-22 09:11:30,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1192482.0, ans=0.125 2023-06-22 09:11:32,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1192482.0, ans=0.1 2023-06-22 09:11:49,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1192542.0, ans=0.05 2023-06-22 09:12:07,003 INFO [train.py:996] (0/4) Epoch 7, batch 15800, loss[loss=0.2085, simple_loss=0.2795, pruned_loss=0.06871, over 21603.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3028, pruned_loss=0.08305, over 4257044.79 frames. ], batch size: 263, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:12:32,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1192662.0, ans=0.2 2023-06-22 09:12:57,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1192722.0, ans=0.125 2023-06-22 09:13:04,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1192722.0, ans=0.0 2023-06-22 09:13:13,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1192782.0, ans=0.0 2023-06-22 09:13:38,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1192842.0, ans=0.035 2023-06-22 09:13:38,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1192842.0, ans=0.2 2023-06-22 09:13:41,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1192842.0, ans=0.125 2023-06-22 09:13:43,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1192842.0, ans=0.95 2023-06-22 09:13:45,639 INFO [train.py:996] (0/4) Epoch 7, batch 15850, loss[loss=0.2336, simple_loss=0.2997, pruned_loss=0.08378, over 21197.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3052, pruned_loss=0.08553, over 4256031.47 frames. 
], batch size: 143, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:13:55,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1192902.0, ans=0.125 2023-06-22 09:14:06,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1192962.0, ans=0.1 2023-06-22 09:14:06,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1192962.0, ans=0.125 2023-06-22 09:14:15,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 3.067e+02 3.802e+02 4.626e+02 8.154e+02, threshold=7.604e+02, percent-clipped=3.0 2023-06-22 09:15:03,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1193142.0, ans=0.2 2023-06-22 09:15:23,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1193142.0, ans=0.1 2023-06-22 09:15:26,021 INFO [train.py:996] (0/4) Epoch 7, batch 15900, loss[loss=0.2046, simple_loss=0.2713, pruned_loss=0.06894, over 21436.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3034, pruned_loss=0.08579, over 4265166.12 frames. ], batch size: 389, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:15:36,257 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:15:59,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1193262.0, ans=0.125 2023-06-22 09:16:27,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1193382.0, ans=0.0 2023-06-22 09:16:38,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1193382.0, ans=0.125 2023-06-22 09:16:49,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1193442.0, ans=0.125 2023-06-22 09:16:55,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1193442.0, ans=0.0 2023-06-22 09:17:05,287 INFO [train.py:996] (0/4) Epoch 7, batch 15950, loss[loss=0.1939, simple_loss=0.2935, pruned_loss=0.04716, over 21710.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3031, pruned_loss=0.08255, over 4266030.02 frames. ], batch size: 247, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:17:19,492 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-22 09:17:31,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.361e+02 3.037e+02 3.517e+02 4.251e+02 9.007e+02, threshold=7.034e+02, percent-clipped=1.0 2023-06-22 09:18:31,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1193742.0, ans=0.0 2023-06-22 09:18:46,882 INFO [train.py:996] (0/4) Epoch 7, batch 16000, loss[loss=0.1954, simple_loss=0.2965, pruned_loss=0.04718, over 20934.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3054, pruned_loss=0.08032, over 4257531.96 frames. 
], batch size: 607, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:18:52,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-22 09:19:03,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1193862.0, ans=0.125 2023-06-22 09:19:12,409 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.22 vs. limit=15.0 2023-06-22 09:20:10,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1194042.0, ans=0.125 2023-06-22 09:20:16,547 INFO [train.py:996] (0/4) Epoch 7, batch 16050, loss[loss=0.2035, simple_loss=0.3124, pruned_loss=0.04729, over 20844.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3091, pruned_loss=0.07895, over 4259278.00 frames. ], batch size: 608, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:20:26,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1194102.0, ans=0.0 2023-06-22 09:20:34,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1194102.0, ans=0.125 2023-06-22 09:20:34,835 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-22 09:20:35,737 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:20:47,934 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.171e+02 3.896e+02 5.247e+02 9.817e+02, threshold=7.791e+02, percent-clipped=4.0 2023-06-22 09:20:49,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2023-06-22 09:20:59,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1194222.0, ans=0.0 2023-06-22 09:21:06,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1194222.0, ans=0.125 2023-06-22 09:21:29,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1194282.0, ans=0.1 2023-06-22 09:21:33,251 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=22.5 2023-06-22 09:21:55,846 INFO [train.py:996] (0/4) Epoch 7, batch 16100, loss[loss=0.2498, simple_loss=0.3423, pruned_loss=0.07865, over 21685.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3166, pruned_loss=0.08091, over 4269079.22 frames. 
], batch size: 230, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:22:20,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1194462.0, ans=0.015 2023-06-22 09:22:25,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1194462.0, ans=0.0 2023-06-22 09:22:30,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1194462.0, ans=0.125 2023-06-22 09:23:29,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1194642.0, ans=0.0 2023-06-22 09:23:33,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1194642.0, ans=22.5 2023-06-22 09:23:35,237 INFO [train.py:996] (0/4) Epoch 7, batch 16150, loss[loss=0.248, simple_loss=0.3023, pruned_loss=0.09682, over 21622.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3153, pruned_loss=0.08316, over 4280812.04 frames. ], batch size: 548, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:24:08,137 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.102e+02 3.921e+02 4.852e+02 9.563e+02, threshold=7.842e+02, percent-clipped=2.0 2023-06-22 09:24:08,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1194762.0, ans=0.125 2023-06-22 09:25:18,435 INFO [train.py:996] (0/4) Epoch 7, batch 16200, loss[loss=0.3144, simple_loss=0.3749, pruned_loss=0.1269, over 21847.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3183, pruned_loss=0.0846, over 4286755.90 frames. ], batch size: 124, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:25:20,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1195002.0, ans=0.0 2023-06-22 09:26:23,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1195182.0, ans=0.125 2023-06-22 09:26:59,855 INFO [train.py:996] (0/4) Epoch 7, batch 16250, loss[loss=0.258, simple_loss=0.3379, pruned_loss=0.08905, over 20734.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3162, pruned_loss=0.085, over 4272087.84 frames. ], batch size: 607, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:27:31,801 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.539e+02 3.018e+02 3.500e+02 4.433e+02 8.732e+02, threshold=7.000e+02, percent-clipped=2.0 2023-06-22 09:27:35,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1195422.0, ans=0.125 2023-06-22 09:28:12,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-22 09:28:30,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-22 09:28:33,485 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.20 vs. limit=15.0 2023-06-22 09:28:40,766 INFO [train.py:996] (0/4) Epoch 7, batch 16300, loss[loss=0.201, simple_loss=0.2709, pruned_loss=0.06554, over 21740.00 frames. 
], tot_loss[loss=0.2368, simple_loss=0.3114, pruned_loss=0.08113, over 4271742.61 frames. ], batch size: 118, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:29:06,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=15.0 2023-06-22 09:30:24,334 INFO [train.py:996] (0/4) Epoch 7, batch 16350, loss[loss=0.262, simple_loss=0.3315, pruned_loss=0.09621, over 21306.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3104, pruned_loss=0.08121, over 4262365.45 frames. ], batch size: 549, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:30:26,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1195902.0, ans=0.0 2023-06-22 09:30:28,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1195902.0, ans=0.125 2023-06-22 09:31:06,952 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.307e+02 4.109e+02 5.521e+02 1.139e+03, threshold=8.218e+02, percent-clipped=11.0 2023-06-22 09:31:07,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1195962.0, ans=0.0 2023-06-22 09:31:39,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1196082.0, ans=0.0 2023-06-22 09:32:07,025 INFO [train.py:996] (0/4) Epoch 7, batch 16400, loss[loss=0.2543, simple_loss=0.3246, pruned_loss=0.09197, over 21714.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.315, pruned_loss=0.08364, over 4262784.60 frames. ], batch size: 389, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:32:30,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1196262.0, ans=0.025 2023-06-22 09:33:08,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1196382.0, ans=0.0 2023-06-22 09:33:17,824 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.75 vs. limit=22.5 2023-06-22 09:33:19,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1196382.0, ans=0.2 2023-06-22 09:33:22,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1196382.0, ans=15.0 2023-06-22 09:33:35,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1196442.0, ans=0.125 2023-06-22 09:33:37,726 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=15.0 2023-06-22 09:33:47,431 INFO [train.py:996] (0/4) Epoch 7, batch 16450, loss[loss=0.2373, simple_loss=0.3066, pruned_loss=0.08394, over 21899.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3156, pruned_loss=0.08431, over 4264489.02 frames. 
], batch size: 351, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:34:05,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1196502.0, ans=0.2 2023-06-22 09:34:29,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 3.059e+02 3.522e+02 4.400e+02 7.364e+02, threshold=7.044e+02, percent-clipped=0.0 2023-06-22 09:34:38,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1196622.0, ans=0.1 2023-06-22 09:35:28,333 INFO [train.py:996] (0/4) Epoch 7, batch 16500, loss[loss=0.2171, simple_loss=0.2907, pruned_loss=0.07175, over 21773.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3149, pruned_loss=0.0848, over 4275333.04 frames. ], batch size: 298, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:35:44,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1196802.0, ans=0.1 2023-06-22 09:36:12,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1196862.0, ans=0.125 2023-06-22 09:36:57,412 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:37:16,089 INFO [train.py:996] (0/4) Epoch 7, batch 16550, loss[loss=0.2239, simple_loss=0.289, pruned_loss=0.07943, over 21696.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.313, pruned_loss=0.08265, over 4279092.46 frames. ], batch size: 263, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:37:17,394 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-22 09:37:24,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1197102.0, ans=0.125 2023-06-22 09:37:32,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1197102.0, ans=0.125 2023-06-22 09:37:45,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1197162.0, ans=0.0 2023-06-22 09:37:53,834 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 3.830e+02 4.900e+02 6.619e+02 1.240e+03, threshold=9.800e+02, percent-clipped=18.0 2023-06-22 09:38:07,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1197222.0, ans=0.1 2023-06-22 09:39:08,774 INFO [train.py:996] (0/4) Epoch 7, batch 16600, loss[loss=0.2704, simple_loss=0.354, pruned_loss=0.09338, over 21256.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3192, pruned_loss=0.08369, over 4269850.56 frames. 
], batch size: 159, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:39:35,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1197462.0, ans=0.1 2023-06-22 09:39:46,417 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:40:06,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1197582.0, ans=0.125 2023-06-22 09:40:12,988 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:40:18,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-22 09:40:50,879 INFO [train.py:996] (0/4) Epoch 7, batch 16650, loss[loss=0.2463, simple_loss=0.3538, pruned_loss=0.06942, over 20761.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.331, pruned_loss=0.08785, over 4275512.20 frames. ], batch size: 607, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:40:51,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1197702.0, ans=0.125 2023-06-22 09:41:14,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1197762.0, ans=0.1 2023-06-22 09:41:26,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.702e+02 3.520e+02 3.910e+02 4.811e+02 1.011e+03, threshold=7.820e+02, percent-clipped=1.0 2023-06-22 09:42:24,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.82 vs. limit=22.5 2023-06-22 09:42:28,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1197942.0, ans=0.1 2023-06-22 09:42:39,973 INFO [train.py:996] (0/4) Epoch 7, batch 16700, loss[loss=0.2137, simple_loss=0.2737, pruned_loss=0.07688, over 21425.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3312, pruned_loss=0.08911, over 4274313.08 frames. ], batch size: 194, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:42:41,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1198002.0, ans=0.125 2023-06-22 09:42:48,008 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.24 vs. limit=15.0 2023-06-22 09:42:52,656 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-22 09:43:49,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1198182.0, ans=0.04949747468305833 2023-06-22 09:44:01,906 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-22 09:44:27,970 INFO [train.py:996] (0/4) Epoch 7, batch 16750, loss[loss=0.2841, simple_loss=0.3732, pruned_loss=0.09753, over 21915.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3338, pruned_loss=0.09103, over 4273741.22 frames. 
], batch size: 372, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:44:31,352 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-22 09:44:32,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1198302.0, ans=0.0 2023-06-22 09:44:37,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1198302.0, ans=0.0 2023-06-22 09:45:09,622 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.580e+02 3.471e+02 3.936e+02 4.958e+02 1.171e+03, threshold=7.873e+02, percent-clipped=3.0 2023-06-22 09:45:40,293 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.50 vs. limit=15.0 2023-06-22 09:46:11,334 INFO [train.py:996] (0/4) Epoch 7, batch 16800, loss[loss=0.2714, simple_loss=0.4042, pruned_loss=0.06925, over 20704.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3393, pruned_loss=0.09143, over 4268422.49 frames. ], batch size: 607, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:46:14,107 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.39 vs. limit=15.0 2023-06-22 09:46:46,240 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-22 09:46:47,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1198662.0, ans=0.125 2023-06-22 09:47:22,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.17 vs. limit=10.0 2023-06-22 09:47:51,174 INFO [train.py:996] (0/4) Epoch 7, batch 16850, loss[loss=0.236, simple_loss=0.3281, pruned_loss=0.07199, over 17271.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3358, pruned_loss=0.09158, over 4272697.98 frames. ], batch size: 60, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:48:29,642 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 3.467e+02 4.300e+02 5.663e+02 1.182e+03, threshold=8.599e+02, percent-clipped=7.0 2023-06-22 09:48:30,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1198962.0, ans=0.2 2023-06-22 09:48:48,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1199022.0, ans=0.05 2023-06-22 09:49:30,257 INFO [train.py:996] (0/4) Epoch 7, batch 16900, loss[loss=0.2046, simple_loss=0.2809, pruned_loss=0.0642, over 21624.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3308, pruned_loss=0.09106, over 4273812.20 frames. ], batch size: 391, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:49:50,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1199262.0, ans=0.0 2023-06-22 09:49:56,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1199262.0, ans=0.125 2023-06-22 09:50:20,001 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. 
limit=10.0 2023-06-22 09:51:04,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1199502.0, ans=0.1 2023-06-22 09:51:05,728 INFO [train.py:996] (0/4) Epoch 7, batch 16950, loss[loss=0.2297, simple_loss=0.3019, pruned_loss=0.07872, over 21863.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3227, pruned_loss=0.08862, over 4268138.48 frames. ], batch size: 332, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:51:45,905 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 2.915e+02 3.202e+02 3.763e+02 5.382e+02, threshold=6.404e+02, percent-clipped=0.0 2023-06-22 09:51:46,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1199622.0, ans=0.125 2023-06-22 09:52:08,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1199682.0, ans=0.125 2023-06-22 09:52:50,019 INFO [train.py:996] (0/4) Epoch 7, batch 17000, loss[loss=0.2827, simple_loss=0.3404, pruned_loss=0.1125, over 21913.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3192, pruned_loss=0.08872, over 4275140.32 frames. ], batch size: 414, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:53:48,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-22 09:53:54,412 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-200000.pt 2023-06-22 09:54:07,253 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=12.0 2023-06-22 09:54:38,322 INFO [train.py:996] (0/4) Epoch 7, batch 17050, loss[loss=0.2822, simple_loss=0.3476, pruned_loss=0.1084, over 21443.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3253, pruned_loss=0.09112, over 4282446.40 frames. ], batch size: 211, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:54:38,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1200102.0, ans=0.125 2023-06-22 09:54:54,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1200162.0, ans=0.125 2023-06-22 09:55:08,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.478e+02 3.382e+02 4.158e+02 4.859e+02 8.252e+02, threshold=8.317e+02, percent-clipped=8.0 2023-06-22 09:55:49,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1200282.0, ans=0.0 2023-06-22 09:56:17,484 INFO [train.py:996] (0/4) Epoch 7, batch 17100, loss[loss=0.2393, simple_loss=0.3005, pruned_loss=0.08907, over 21651.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3249, pruned_loss=0.09127, over 4290585.65 frames. 
], batch size: 263, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:56:17,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1200402.0, ans=0.1 2023-06-22 09:57:13,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1200582.0, ans=0.0 2023-06-22 09:57:14,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1200582.0, ans=0.0 2023-06-22 09:57:40,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-22 09:57:40,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2023-06-22 09:57:49,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1200702.0, ans=0.2 2023-06-22 09:57:50,031 INFO [train.py:996] (0/4) Epoch 7, batch 17150, loss[loss=0.2448, simple_loss=0.3211, pruned_loss=0.08427, over 21558.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3201, pruned_loss=0.09038, over 4286502.84 frames. ], batch size: 471, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:58:00,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1200702.0, ans=0.1 2023-06-22 09:58:03,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1200702.0, ans=0.125 2023-06-22 09:58:04,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1200702.0, ans=0.125 2023-06-22 09:58:13,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1200762.0, ans=0.2 2023-06-22 09:58:21,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1200762.0, ans=0.1 2023-06-22 09:58:30,918 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.032e+02 3.543e+02 4.123e+02 6.537e+02, threshold=7.086e+02, percent-clipped=0.0 2023-06-22 09:59:12,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1200882.0, ans=0.125 2023-06-22 09:59:26,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1200942.0, ans=0.2 2023-06-22 09:59:27,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1200942.0, ans=0.1 2023-06-22 09:59:30,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1200942.0, ans=0.125 2023-06-22 09:59:37,120 INFO [train.py:996] (0/4) Epoch 7, batch 17200, loss[loss=0.2793, simple_loss=0.3406, pruned_loss=0.109, over 21542.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3208, pruned_loss=0.0905, over 4283672.41 frames. 
], batch size: 414, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:59:42,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1201002.0, ans=0.125 2023-06-22 10:00:00,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1201062.0, ans=0.125 2023-06-22 10:00:07,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1201062.0, ans=0.0 2023-06-22 10:00:30,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1201122.0, ans=0.125 2023-06-22 10:01:06,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-22 10:01:20,186 INFO [train.py:996] (0/4) Epoch 7, batch 17250, loss[loss=0.2594, simple_loss=0.3335, pruned_loss=0.09267, over 21623.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3247, pruned_loss=0.09251, over 4285130.79 frames. ], batch size: 263, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:01:35,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1201302.0, ans=0.05 2023-06-22 10:02:01,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.703e+02 3.318e+02 3.860e+02 4.888e+02 8.680e+02, threshold=7.720e+02, percent-clipped=6.0 2023-06-22 10:03:07,142 INFO [train.py:996] (0/4) Epoch 7, batch 17300, loss[loss=0.2869, simple_loss=0.3443, pruned_loss=0.1148, over 21326.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3313, pruned_loss=0.09481, over 4281308.17 frames. ], batch size: 176, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:03:25,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1201602.0, ans=0.125 2023-06-22 10:03:34,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1201662.0, ans=0.1 2023-06-22 10:03:35,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1201662.0, ans=0.1 2023-06-22 10:04:46,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1201842.0, ans=0.2 2023-06-22 10:04:50,689 INFO [train.py:996] (0/4) Epoch 7, batch 17350, loss[loss=0.1779, simple_loss=0.2249, pruned_loss=0.06546, over 17253.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3329, pruned_loss=0.09475, over 4269052.70 frames. ], batch size: 63, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:05:19,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1201962.0, ans=0.125 2023-06-22 10:05:23,420 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs. 
limit=10.0 2023-06-22 10:05:33,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1201962.0, ans=0.125 2023-06-22 10:05:36,386 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.363e+02 3.779e+02 4.471e+02 7.201e+02, threshold=7.558e+02, percent-clipped=0.0 2023-06-22 10:05:48,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1202022.0, ans=0.125 2023-06-22 10:05:54,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1202082.0, ans=0.0 2023-06-22 10:06:23,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1202142.0, ans=0.02 2023-06-22 10:06:25,258 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-22 10:06:32,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1202142.0, ans=0.0 2023-06-22 10:06:37,403 INFO [train.py:996] (0/4) Epoch 7, batch 17400, loss[loss=0.16, simple_loss=0.1975, pruned_loss=0.06129, over 16789.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3277, pruned_loss=0.09053, over 4260321.09 frames. ], batch size: 60, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:07:03,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1202262.0, ans=0.125 2023-06-22 10:07:27,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0 2023-06-22 10:08:24,862 INFO [train.py:996] (0/4) Epoch 7, batch 17450, loss[loss=0.2128, simple_loss=0.2614, pruned_loss=0.08214, over 21823.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3245, pruned_loss=0.08747, over 4263543.05 frames. ], batch size: 98, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 10:08:59,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1202562.0, ans=0.125 2023-06-22 10:09:02,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 3.174e+02 3.775e+02 5.488e+02 9.226e+02, threshold=7.551e+02, percent-clipped=5.0 2023-06-22 10:09:24,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1202682.0, ans=0.125 2023-06-22 10:09:25,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1202682.0, ans=0.125 2023-06-22 10:09:25,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1202682.0, ans=0.0 2023-06-22 10:09:27,968 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-22 10:09:31,103 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.74 vs. 
limit=12.0 2023-06-22 10:09:32,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1202682.0, ans=0.0 2023-06-22 10:09:47,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.00 vs. limit=15.0 2023-06-22 10:09:49,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1202742.0, ans=0.125 2023-06-22 10:10:06,316 INFO [train.py:996] (0/4) Epoch 7, batch 17500, loss[loss=0.2323, simple_loss=0.3015, pruned_loss=0.08153, over 21029.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.318, pruned_loss=0.08375, over 4268509.02 frames. ], batch size: 608, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 10:10:18,639 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-22 10:10:57,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1202982.0, ans=0.0 2023-06-22 10:11:03,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1202982.0, ans=0.1 2023-06-22 10:11:07,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.02 vs. limit=22.5 2023-06-22 10:11:27,346 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=12.0 2023-06-22 10:11:41,911 INFO [train.py:996] (0/4) Epoch 7, batch 17550, loss[loss=0.2255, simple_loss=0.3175, pruned_loss=0.06676, over 21742.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3177, pruned_loss=0.08232, over 4273221.04 frames. ], batch size: 124, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:12:10,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1203162.0, ans=0.0 2023-06-22 10:12:11,752 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:12:14,423 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 2.829e+02 3.350e+02 3.891e+02 7.522e+02, threshold=6.700e+02, percent-clipped=0.0 2023-06-22 10:13:17,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1203342.0, ans=0.125 2023-06-22 10:13:22,601 INFO [train.py:996] (0/4) Epoch 7, batch 17600, loss[loss=0.2532, simple_loss=0.3172, pruned_loss=0.0946, over 20112.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3215, pruned_loss=0.08325, over 4262715.69 frames. ], batch size: 703, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:13:38,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.84 vs. 
limit=15.0 2023-06-22 10:13:52,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1203462.0, ans=0.125 2023-06-22 10:14:09,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1203522.0, ans=0.125 2023-06-22 10:14:17,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1203582.0, ans=0.1 2023-06-22 10:15:03,509 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-22 10:15:03,732 INFO [train.py:996] (0/4) Epoch 7, batch 17650, loss[loss=0.2399, simple_loss=0.3196, pruned_loss=0.08014, over 21681.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3191, pruned_loss=0.08359, over 4266618.67 frames. ], batch size: 415, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:15:09,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1203702.0, ans=0.1 2023-06-22 10:15:11,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=22.5 2023-06-22 10:15:36,736 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.660e+02 3.234e+02 3.859e+02 4.407e+02 8.519e+02, threshold=7.719e+02, percent-clipped=7.0 2023-06-22 10:15:55,699 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:16:46,343 INFO [train.py:996] (0/4) Epoch 7, batch 17700, loss[loss=0.2471, simple_loss=0.324, pruned_loss=0.08509, over 21441.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3146, pruned_loss=0.08105, over 4272775.10 frames. ], batch size: 131, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:17:12,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1204062.0, ans=0.2 2023-06-22 10:17:40,498 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-22 10:18:23,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1204242.0, ans=0.125 2023-06-22 10:18:29,557 INFO [train.py:996] (0/4) Epoch 7, batch 17750, loss[loss=0.2876, simple_loss=0.3495, pruned_loss=0.1128, over 21335.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.323, pruned_loss=0.0853, over 4272036.48 frames. 
], batch size: 176, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:18:50,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1204362.0, ans=0.0 2023-06-22 10:19:13,752 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.318e+02 4.087e+02 5.384e+02 1.002e+03, threshold=8.174e+02, percent-clipped=10.0 2023-06-22 10:19:17,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1204422.0, ans=0.125 2023-06-22 10:19:47,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1204482.0, ans=0.125 2023-06-22 10:20:11,951 INFO [train.py:996] (0/4) Epoch 7, batch 17800, loss[loss=0.2716, simple_loss=0.347, pruned_loss=0.09809, over 20692.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3234, pruned_loss=0.08495, over 4261833.20 frames. ], batch size: 609, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:20:22,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1204602.0, ans=0.1 2023-06-22 10:20:24,991 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.51 vs. limit=15.0 2023-06-22 10:20:30,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1204602.0, ans=0.125 2023-06-22 10:20:32,575 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:21:22,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1204782.0, ans=0.125 2023-06-22 10:21:55,077 INFO [train.py:996] (0/4) Epoch 7, batch 17850, loss[loss=0.2109, simple_loss=0.2572, pruned_loss=0.08224, over 21723.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.322, pruned_loss=0.08414, over 4262961.45 frames. ], batch size: 112, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:22:36,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1204962.0, ans=0.04949747468305833 2023-06-22 10:22:45,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 3.209e+02 3.990e+02 4.443e+02 8.332e+02, threshold=7.980e+02, percent-clipped=3.0 2023-06-22 10:23:07,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1205082.0, ans=0.125 2023-06-22 10:23:34,381 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:23:38,661 INFO [train.py:996] (0/4) Epoch 7, batch 17900, loss[loss=0.2446, simple_loss=0.3224, pruned_loss=0.08339, over 21285.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3272, pruned_loss=0.08699, over 4267862.82 frames. ], batch size: 159, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:24:34,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1205322.0, ans=0.125 2023-06-22 10:24:35,443 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=15.0 2023-06-22 10:24:51,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1205382.0, ans=0.2 2023-06-22 10:24:54,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1205382.0, ans=0.1 2023-06-22 10:24:54,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1205382.0, ans=0.0 2023-06-22 10:24:54,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1205382.0, ans=0.125 2023-06-22 10:25:05,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1205442.0, ans=0.125 2023-06-22 10:25:24,800 INFO [train.py:996] (0/4) Epoch 7, batch 17950, loss[loss=0.2221, simple_loss=0.3151, pruned_loss=0.06454, over 21639.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3292, pruned_loss=0.08415, over 4270977.50 frames. ], batch size: 389, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:26:08,222 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.466e+02 3.180e+02 3.649e+02 4.821e+02 7.234e+02, threshold=7.298e+02, percent-clipped=0.0 2023-06-22 10:26:45,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1205742.0, ans=0.0 2023-06-22 10:26:54,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1205742.0, ans=0.125 2023-06-22 10:27:09,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1205802.0, ans=0.125 2023-06-22 10:27:10,986 INFO [train.py:996] (0/4) Epoch 7, batch 18000, loss[loss=0.2173, simple_loss=0.2794, pruned_loss=0.07758, over 21608.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3218, pruned_loss=0.08308, over 4268393.62 frames. ], batch size: 247, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:27:10,987 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 10:27:30,139 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.265, simple_loss=0.3646, pruned_loss=0.08269, over 1796401.00 frames. 2023-06-22 10:27:30,140 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-22 10:27:54,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1205862.0, ans=0.0 2023-06-22 10:28:37,703 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:29:12,743 INFO [train.py:996] (0/4) Epoch 7, batch 18050, loss[loss=0.2599, simple_loss=0.3192, pruned_loss=0.1003, over 21666.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3148, pruned_loss=0.08232, over 4271261.27 frames. 
], batch size: 298, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:29:26,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1206102.0, ans=0.0 2023-06-22 10:29:30,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1206162.0, ans=0.125 2023-06-22 10:29:36,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1206162.0, ans=0.125 2023-06-22 10:29:52,829 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.553e+02 3.561e+02 4.207e+02 5.144e+02 1.104e+03, threshold=8.414e+02, percent-clipped=10.0 2023-06-22 10:30:36,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1206282.0, ans=0.0 2023-06-22 10:30:55,015 INFO [train.py:996] (0/4) Epoch 7, batch 18100, loss[loss=0.287, simple_loss=0.3459, pruned_loss=0.114, over 21267.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.32, pruned_loss=0.08545, over 4277660.57 frames. ], batch size: 176, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:30:59,656 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.75 vs. limit=22.5 2023-06-22 10:31:07,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1206402.0, ans=0.125 2023-06-22 10:32:21,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1206642.0, ans=0.1 2023-06-22 10:32:35,108 INFO [train.py:996] (0/4) Epoch 7, batch 18150, loss[loss=0.2352, simple_loss=0.3096, pruned_loss=0.08043, over 21781.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3206, pruned_loss=0.08442, over 4282845.98 frames. ], batch size: 317, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:33:03,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1206762.0, ans=0.125 2023-06-22 10:33:15,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.590e+02 3.134e+02 3.517e+02 4.943e+02 8.965e+02, threshold=7.034e+02, percent-clipped=1.0 2023-06-22 10:33:24,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1206822.0, ans=0.0 2023-06-22 10:33:26,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1206822.0, ans=10.0 2023-06-22 10:33:39,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1206882.0, ans=0.125 2023-06-22 10:34:13,180 INFO [train.py:996] (0/4) Epoch 7, batch 18200, loss[loss=0.2226, simple_loss=0.2911, pruned_loss=0.077, over 21596.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3146, pruned_loss=0.08443, over 4288728.07 frames. 
], batch size: 415, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:34:20,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1207002.0, ans=0.125 2023-06-22 10:34:20,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1207002.0, ans=0.0 2023-06-22 10:35:21,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1207182.0, ans=0.0 2023-06-22 10:35:50,313 INFO [train.py:996] (0/4) Epoch 7, batch 18250, loss[loss=0.2072, simple_loss=0.28, pruned_loss=0.06715, over 21465.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3055, pruned_loss=0.0807, over 4271339.25 frames. ], batch size: 131, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:35:55,268 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-22 10:35:57,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1207302.0, ans=0.0 2023-06-22 10:36:25,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 3.178e+02 4.108e+02 6.214e+02 1.567e+03, threshold=8.215e+02, percent-clipped=16.0 2023-06-22 10:37:29,331 INFO [train.py:996] (0/4) Epoch 7, batch 18300, loss[loss=0.2769, simple_loss=0.3785, pruned_loss=0.0877, over 21833.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.306, pruned_loss=0.08132, over 4271933.94 frames. ], batch size: 371, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:37:32,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0 2023-06-22 10:37:41,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1207602.0, ans=0.2 2023-06-22 10:37:48,543 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=22.5 2023-06-22 10:37:56,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1207662.0, ans=0.05 2023-06-22 10:38:01,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1207722.0, ans=0.125 2023-06-22 10:38:06,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1207722.0, ans=0.1 2023-06-22 10:38:38,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1207782.0, ans=0.1 2023-06-22 10:38:46,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1207842.0, ans=0.1 2023-06-22 10:38:50,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1207842.0, ans=0.0 2023-06-22 10:39:08,704 INFO [train.py:996] (0/4) Epoch 7, batch 18350, loss[loss=0.1667, simple_loss=0.2395, pruned_loss=0.04693, over 16909.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3128, pruned_loss=0.0813, over 4261306.06 frames. 
], batch size: 63, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:39:43,996 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 3.179e+02 3.735e+02 4.992e+02 1.231e+03, threshold=7.469e+02, percent-clipped=7.0 2023-06-22 10:40:49,872 INFO [train.py:996] (0/4) Epoch 7, batch 18400, loss[loss=0.2279, simple_loss=0.3027, pruned_loss=0.07655, over 21618.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3085, pruned_loss=0.0798, over 4256600.54 frames. ], batch size: 414, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:41:17,039 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-22 10:41:20,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=15.0 2023-06-22 10:41:29,600 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.23 vs. limit=15.0 2023-06-22 10:41:34,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2023-06-22 10:42:29,250 INFO [train.py:996] (0/4) Epoch 7, batch 18450, loss[loss=0.2133, simple_loss=0.3074, pruned_loss=0.0596, over 21575.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3047, pruned_loss=0.0761, over 4254590.30 frames. ], batch size: 442, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:43:04,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.170e+02 3.772e+02 5.072e+02 1.044e+03, threshold=7.545e+02, percent-clipped=1.0 2023-06-22 10:43:16,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1208622.0, ans=0.1 2023-06-22 10:44:09,095 INFO [train.py:996] (0/4) Epoch 7, batch 18500, loss[loss=0.1854, simple_loss=0.254, pruned_loss=0.0584, over 21328.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3015, pruned_loss=0.07514, over 4250997.31 frames. ], batch size: 211, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:44:56,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208922.0, ans=0.1 2023-06-22 10:45:50,072 INFO [train.py:996] (0/4) Epoch 7, batch 18550, loss[loss=0.2038, simple_loss=0.2728, pruned_loss=0.06744, over 21373.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3003, pruned_loss=0.07401, over 4248648.00 frames. ], batch size: 194, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:46:01,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1209102.0, ans=0.1 2023-06-22 10:46:08,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1209162.0, ans=0.0 2023-06-22 10:46:32,523 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 3.124e+02 3.693e+02 4.756e+02 1.140e+03, threshold=7.385e+02, percent-clipped=12.0 2023-06-22 10:47:30,158 INFO [train.py:996] (0/4) Epoch 7, batch 18600, loss[loss=0.2568, simple_loss=0.3387, pruned_loss=0.08752, over 21789.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2985, pruned_loss=0.07516, over 4235837.86 frames. 
], batch size: 371, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:47:51,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1209462.0, ans=0.1 2023-06-22 10:47:51,998 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-22 10:48:52,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1209642.0, ans=0.125 2023-06-22 10:49:09,441 INFO [train.py:996] (0/4) Epoch 7, batch 18650, loss[loss=0.2153, simple_loss=0.2823, pruned_loss=0.07419, over 15106.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2979, pruned_loss=0.07567, over 4224399.45 frames. ], batch size: 60, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:49:22,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1209702.0, ans=0.125 2023-06-22 10:49:30,705 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.07 vs. limit=10.0 2023-06-22 10:49:34,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1209762.0, ans=0.1 2023-06-22 10:49:45,395 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 3.160e+02 3.578e+02 4.366e+02 8.700e+02, threshold=7.156e+02, percent-clipped=2.0 2023-06-22 10:50:07,098 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-22 10:50:23,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1209882.0, ans=0.125 2023-06-22 10:50:47,144 INFO [train.py:996] (0/4) Epoch 7, batch 18700, loss[loss=0.235, simple_loss=0.2932, pruned_loss=0.08845, over 21812.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2959, pruned_loss=0.07757, over 4240049.94 frames. ], batch size: 316, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:50:55,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1210002.0, ans=0.05 2023-06-22 10:51:17,843 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.33 vs. limit=8.0 2023-06-22 10:52:26,870 INFO [train.py:996] (0/4) Epoch 7, batch 18750, loss[loss=0.2381, simple_loss=0.297, pruned_loss=0.08962, over 21300.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2983, pruned_loss=0.07978, over 4254080.27 frames. ], batch size: 176, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:52:28,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1210302.0, ans=0.2 2023-06-22 10:53:03,964 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 3.195e+02 3.885e+02 4.969e+02 1.061e+03, threshold=7.770e+02, percent-clipped=4.0 2023-06-22 10:53:25,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.45 vs. 
limit=12.0 2023-06-22 10:54:01,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1210542.0, ans=0.125 2023-06-22 10:54:05,706 INFO [train.py:996] (0/4) Epoch 7, batch 18800, loss[loss=0.2174, simple_loss=0.3067, pruned_loss=0.0641, over 21786.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3044, pruned_loss=0.08178, over 4252734.93 frames. ], batch size: 282, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:54:17,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1210602.0, ans=0.95 2023-06-22 10:54:24,143 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-22 10:54:28,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1210662.0, ans=0.125 2023-06-22 10:54:45,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1210722.0, ans=0.0 2023-06-22 10:55:44,306 INFO [train.py:996] (0/4) Epoch 7, batch 18850, loss[loss=0.1931, simple_loss=0.2781, pruned_loss=0.0541, over 21623.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2989, pruned_loss=0.07661, over 4249239.49 frames. ], batch size: 263, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:55:57,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1210902.0, ans=0.0 2023-06-22 10:56:01,652 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=15.0 2023-06-22 10:56:03,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1210962.0, ans=0.125 2023-06-22 10:56:09,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1210962.0, ans=0.1 2023-06-22 10:56:21,059 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 3.160e+02 3.995e+02 5.299e+02 8.301e+02, threshold=7.991e+02, percent-clipped=3.0 2023-06-22 10:56:49,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.77 vs. limit=8.0 2023-06-22 10:57:24,834 INFO [train.py:996] (0/4) Epoch 7, batch 18900, loss[loss=0.2022, simple_loss=0.2714, pruned_loss=0.0665, over 21611.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2965, pruned_loss=0.07642, over 4256029.39 frames. 
], batch size: 298, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:57:59,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1211322.0, ans=0.125 2023-06-22 10:58:22,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1211382.0, ans=10.0 2023-06-22 10:58:23,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1211382.0, ans=0.125 2023-06-22 10:58:59,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1211502.0, ans=0.1 2023-06-22 10:59:00,586 INFO [train.py:996] (0/4) Epoch 7, batch 18950, loss[loss=0.2297, simple_loss=0.2988, pruned_loss=0.08032, over 21545.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.298, pruned_loss=0.07952, over 4269882.41 frames. ], batch size: 131, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:59:38,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.670e+02 3.300e+02 3.868e+02 4.844e+02 6.994e+02, threshold=7.736e+02, percent-clipped=0.0 2023-06-22 11:00:36,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1211802.0, ans=0.125 2023-06-22 11:00:38,071 INFO [train.py:996] (0/4) Epoch 7, batch 19000, loss[loss=0.2531, simple_loss=0.296, pruned_loss=0.1051, over 21835.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3083, pruned_loss=0.08201, over 4270929.33 frames. ], batch size: 98, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 11:00:46,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1211802.0, ans=0.0 2023-06-22 11:00:54,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1211862.0, ans=0.125 2023-06-22 11:01:11,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1211922.0, ans=0.125 2023-06-22 11:02:18,623 INFO [train.py:996] (0/4) Epoch 7, batch 19050, loss[loss=0.2337, simple_loss=0.3014, pruned_loss=0.08294, over 21312.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3125, pruned_loss=0.08509, over 4276596.17 frames. ], batch size: 143, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:02:45,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1212162.0, ans=0.1 2023-06-22 11:03:06,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.274e+02 3.680e+02 4.051e+02 6.947e+02, threshold=7.360e+02, percent-clipped=0.0 2023-06-22 11:03:45,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1212342.0, ans=0.125 2023-06-22 11:03:57,470 INFO [train.py:996] (0/4) Epoch 7, batch 19100, loss[loss=0.2357, simple_loss=0.2968, pruned_loss=0.08732, over 21259.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3108, pruned_loss=0.08612, over 4280309.72 frames. ], batch size: 471, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:04:00,325 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.30 vs. 
limit=15.0 2023-06-22 11:05:24,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1212642.0, ans=0.2 2023-06-22 11:05:40,910 INFO [train.py:996] (0/4) Epoch 7, batch 19150, loss[loss=0.2474, simple_loss=0.3399, pruned_loss=0.0775, over 21495.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3138, pruned_loss=0.08719, over 4276720.76 frames. ], batch size: 230, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:06:42,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.634e+02 3.652e+02 4.521e+02 6.039e+02 1.131e+03, threshold=9.042e+02, percent-clipped=10.0 2023-06-22 11:06:54,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1212882.0, ans=0.0 2023-06-22 11:06:56,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1212882.0, ans=0.125 2023-06-22 11:07:15,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1212942.0, ans=0.125 2023-06-22 11:07:23,672 INFO [train.py:996] (0/4) Epoch 7, batch 19200, loss[loss=0.2853, simple_loss=0.381, pruned_loss=0.09485, over 21713.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3251, pruned_loss=0.08808, over 4271167.93 frames. ], batch size: 351, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:07:27,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1213002.0, ans=0.125 2023-06-22 11:07:47,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1213002.0, ans=0.0 2023-06-22 11:08:58,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1213242.0, ans=0.0 2023-06-22 11:09:02,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1213302.0, ans=0.125 2023-06-22 11:09:03,391 INFO [train.py:996] (0/4) Epoch 7, batch 19250, loss[loss=0.2012, simple_loss=0.2755, pruned_loss=0.06349, over 21382.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3245, pruned_loss=0.0825, over 4274282.58 frames. ], batch size: 131, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:10:00,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1213422.0, ans=0.0 2023-06-22 11:10:04,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 3.337e+02 4.181e+02 5.636e+02 1.044e+03, threshold=8.362e+02, percent-clipped=4.0 2023-06-22 11:10:12,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1213482.0, ans=0.025 2023-06-22 11:10:43,624 INFO [train.py:996] (0/4) Epoch 7, batch 19300, loss[loss=0.2407, simple_loss=0.3042, pruned_loss=0.08859, over 21551.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.322, pruned_loss=0.08192, over 4273188.69 frames. 
], batch size: 548, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:10:53,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1213602.0, ans=0.1 2023-06-22 11:11:39,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-06-22 11:12:03,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0 2023-06-22 11:12:07,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1213842.0, ans=0.1 2023-06-22 11:12:31,360 INFO [train.py:996] (0/4) Epoch 7, batch 19350, loss[loss=0.2203, simple_loss=0.3023, pruned_loss=0.06917, over 21626.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3163, pruned_loss=0.07853, over 4273599.95 frames. ], batch size: 263, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:12:33,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1213902.0, ans=0.125 2023-06-22 11:13:11,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1214022.0, ans=0.125 2023-06-22 11:13:19,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1214022.0, ans=0.2 2023-06-22 11:13:20,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 3.132e+02 3.696e+02 4.468e+02 9.223e+02, threshold=7.391e+02, percent-clipped=2.0 2023-06-22 11:13:30,505 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-06-22 11:13:31,990 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-22 11:13:39,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1214082.0, ans=0.0 2023-06-22 11:14:04,216 INFO [train.py:996] (0/4) Epoch 7, batch 19400, loss[loss=0.2112, simple_loss=0.2816, pruned_loss=0.07038, over 21679.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3124, pruned_loss=0.07698, over 4278415.49 frames. ], batch size: 230, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:14:22,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1214202.0, ans=0.125 2023-06-22 11:15:07,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1214382.0, ans=0.125 2023-06-22 11:15:43,329 INFO [train.py:996] (0/4) Epoch 7, batch 19450, loss[loss=0.2307, simple_loss=0.291, pruned_loss=0.08516, over 21248.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3096, pruned_loss=0.07883, over 4277629.15 frames. 
], batch size: 159, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:16:38,389 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.537e+02 3.055e+02 3.774e+02 4.517e+02 1.086e+03, threshold=7.548e+02, percent-clipped=5.0 2023-06-22 11:17:09,324 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:17:23,333 INFO [train.py:996] (0/4) Epoch 7, batch 19500, loss[loss=0.3092, simple_loss=0.3732, pruned_loss=0.1226, over 21440.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3065, pruned_loss=0.08018, over 4276092.10 frames. ], batch size: 507, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:17:54,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1214862.0, ans=0.125 2023-06-22 11:18:03,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1214862.0, ans=0.1 2023-06-22 11:18:03,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1214862.0, ans=0.125 2023-06-22 11:18:14,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1214922.0, ans=0.125 2023-06-22 11:18:29,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1214982.0, ans=0.0 2023-06-22 11:18:32,713 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0 2023-06-22 11:19:05,704 INFO [train.py:996] (0/4) Epoch 7, batch 19550, loss[loss=0.2155, simple_loss=0.3137, pruned_loss=0.05865, over 21758.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3033, pruned_loss=0.07897, over 4274567.54 frames. ], batch size: 298, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:19:30,979 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-22 11:19:39,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1215162.0, ans=0.125 2023-06-22 11:19:41,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1215162.0, ans=0.1 2023-06-22 11:19:42,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1215162.0, ans=0.2 2023-06-22 11:19:55,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 3.073e+02 3.530e+02 4.388e+02 8.690e+02, threshold=7.059e+02, percent-clipped=2.0 2023-06-22 11:20:39,272 INFO [train.py:996] (0/4) Epoch 7, batch 19600, loss[loss=0.2371, simple_loss=0.3037, pruned_loss=0.08521, over 21807.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3046, pruned_loss=0.07895, over 4282350.11 frames. ], batch size: 298, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:20:51,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1215402.0, ans=0.0 2023-06-22 11:20:51,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. 
limit=15.0 2023-06-22 11:21:15,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1215462.0, ans=0.125 2023-06-22 11:22:20,327 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-22 11:22:22,675 INFO [train.py:996] (0/4) Epoch 7, batch 19650, loss[loss=0.2413, simple_loss=0.3236, pruned_loss=0.07954, over 20005.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3117, pruned_loss=0.08408, over 4282409.38 frames. ], batch size: 702, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:22:36,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1215702.0, ans=0.1 2023-06-22 11:23:15,650 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 3.627e+02 4.077e+02 5.113e+02 8.180e+02, threshold=8.154e+02, percent-clipped=7.0 2023-06-22 11:24:15,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1216002.0, ans=0.125 2023-06-22 11:24:16,451 INFO [train.py:996] (0/4) Epoch 7, batch 19700, loss[loss=0.2081, simple_loss=0.2963, pruned_loss=0.05999, over 21637.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.313, pruned_loss=0.08416, over 4277762.77 frames. ], batch size: 247, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:25:08,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1216122.0, ans=0.1 2023-06-22 11:25:08,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1216122.0, ans=0.04949747468305833 2023-06-22 11:25:17,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0 2023-06-22 11:25:58,837 INFO [train.py:996] (0/4) Epoch 7, batch 19750, loss[loss=0.228, simple_loss=0.3043, pruned_loss=0.07586, over 21394.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3227, pruned_loss=0.08594, over 4269707.58 frames. ], batch size: 131, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:26:10,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1216302.0, ans=0.125 2023-06-22 11:26:13,986 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-22 11:26:30,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1216422.0, ans=0.125 2023-06-22 11:26:44,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.745e+02 3.729e+02 4.611e+02 5.991e+02 1.312e+03, threshold=9.223e+02, percent-clipped=7.0 2023-06-22 11:27:31,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1216542.0, ans=0.0 2023-06-22 11:27:38,688 INFO [train.py:996] (0/4) Epoch 7, batch 19800, loss[loss=0.213, simple_loss=0.2739, pruned_loss=0.07607, over 21304.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3208, pruned_loss=0.08614, over 4278192.43 frames. 
], batch size: 159, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:28:26,202 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-22 11:28:43,736 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.22 vs. limit=22.5 2023-06-22 11:29:21,351 INFO [train.py:996] (0/4) Epoch 7, batch 19850, loss[loss=0.1819, simple_loss=0.2748, pruned_loss=0.04454, over 21597.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3136, pruned_loss=0.08078, over 4273283.55 frames. ], batch size: 230, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:29:23,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1216902.0, ans=0.2 2023-06-22 11:29:25,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1216902.0, ans=0.0 2023-06-22 11:30:12,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 2.965e+02 3.588e+02 4.617e+02 1.028e+03, threshold=7.176e+02, percent-clipped=3.0 2023-06-22 11:30:20,794 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-22 11:30:38,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1217082.0, ans=0.125 2023-06-22 11:31:00,272 INFO [train.py:996] (0/4) Epoch 7, batch 19900, loss[loss=0.2377, simple_loss=0.3083, pruned_loss=0.08357, over 21438.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.313, pruned_loss=0.07789, over 4270594.45 frames. ], batch size: 507, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:31:06,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1217202.0, ans=0.125 2023-06-22 11:32:31,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1217442.0, ans=0.0 2023-06-22 11:32:42,614 INFO [train.py:996] (0/4) Epoch 7, batch 19950, loss[loss=0.2559, simple_loss=0.3158, pruned_loss=0.09802, over 21749.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3085, pruned_loss=0.07836, over 4269062.28 frames. ], batch size: 102, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:32:48,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1217502.0, ans=0.125 2023-06-22 11:33:39,621 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.305e+02 4.065e+02 5.440e+02 9.798e+02, threshold=8.130e+02, percent-clipped=10.0 2023-06-22 11:33:56,641 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=22.5 2023-06-22 11:34:05,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1217742.0, ans=0.1 2023-06-22 11:34:20,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1217742.0, ans=0.125 2023-06-22 11:34:22,893 INFO [train.py:996] (0/4) Epoch 7, batch 20000, loss[loss=0.25, simple_loss=0.319, pruned_loss=0.0905, over 21514.00 frames. 
], tot_loss[loss=0.2335, simple_loss=0.3089, pruned_loss=0.07898, over 4252633.76 frames. ], batch size: 195, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:34:28,415 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:34:59,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1217862.0, ans=0.125 2023-06-22 11:35:14,501 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2023-06-22 11:36:02,160 INFO [train.py:996] (0/4) Epoch 7, batch 20050, loss[loss=0.2356, simple_loss=0.3051, pruned_loss=0.08303, over 21259.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3118, pruned_loss=0.08204, over 4265558.55 frames. ], batch size: 159, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:36:02,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1218102.0, ans=0.2 2023-06-22 11:36:10,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1218102.0, ans=0.2 2023-06-22 11:36:25,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1218162.0, ans=0.125 2023-06-22 11:37:00,027 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 3.182e+02 3.871e+02 4.475e+02 7.153e+02, threshold=7.741e+02, percent-clipped=0.0 2023-06-22 11:37:19,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-22 11:37:49,414 INFO [train.py:996] (0/4) Epoch 7, batch 20100, loss[loss=0.2505, simple_loss=0.3303, pruned_loss=0.0853, over 21817.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3141, pruned_loss=0.0846, over 4275414.99 frames. ], batch size: 298, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:39:26,441 INFO [train.py:996] (0/4) Epoch 7, batch 20150, loss[loss=0.2146, simple_loss=0.2643, pruned_loss=0.08246, over 20357.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3217, pruned_loss=0.08763, over 4275232.31 frames. ], batch size: 703, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:40:28,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.601e+02 4.003e+02 4.787e+02 6.267e+02 1.040e+03, threshold=9.575e+02, percent-clipped=17.0 2023-06-22 11:40:32,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.14 vs. limit=15.0 2023-06-22 11:41:15,567 INFO [train.py:996] (0/4) Epoch 7, batch 20200, loss[loss=0.2868, simple_loss=0.392, pruned_loss=0.0908, over 20755.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3276, pruned_loss=0.09098, over 4276329.41 frames. 
], batch size: 607, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:42:19,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1219182.0, ans=0.125 2023-06-22 11:42:29,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1219182.0, ans=0.125 2023-06-22 11:43:00,771 INFO [train.py:996] (0/4) Epoch 7, batch 20250, loss[loss=0.2471, simple_loss=0.3158, pruned_loss=0.08923, over 21328.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3282, pruned_loss=0.08927, over 4275412.48 frames. ], batch size: 176, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:43:16,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1219302.0, ans=0.09899494936611666 2023-06-22 11:43:54,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.109e+02 3.852e+02 4.558e+02 1.289e+03, threshold=7.704e+02, percent-clipped=1.0 2023-06-22 11:44:03,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1219482.0, ans=0.0 2023-06-22 11:44:06,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1219482.0, ans=0.1 2023-06-22 11:44:40,581 INFO [train.py:996] (0/4) Epoch 7, batch 20300, loss[loss=0.2579, simple_loss=0.342, pruned_loss=0.08686, over 21610.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3269, pruned_loss=0.08681, over 4274986.24 frames. ], batch size: 389, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:44:50,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1219602.0, ans=0.0 2023-06-22 11:45:24,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1219722.0, ans=0.1 2023-06-22 11:45:45,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1219782.0, ans=0.125 2023-06-22 11:45:52,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1219842.0, ans=0.125 2023-06-22 11:45:55,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1219842.0, ans=0.125 2023-06-22 11:46:03,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1219842.0, ans=0.0 2023-06-22 11:46:05,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1219842.0, ans=0.125 2023-06-22 11:46:18,503 INFO [train.py:996] (0/4) Epoch 7, batch 20350, loss[loss=0.2839, simple_loss=0.3463, pruned_loss=0.1108, over 21857.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3251, pruned_loss=0.08578, over 4267572.75 frames. ], batch size: 351, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:46:40,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1219962.0, ans=0.125 2023-06-22 11:47:02,365 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.53 vs. 
limit=15.0 2023-06-22 11:47:05,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.59 vs. limit=5.0 2023-06-22 11:47:11,893 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.199e+02 3.639e+02 4.659e+02 8.452e+02, threshold=7.278e+02, percent-clipped=1.0 2023-06-22 11:47:28,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1220082.0, ans=0.125 2023-06-22 11:47:58,718 INFO [train.py:996] (0/4) Epoch 7, batch 20400, loss[loss=0.2512, simple_loss=0.3574, pruned_loss=0.07251, over 19861.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3291, pruned_loss=0.08892, over 4261297.49 frames. ], batch size: 704, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 11:48:13,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1220202.0, ans=0.125 2023-06-22 11:48:28,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1220262.0, ans=0.0 2023-06-22 11:48:54,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1220322.0, ans=0.2 2023-06-22 11:49:04,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=22.5 2023-06-22 11:49:43,954 INFO [train.py:996] (0/4) Epoch 7, batch 20450, loss[loss=0.2712, simple_loss=0.3248, pruned_loss=0.1088, over 21924.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3296, pruned_loss=0.09113, over 4261168.48 frames. ], batch size: 113, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 11:50:00,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1220562.0, ans=0.125 2023-06-22 11:50:30,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.680e+02 3.380e+02 3.850e+02 4.870e+02 7.513e+02, threshold=7.700e+02, percent-clipped=1.0 2023-06-22 11:51:16,550 INFO [train.py:996] (0/4) Epoch 7, batch 20500, loss[loss=0.2217, simple_loss=0.281, pruned_loss=0.08116, over 21375.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3263, pruned_loss=0.09157, over 4247734.75 frames. ], batch size: 548, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:51:31,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1220802.0, ans=0.1 2023-06-22 11:51:40,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1220862.0, ans=0.125 2023-06-22 11:51:53,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1220922.0, ans=0.04949747468305833 2023-06-22 11:52:39,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1221042.0, ans=0.125 2023-06-22 11:53:01,612 INFO [train.py:996] (0/4) Epoch 7, batch 20550, loss[loss=0.2009, simple_loss=0.2746, pruned_loss=0.06358, over 16153.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3193, pruned_loss=0.08929, over 4232480.82 frames. 
], batch size: 60, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:53:02,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1221102.0, ans=0.1 2023-06-22 11:53:03,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1221102.0, ans=0.125 2023-06-22 11:53:13,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1221102.0, ans=0.125 2023-06-22 11:53:51,913 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.265e+02 4.144e+02 5.422e+02 9.318e+02, threshold=8.288e+02, percent-clipped=6.0 2023-06-22 11:53:52,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1221222.0, ans=0.125 2023-06-22 11:54:05,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1221282.0, ans=0.0 2023-06-22 11:54:11,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1221282.0, ans=0.2 2023-06-22 11:54:36,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1221342.0, ans=0.2 2023-06-22 11:54:40,932 INFO [train.py:996] (0/4) Epoch 7, batch 20600, loss[loss=0.2614, simple_loss=0.3385, pruned_loss=0.09212, over 16853.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3203, pruned_loss=0.0878, over 4214498.98 frames. ], batch size: 60, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:55:18,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1221522.0, ans=0.0 2023-06-22 11:55:31,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1221582.0, ans=0.125 2023-06-22 11:56:19,455 INFO [train.py:996] (0/4) Epoch 7, batch 20650, loss[loss=0.2306, simple_loss=0.3055, pruned_loss=0.07781, over 21789.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3173, pruned_loss=0.08828, over 4236212.08 frames. ], batch size: 351, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:56:27,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1221702.0, ans=0.025 2023-06-22 11:57:08,837 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.629e+02 3.307e+02 4.027e+02 4.834e+02 1.059e+03, threshold=8.054e+02, percent-clipped=3.0 2023-06-22 11:57:22,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1221882.0, ans=0.125 2023-06-22 11:57:26,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.38 vs. limit=10.0 2023-06-22 11:57:42,533 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-22 11:57:59,199 INFO [train.py:996] (0/4) Epoch 7, batch 20700, loss[loss=0.3023, simple_loss=0.3888, pruned_loss=0.1079, over 21491.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3108, pruned_loss=0.08508, over 4248570.94 frames. 
], batch size: 508, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:58:19,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1222062.0, ans=0.125 2023-06-22 11:59:17,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1222182.0, ans=0.2 2023-06-22 11:59:41,097 INFO [train.py:996] (0/4) Epoch 7, batch 20750, loss[loss=0.2412, simple_loss=0.3269, pruned_loss=0.07777, over 21587.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3118, pruned_loss=0.08346, over 4258425.97 frames. ], batch size: 230, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:59:43,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1222302.0, ans=0.2 2023-06-22 12:00:04,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. limit=10.0 2023-06-22 12:00:28,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1222422.0, ans=0.1 2023-06-22 12:00:36,506 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.565e+02 3.483e+02 4.528e+02 6.877e+02 1.317e+03, threshold=9.056e+02, percent-clipped=16.0 2023-06-22 12:00:37,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.57 vs. limit=10.0 2023-06-22 12:01:26,507 INFO [train.py:996] (0/4) Epoch 7, batch 20800, loss[loss=0.2357, simple_loss=0.3161, pruned_loss=0.0776, over 21647.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3169, pruned_loss=0.08507, over 4263744.20 frames. ], batch size: 332, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 12:01:33,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1222602.0, ans=0.125 2023-06-22 12:01:54,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1222662.0, ans=0.2 2023-06-22 12:02:25,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1222782.0, ans=0.0 2023-06-22 12:02:53,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1222842.0, ans=0.0 2023-06-22 12:02:58,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1222842.0, ans=0.0 2023-06-22 12:03:02,334 INFO [train.py:996] (0/4) Epoch 7, batch 20850, loss[loss=0.2148, simple_loss=0.2755, pruned_loss=0.07703, over 21506.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3107, pruned_loss=0.083, over 4268503.38 frames. 
], batch size: 212, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 12:03:29,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1222962.0, ans=0.0 2023-06-22 12:03:37,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1223022.0, ans=0.1 2023-06-22 12:04:01,382 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.647e+02 5.072e+02 6.568e+02 1.337e+03, threshold=1.014e+03, percent-clipped=9.0 2023-06-22 12:04:12,856 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-22 12:04:24,003 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-22 12:04:46,077 INFO [train.py:996] (0/4) Epoch 7, batch 20900, loss[loss=0.2376, simple_loss=0.3092, pruned_loss=0.08301, over 21869.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3105, pruned_loss=0.08406, over 4281378.44 frames. ], batch size: 124, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 12:05:00,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1223262.0, ans=0.0 2023-06-22 12:05:07,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-22 12:05:16,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1223322.0, ans=0.04949747468305833 2023-06-22 12:05:31,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1223322.0, ans=0.07 2023-06-22 12:06:07,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1223442.0, ans=0.5 2023-06-22 12:06:17,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1223442.0, ans=0.2 2023-06-22 12:06:19,592 INFO [train.py:996] (0/4) Epoch 7, batch 20950, loss[loss=0.2064, simple_loss=0.2855, pruned_loss=0.06359, over 20822.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3066, pruned_loss=0.08039, over 4269194.79 frames. ], batch size: 608, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:06:35,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-22 12:07:15,572 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.036e+02 3.516e+02 4.387e+02 8.628e+02, threshold=7.032e+02, percent-clipped=0.0 2023-06-22 12:07:23,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1223682.0, ans=0.125 2023-06-22 12:07:25,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1223682.0, ans=0.1 2023-06-22 12:07:48,565 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.32 vs. 
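The many "ScheduledFloat: name=..., batch_count=..., ans=..." entries print, for a named hyper-parameter (dropout probabilities, skip rates, balancer probabilities, bypass scales), the value in effect at the current batch count; the values drift smoothly as training progresses, which is the behaviour of a schedule interpolated over batch counts. A toy stand-in is sketched below; the breakpoints are invented for illustration and are not the ones used in this run.

class PiecewiseLinearSchedule:
    """Toy batch-count-keyed schedule; a stand-in, not the scaling.py class."""

    def __init__(self, *points):
        self.points = sorted(points)          # (batch_count, value) pairs

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

dropout_p = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1))  # invented breakpoints
print(dropout_p(1221282.0))                   # 0.1 once the schedule has flattened out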
limit=15.0 2023-06-22 12:07:57,804 INFO [train.py:996] (0/4) Epoch 7, batch 21000, loss[loss=0.2337, simple_loss=0.3175, pruned_loss=0.07498, over 21830.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3049, pruned_loss=0.08077, over 4281029.47 frames. ], batch size: 282, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:07:57,805 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 12:08:15,952 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2689, simple_loss=0.3672, pruned_loss=0.08525, over 1796401.00 frames. 2023-06-22 12:08:15,953 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-22 12:09:15,866 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-204000.pt 2023-06-22 12:09:54,863 INFO [train.py:996] (0/4) Epoch 7, batch 21050, loss[loss=0.2157, simple_loss=0.2822, pruned_loss=0.07458, over 21256.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3023, pruned_loss=0.08111, over 4286122.08 frames. ], batch size: 159, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:10:49,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.015e+02 3.354e+02 4.094e+02 5.427e+02, threshold=6.709e+02, percent-clipped=0.0 2023-06-22 12:11:33,492 INFO [train.py:996] (0/4) Epoch 7, batch 21100, loss[loss=0.2291, simple_loss=0.2931, pruned_loss=0.08249, over 21311.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2988, pruned_loss=0.08054, over 4271509.42 frames. ], batch size: 177, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:12:40,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1224582.0, ans=0.0 2023-06-22 12:12:40,943 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:12:56,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1224642.0, ans=0.0 2023-06-22 12:13:07,724 INFO [train.py:996] (0/4) Epoch 7, batch 21150, loss[loss=0.2223, simple_loss=0.2824, pruned_loss=0.08109, over 21772.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2955, pruned_loss=0.08087, over 4258195.47 frames. ], batch size: 317, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:13:53,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1224822.0, ans=0.1 2023-06-22 12:14:08,458 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 3.198e+02 3.741e+02 4.699e+02 9.376e+02, threshold=7.483e+02, percent-clipped=8.0 2023-06-22 12:14:19,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1224882.0, ans=0.0 2023-06-22 12:14:46,332 INFO [train.py:996] (0/4) Epoch 7, batch 21200, loss[loss=0.2003, simple_loss=0.2417, pruned_loss=0.07943, over 20108.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2911, pruned_loss=0.07976, over 4256127.85 frames. 
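The checkpoint.py entry above ("Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-204000.pt") is a periodic save keyed on the global batch index, which is embedded in the file name. A generic sketch of that pattern follows; the save interval and the exact contents of the saved dict are assumptions, not a description of checkpoint.py.

import torch
from pathlib import Path

def maybe_save_checkpoint(model, optimizer, batch_idx_train,
                          exp_dir="zipformer/exp_L_small_causal",
                          save_every_n=4000):             # interval is an assumption
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return None
    path = Path(exp_dir) / f"checkpoint-{batch_idx_train}.pt"
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        path,
    )
    return path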
], batch size: 703, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:15:17,637 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:15:27,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1225122.0, ans=0.125 2023-06-22 12:15:32,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1225122.0, ans=0.125 2023-06-22 12:16:00,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1225182.0, ans=0.1 2023-06-22 12:16:21,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1225242.0, ans=0.025 2023-06-22 12:16:30,849 INFO [train.py:996] (0/4) Epoch 7, batch 21250, loss[loss=0.206, simple_loss=0.2684, pruned_loss=0.07177, over 21186.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2894, pruned_loss=0.07947, over 4248239.69 frames. ], batch size: 176, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:17:27,711 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.499e+02 3.297e+02 3.945e+02 5.021e+02 1.062e+03, threshold=7.890e+02, percent-clipped=7.0 2023-06-22 12:17:49,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1225542.0, ans=0.1 2023-06-22 12:18:03,972 INFO [train.py:996] (0/4) Epoch 7, batch 21300, loss[loss=0.2676, simple_loss=0.3325, pruned_loss=0.1014, over 21846.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.2977, pruned_loss=0.08155, over 4252696.35 frames. ], batch size: 391, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:19:36,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-22 12:19:45,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1225842.0, ans=0.125 2023-06-22 12:19:47,795 INFO [train.py:996] (0/4) Epoch 7, batch 21350, loss[loss=0.2208, simple_loss=0.3093, pruned_loss=0.06612, over 21822.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3018, pruned_loss=0.08272, over 4251515.07 frames. ], batch size: 316, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:20:39,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1226022.0, ans=0.125 2023-06-22 12:20:40,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1226022.0, ans=0.0 2023-06-22 12:20:45,054 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 3.191e+02 3.567e+02 4.757e+02 8.464e+02, threshold=7.133e+02, percent-clipped=1.0 2023-06-22 12:20:47,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-22 12:21:26,957 INFO [train.py:996] (0/4) Epoch 7, batch 21400, loss[loss=0.2195, simple_loss=0.2824, pruned_loss=0.07834, over 21683.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3045, pruned_loss=0.08248, over 4251617.18 frames. 
], batch size: 112, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:21:42,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1226262.0, ans=0.125 2023-06-22 12:21:58,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1226262.0, ans=0.0 2023-06-22 12:22:01,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1226262.0, ans=0.125 2023-06-22 12:22:44,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1226442.0, ans=0.125 2023-06-22 12:23:03,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1226442.0, ans=0.0 2023-06-22 12:23:06,025 INFO [train.py:996] (0/4) Epoch 7, batch 21450, loss[loss=0.2797, simple_loss=0.3356, pruned_loss=0.1119, over 21890.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3087, pruned_loss=0.08454, over 4264337.42 frames. ], batch size: 124, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:23:12,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1226502.0, ans=0.125 2023-06-22 12:23:54,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1226622.0, ans=0.125 2023-06-22 12:24:04,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 3.251e+02 3.638e+02 4.479e+02 7.872e+02, threshold=7.276e+02, percent-clipped=2.0 2023-06-22 12:24:34,632 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=22.5 2023-06-22 12:24:43,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1226802.0, ans=0.0 2023-06-22 12:24:45,020 INFO [train.py:996] (0/4) Epoch 7, batch 21500, loss[loss=0.2916, simple_loss=0.3255, pruned_loss=0.1288, over 21538.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3082, pruned_loss=0.08558, over 4273290.90 frames. ], batch size: 511, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:25:09,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1226862.0, ans=0.125 2023-06-22 12:25:13,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-22 12:25:23,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1226862.0, ans=0.125 2023-06-22 12:26:13,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1227042.0, ans=0.0 2023-06-22 12:26:22,147 INFO [train.py:996] (0/4) Epoch 7, batch 21550, loss[loss=0.3166, simple_loss=0.4481, pruned_loss=0.09254, over 19692.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3022, pruned_loss=0.0831, over 4256649.94 frames. 
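The grad_scale values attached to the loss lines (8.0, 16.0, 32.0, ...) step up and down by factors of two, which is how a dynamic loss scaler behaves in mixed-precision training. A generic PyTorch AMP step with such a scaler is sketched below; it illustrates the mechanism only, and model/optimizer/batch are placeholders rather than the objects used in this run.

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=16.0)

def fp16_train_step(model, optimizer, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch)            # placeholder: assumed to return a scalar loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)             # the step is skipped if gradients overflowed
    scaler.update()                    # the scale is halved on overflow and grown
                                       # again after a run of clean steps
    return loss.detach(), scaler.get_scale()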
], batch size: 702, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:26:24,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1227102.0, ans=0.1 2023-06-22 12:27:21,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.0 2023-06-22 12:27:22,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.394e+02 4.273e+02 5.102e+02 8.166e+02, threshold=8.546e+02, percent-clipped=3.0 2023-06-22 12:27:25,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1227282.0, ans=0.125 2023-06-22 12:27:36,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1227282.0, ans=15.0 2023-06-22 12:27:57,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1227342.0, ans=0.2 2023-06-22 12:28:03,032 INFO [train.py:996] (0/4) Epoch 7, batch 21600, loss[loss=0.2189, simple_loss=0.2827, pruned_loss=0.07757, over 21828.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.2963, pruned_loss=0.08121, over 4256618.96 frames. ], batch size: 352, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:28:24,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1227462.0, ans=0.0 2023-06-22 12:28:59,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-22 12:29:08,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1227582.0, ans=0.0 2023-06-22 12:29:44,731 INFO [train.py:996] (0/4) Epoch 7, batch 21650, loss[loss=0.2164, simple_loss=0.3181, pruned_loss=0.0573, over 21799.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2979, pruned_loss=0.07826, over 4258344.67 frames. ], batch size: 316, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:30:48,057 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.272e+02 4.072e+02 5.244e+02 1.561e+03, threshold=8.145e+02, percent-clipped=7.0 2023-06-22 12:31:07,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-22 12:31:22,624 INFO [train.py:996] (0/4) Epoch 7, batch 21700, loss[loss=0.2129, simple_loss=0.3077, pruned_loss=0.05901, over 21773.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2989, pruned_loss=0.07612, over 4258319.97 frames. ], batch size: 298, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:31:44,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1228062.0, ans=0.0 2023-06-22 12:32:07,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.88 vs. limit=15.0 2023-06-22 12:32:55,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1228242.0, ans=0.125 2023-06-22 12:33:01,444 INFO [train.py:996] (0/4) Epoch 7, batch 21750, loss[loss=0.2225, simple_loss=0.2741, pruned_loss=0.08551, over 21297.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2945, pruned_loss=0.07661, over 4246266.70 frames. 
], batch size: 160, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:33:43,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1228422.0, ans=0.0 2023-06-22 12:33:44,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1228422.0, ans=0.0 2023-06-22 12:33:56,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1228422.0, ans=0.125 2023-06-22 12:34:00,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.122e+02 3.637e+02 4.917e+02 1.048e+03, threshold=7.274e+02, percent-clipped=3.0 2023-06-22 12:34:12,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1228482.0, ans=0.1 2023-06-22 12:34:22,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.19 vs. limit=10.0 2023-06-22 12:34:40,498 INFO [train.py:996] (0/4) Epoch 7, batch 21800, loss[loss=0.2044, simple_loss=0.3146, pruned_loss=0.04713, over 20805.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2965, pruned_loss=0.07844, over 4248885.40 frames. ], batch size: 607, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:35:17,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1228662.0, ans=0.1 2023-06-22 12:35:23,541 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:36:01,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1228782.0, ans=0.0 2023-06-22 12:36:03,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1228842.0, ans=0.2 2023-06-22 12:36:06,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1228842.0, ans=0.1 2023-06-22 12:36:17,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1228842.0, ans=0.125 2023-06-22 12:36:20,681 INFO [train.py:996] (0/4) Epoch 7, batch 21850, loss[loss=0.2233, simple_loss=0.2961, pruned_loss=0.07521, over 21899.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3004, pruned_loss=0.07875, over 4238748.53 frames. ], batch size: 316, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:36:48,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1228962.0, ans=0.2 2023-06-22 12:37:00,268 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-22 12:37:23,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1229082.0, ans=0.125 2023-06-22 12:37:24,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.283e+02 3.846e+02 4.671e+02 1.030e+03, threshold=7.692e+02, percent-clipped=3.0 2023-06-22 12:38:00,783 INFO [train.py:996] (0/4) Epoch 7, batch 21900, loss[loss=0.205, simple_loss=0.2756, pruned_loss=0.06717, over 21660.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3013, pruned_loss=0.07994, over 4246132.71 frames. 
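The tot_loss[... over N frames] figures behave like running, frame-weighted averages over a decaying window: the frame count hovers around 4.2-4.3M rather than growing without bound. One way to maintain such an average is sketched below; the decay factor is an assumption, and this is not the bookkeeping used by train.py.

class FrameWeightedRunningLoss:
    """Decayed, frame-weighted running loss (the decay value is an assumption)."""

    def __init__(self, decay: float = 0.995):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss_sum: float, batch_frames: float) -> None:
        self.loss_sum = self.loss_sum * self.decay + batch_loss_sum
        self.frames = self.frames * self.decay + batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)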
], batch size: 263, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:38:38,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1229262.0, ans=0.1 2023-06-22 12:38:48,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1229322.0, ans=0.0 2023-06-22 12:39:14,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1229382.0, ans=0.125 2023-06-22 12:39:31,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1229442.0, ans=0.2 2023-06-22 12:39:44,773 INFO [train.py:996] (0/4) Epoch 7, batch 21950, loss[loss=0.1839, simple_loss=0.2705, pruned_loss=0.04861, over 21535.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2972, pruned_loss=0.0793, over 4241064.49 frames. ], batch size: 441, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:39:50,666 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-06-22 12:40:16,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1229562.0, ans=0.0 2023-06-22 12:40:21,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1229622.0, ans=10.0 2023-06-22 12:40:23,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1229622.0, ans=0.125 2023-06-22 12:40:28,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1229622.0, ans=0.1 2023-06-22 12:40:28,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1229622.0, ans=0.125 2023-06-22 12:40:48,662 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.971e+02 3.645e+02 4.413e+02 9.727e+02, threshold=7.291e+02, percent-clipped=1.0 2023-06-22 12:41:10,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1229742.0, ans=0.125 2023-06-22 12:41:20,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1229742.0, ans=0.1 2023-06-22 12:41:24,803 INFO [train.py:996] (0/4) Epoch 7, batch 22000, loss[loss=0.1636, simple_loss=0.2406, pruned_loss=0.04334, over 21498.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2912, pruned_loss=0.07671, over 4238646.17 frames. ], batch size: 195, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:42:22,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1229922.0, ans=0.0 2023-06-22 12:43:11,557 INFO [train.py:996] (0/4) Epoch 7, batch 22050, loss[loss=0.199, simple_loss=0.2714, pruned_loss=0.06333, over 21370.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2978, pruned_loss=0.07884, over 4232631.86 frames. 
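The "Whitening: name=..., metric=X vs. limit=Y" entries compare a per-module statistic against a scheduled limit; the statistic is small when the module's output covariance is close to isotropic within each channel group and grows as the covariance becomes lopsided. One metric with that behaviour (1.0 for perfectly white features) is the mean squared eigenvalue of the covariance divided by the squared mean eigenvalue; the exact formula in scaling.py may differ, so treat the following as an illustration only.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    # x: (num_frames, num_channels); channels are split into num_groups groups,
    # matching the num_groups/num_channels fields printed in the log.
    n, c = x.shape
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)
    cov = x.transpose(1, 2) @ x / n              # per-group covariance
    eigs = torch.linalg.eigvalsh(cov)            # non-negative eigenvalues
    return (eigs.pow(2).mean(dim=1) / eigs.mean(dim=1).pow(2)).mean()

print(whitening_metric(torch.randn(2000, 256)))  # close to 1.0 for white input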
], batch size: 194, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:43:24,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1230102.0, ans=0.07 2023-06-22 12:44:14,088 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 3.765e+02 5.011e+02 6.386e+02 1.691e+03, threshold=1.002e+03, percent-clipped=17.0 2023-06-22 12:44:31,374 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:44:52,493 INFO [train.py:996] (0/4) Epoch 7, batch 22100, loss[loss=0.2746, simple_loss=0.3417, pruned_loss=0.1037, over 21762.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3094, pruned_loss=0.08411, over 4236972.29 frames. ], batch size: 332, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:44:57,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1230402.0, ans=0.125 2023-06-22 12:45:25,636 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.27 vs. limit=10.0 2023-06-22 12:45:45,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1230522.0, ans=0.125 2023-06-22 12:45:49,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1230582.0, ans=0.1 2023-06-22 12:45:53,350 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:46:21,583 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:46:30,182 INFO [train.py:996] (0/4) Epoch 7, batch 22150, loss[loss=0.2536, simple_loss=0.3176, pruned_loss=0.09477, over 21774.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3133, pruned_loss=0.08584, over 4253206.00 frames. ], batch size: 441, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:47:03,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1230762.0, ans=0.125 2023-06-22 12:47:29,583 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 3.676e+02 4.235e+02 5.035e+02 1.205e+03, threshold=8.469e+02, percent-clipped=1.0 2023-06-22 12:47:58,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1230942.0, ans=0.125 2023-06-22 12:48:02,757 INFO [train.py:996] (0/4) Epoch 7, batch 22200, loss[loss=0.2434, simple_loss=0.3231, pruned_loss=0.08192, over 21362.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3156, pruned_loss=0.0866, over 4268107.98 frames. 
], batch size: 144, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:48:03,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1231002.0, ans=0.1 2023-06-22 12:48:22,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1231062.0, ans=0.125 2023-06-22 12:48:54,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1231122.0, ans=0.1 2023-06-22 12:49:01,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1231122.0, ans=0.125 2023-06-22 12:49:11,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-22 12:49:48,294 INFO [train.py:996] (0/4) Epoch 7, batch 22250, loss[loss=0.253, simple_loss=0.3377, pruned_loss=0.08415, over 21467.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3218, pruned_loss=0.08819, over 4269690.63 frames. ], batch size: 131, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:49:51,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1231302.0, ans=0.0 2023-06-22 12:49:59,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1231302.0, ans=0.1 2023-06-22 12:50:14,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1231362.0, ans=0.125 2023-06-22 12:50:46,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1231482.0, ans=0.0 2023-06-22 12:50:50,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.541e+02 3.545e+02 4.219e+02 5.859e+02 1.258e+03, threshold=8.437e+02, percent-clipped=3.0 2023-06-22 12:51:29,309 INFO [train.py:996] (0/4) Epoch 7, batch 22300, loss[loss=0.2812, simple_loss=0.3334, pruned_loss=0.1145, over 21296.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3231, pruned_loss=0.08994, over 4275970.90 frames. ], batch size: 143, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 12:52:04,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1231662.0, ans=0.125 2023-06-22 12:52:54,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1231842.0, ans=0.0 2023-06-22 12:53:08,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5 2023-06-22 12:53:10,199 INFO [train.py:996] (0/4) Epoch 7, batch 22350, loss[loss=0.2685, simple_loss=0.326, pruned_loss=0.1055, over 21762.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3219, pruned_loss=0.0909, over 4284790.06 frames. 
], batch size: 112, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 12:53:51,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1232022.0, ans=0.125 2023-06-22 12:54:17,421 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.296e+02 3.742e+02 4.441e+02 8.144e+02, threshold=7.483e+02, percent-clipped=0.0 2023-06-22 12:54:27,274 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:54:35,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1232142.0, ans=0.125 2023-06-22 12:54:43,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1232142.0, ans=0.05 2023-06-22 12:54:45,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1232142.0, ans=0.125 2023-06-22 12:54:50,724 INFO [train.py:996] (0/4) Epoch 7, batch 22400, loss[loss=0.2168, simple_loss=0.2876, pruned_loss=0.07303, over 21892.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.32, pruned_loss=0.08826, over 4282409.52 frames. ], batch size: 107, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 12:54:54,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1232202.0, ans=0.1 2023-06-22 12:55:33,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1232322.0, ans=0.0 2023-06-22 12:56:03,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1232382.0, ans=0.125 2023-06-22 12:56:05,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1232382.0, ans=0.2 2023-06-22 12:56:16,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1232442.0, ans=0.125 2023-06-22 12:56:18,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1232442.0, ans=0.07 2023-06-22 12:56:30,532 INFO [train.py:996] (0/4) Epoch 7, batch 22450, loss[loss=0.2087, simple_loss=0.2652, pruned_loss=0.07613, over 21497.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3123, pruned_loss=0.08647, over 4272615.48 frames. ], batch size: 441, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 12:57:34,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.451e+02 3.003e+02 3.355e+02 3.883e+02 5.692e+02, threshold=6.709e+02, percent-clipped=0.0 2023-06-22 12:58:16,681 INFO [train.py:996] (0/4) Epoch 7, batch 22500, loss[loss=0.2406, simple_loss=0.2945, pruned_loss=0.09335, over 21521.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3063, pruned_loss=0.08556, over 4270666.79 frames. 
], batch size: 391, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 12:59:12,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1232982.0, ans=0.0 2023-06-22 12:59:33,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1233042.0, ans=0.04949747468305833 2023-06-22 13:00:01,054 INFO [train.py:996] (0/4) Epoch 7, batch 22550, loss[loss=0.2532, simple_loss=0.3223, pruned_loss=0.09204, over 21822.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.312, pruned_loss=0.08644, over 4277357.32 frames. ], batch size: 298, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:00:37,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1233162.0, ans=0.125 2023-06-22 13:00:39,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1233222.0, ans=0.125 2023-06-22 13:01:06,760 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.865e+02 3.475e+02 4.180e+02 5.606e+02 1.235e+03, threshold=8.360e+02, percent-clipped=11.0 2023-06-22 13:01:07,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1233282.0, ans=0.125 2023-06-22 13:01:21,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1233282.0, ans=0.125 2023-06-22 13:01:45,212 INFO [train.py:996] (0/4) Epoch 7, batch 22600, loss[loss=0.2274, simple_loss=0.2843, pruned_loss=0.08532, over 20404.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3135, pruned_loss=0.08658, over 4271621.06 frames. ], batch size: 703, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:01:57,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1233402.0, ans=0.07 2023-06-22 13:03:19,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1233642.0, ans=0.1 2023-06-22 13:03:25,491 INFO [train.py:996] (0/4) Epoch 7, batch 22650, loss[loss=0.249, simple_loss=0.301, pruned_loss=0.09844, over 21668.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3101, pruned_loss=0.08642, over 4272775.80 frames. ], batch size: 333, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:04:20,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1233822.0, ans=0.125 2023-06-22 13:04:32,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.918e+02 3.837e+02 4.775e+02 6.238e+02 8.753e+02, threshold=9.549e+02, percent-clipped=4.0 2023-06-22 13:04:46,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1233942.0, ans=0.0 2023-06-22 13:05:04,946 INFO [train.py:996] (0/4) Epoch 7, batch 22700, loss[loss=0.1991, simple_loss=0.2641, pruned_loss=0.0671, over 21723.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3049, pruned_loss=0.08536, over 4253378.47 frames. 
], batch size: 124, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:05:14,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1234002.0, ans=0.125 2023-06-22 13:05:37,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1234062.0, ans=0.125 2023-06-22 13:05:57,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-06-22 13:06:14,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1234182.0, ans=0.125 2023-06-22 13:06:46,582 INFO [train.py:996] (0/4) Epoch 7, batch 22750, loss[loss=0.2779, simple_loss=0.3414, pruned_loss=0.1072, over 21595.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.307, pruned_loss=0.08658, over 4263268.11 frames. ], batch size: 414, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:07:53,257 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.685e+02 3.471e+02 4.141e+02 5.452e+02 1.173e+03, threshold=8.282e+02, percent-clipped=2.0 2023-06-22 13:08:25,196 INFO [train.py:996] (0/4) Epoch 7, batch 22800, loss[loss=0.2143, simple_loss=0.2817, pruned_loss=0.07348, over 21508.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3099, pruned_loss=0.08853, over 4275611.14 frames. ], batch size: 548, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 13:08:56,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1234662.0, ans=0.125 2023-06-22 13:09:05,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1234722.0, ans=0.125 2023-06-22 13:09:16,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1234722.0, ans=0.1 2023-06-22 13:09:50,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1234842.0, ans=0.0 2023-06-22 13:10:01,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1234842.0, ans=0.125 2023-06-22 13:10:04,272 INFO [train.py:996] (0/4) Epoch 7, batch 22850, loss[loss=0.2225, simple_loss=0.2952, pruned_loss=0.07489, over 19964.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.306, pruned_loss=0.08782, over 4282869.51 frames. ], batch size: 704, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 13:10:31,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1234962.0, ans=0.125 2023-06-22 13:10:43,245 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-22 13:10:54,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1235022.0, ans=0.0 2023-06-22 13:11:02,865 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.78 vs. 
limit=22.5 2023-06-22 13:11:09,855 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.451e+02 3.461e+02 4.069e+02 5.005e+02 9.619e+02, threshold=8.139e+02, percent-clipped=3.0 2023-06-22 13:11:43,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1235202.0, ans=0.125 2023-06-22 13:11:45,245 INFO [train.py:996] (0/4) Epoch 7, batch 22900, loss[loss=0.2361, simple_loss=0.3104, pruned_loss=0.08088, over 21196.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3079, pruned_loss=0.08708, over 4282993.42 frames. ], batch size: 159, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:12:11,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1235262.0, ans=0.1 2023-06-22 13:12:33,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1235322.0, ans=0.125 2023-06-22 13:13:02,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1235382.0, ans=0.125 2023-06-22 13:13:22,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1235442.0, ans=0.0 2023-06-22 13:13:32,263 INFO [train.py:996] (0/4) Epoch 7, batch 22950, loss[loss=0.2426, simple_loss=0.3699, pruned_loss=0.05764, over 21242.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3179, pruned_loss=0.08461, over 4278672.99 frames. ], batch size: 548, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:14:42,487 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.271e+02 4.396e+02 6.484e+02 1.017e+03, threshold=8.792e+02, percent-clipped=10.0 2023-06-22 13:15:11,733 INFO [train.py:996] (0/4) Epoch 7, batch 23000, loss[loss=0.2642, simple_loss=0.324, pruned_loss=0.1022, over 21565.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3194, pruned_loss=0.08369, over 4277696.88 frames. ], batch size: 548, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:15:12,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1235802.0, ans=0.0 2023-06-22 13:15:36,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1235862.0, ans=0.0 2023-06-22 13:15:41,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1235862.0, ans=0.04949747468305833 2023-06-22 13:16:22,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1235982.0, ans=0.125 2023-06-22 13:16:41,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1236042.0, ans=0.0 2023-06-22 13:16:52,150 INFO [train.py:996] (0/4) Epoch 7, batch 23050, loss[loss=0.3132, simple_loss=0.3668, pruned_loss=0.1298, over 21357.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3221, pruned_loss=0.08618, over 4284852.07 frames. 
], batch size: 507, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:17:06,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1236102.0, ans=0.5 2023-06-22 13:17:10,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1236102.0, ans=0.2 2023-06-22 13:17:24,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1236162.0, ans=0.125 2023-06-22 13:17:26,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1236162.0, ans=0.1 2023-06-22 13:18:03,106 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.533e+02 3.591e+02 4.450e+02 5.560e+02 1.062e+03, threshold=8.900e+02, percent-clipped=1.0 2023-06-22 13:18:08,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1236282.0, ans=0.125 2023-06-22 13:18:08,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1236282.0, ans=0.0 2023-06-22 13:18:29,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236342.0, ans=0.1 2023-06-22 13:18:33,468 INFO [train.py:996] (0/4) Epoch 7, batch 23100, loss[loss=0.2211, simple_loss=0.2723, pruned_loss=0.08493, over 21205.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3181, pruned_loss=0.08617, over 4267632.63 frames. ], batch size: 176, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:18:54,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1236462.0, ans=0.035 2023-06-22 13:18:54,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1236462.0, ans=0.0 2023-06-22 13:19:56,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.89 vs. limit=10.0 2023-06-22 13:20:07,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1236642.0, ans=0.125 2023-06-22 13:20:11,900 INFO [train.py:996] (0/4) Epoch 7, batch 23150, loss[loss=0.259, simple_loss=0.3131, pruned_loss=0.1025, over 21754.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3126, pruned_loss=0.0853, over 4265828.96 frames. ], batch size: 441, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:21:11,214 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-22 13:21:21,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.669e+02 3.574e+02 4.225e+02 5.615e+02 9.377e+02, threshold=8.449e+02, percent-clipped=1.0 2023-06-22 13:21:23,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1236882.0, ans=0.0 2023-06-22 13:21:50,755 INFO [train.py:996] (0/4) Epoch 7, batch 23200, loss[loss=0.2531, simple_loss=0.3096, pruned_loss=0.09828, over 21488.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.312, pruned_loss=0.08634, over 4271238.61 frames. 
], batch size: 194, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:21:54,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1237002.0, ans=0.0 2023-06-22 13:22:10,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1237062.0, ans=0.0 2023-06-22 13:22:18,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=12.0 2023-06-22 13:22:40,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1237122.0, ans=0.2 2023-06-22 13:22:53,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1237182.0, ans=0.125 2023-06-22 13:22:58,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1237182.0, ans=0.125 2023-06-22 13:23:07,173 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-22 13:23:30,198 INFO [train.py:996] (0/4) Epoch 7, batch 23250, loss[loss=0.2263, simple_loss=0.2992, pruned_loss=0.07666, over 21511.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.312, pruned_loss=0.08782, over 4285806.25 frames. ], batch size: 131, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:23:30,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1237302.0, ans=0.0 2023-06-22 13:23:31,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=22.5 2023-06-22 13:24:46,823 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 3.587e+02 4.602e+02 6.286e+02 1.178e+03, threshold=9.205e+02, percent-clipped=7.0 2023-06-22 13:25:16,771 INFO [train.py:996] (0/4) Epoch 7, batch 23300, loss[loss=0.2621, simple_loss=0.3741, pruned_loss=0.07504, over 21815.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3194, pruned_loss=0.08922, over 4281949.01 frames. ], batch size: 316, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:25:20,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1237602.0, ans=0.0 2023-06-22 13:25:28,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1237602.0, ans=0.0 2023-06-22 13:26:13,246 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-22 13:26:27,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1237782.0, ans=0.2 2023-06-22 13:26:45,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1237842.0, ans=0.2 2023-06-22 13:26:56,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1237902.0, ans=0.125 2023-06-22 13:26:57,957 INFO [train.py:996] (0/4) Epoch 7, batch 23350, loss[loss=0.2843, simple_loss=0.3586, pruned_loss=0.105, over 21711.00 frames. 
], tot_loss[loss=0.2489, simple_loss=0.3219, pruned_loss=0.088, over 4283087.43 frames. ], batch size: 332, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:27:20,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1237902.0, ans=0.1 2023-06-22 13:27:48,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1238022.0, ans=0.0 2023-06-22 13:28:08,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1238082.0, ans=0.2 2023-06-22 13:28:08,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1238082.0, ans=0.1 2023-06-22 13:28:09,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.411e+02 4.317e+02 5.452e+02 1.291e+03, threshold=8.634e+02, percent-clipped=4.0 2023-06-22 13:28:15,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-22 13:28:37,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1238202.0, ans=0.1 2023-06-22 13:28:38,124 INFO [train.py:996] (0/4) Epoch 7, batch 23400, loss[loss=0.2321, simple_loss=0.3033, pruned_loss=0.0804, over 21648.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3129, pruned_loss=0.08327, over 4283546.21 frames. ], batch size: 263, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:29:39,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1238322.0, ans=0.0 2023-06-22 13:29:44,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1238382.0, ans=0.125 2023-06-22 13:29:55,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1238382.0, ans=0.125 2023-06-22 13:30:24,438 INFO [train.py:996] (0/4) Epoch 7, batch 23450, loss[loss=0.214, simple_loss=0.31, pruned_loss=0.05902, over 20758.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3148, pruned_loss=0.08509, over 4280752.88 frames. ], batch size: 607, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:30:59,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1238622.0, ans=0.2 2023-06-22 13:31:34,397 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.532e+02 3.929e+02 4.982e+02 6.736e+02 9.588e+02, threshold=9.965e+02, percent-clipped=2.0 2023-06-22 13:32:07,451 INFO [train.py:996] (0/4) Epoch 7, batch 23500, loss[loss=0.2467, simple_loss=0.3132, pruned_loss=0.0901, over 21882.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3149, pruned_loss=0.08651, over 4291075.91 frames. 
], batch size: 351, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:32:38,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1238922.0, ans=0.1 2023-06-22 13:33:04,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1238982.0, ans=0.5 2023-06-22 13:33:27,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1239042.0, ans=0.5 2023-06-22 13:33:36,597 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-22 13:33:48,044 INFO [train.py:996] (0/4) Epoch 7, batch 23550, loss[loss=0.2263, simple_loss=0.285, pruned_loss=0.08377, over 21890.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3104, pruned_loss=0.08607, over 4284910.08 frames. ], batch size: 373, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:33:53,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-22 13:33:56,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1239102.0, ans=0.125 2023-06-22 13:34:08,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1239162.0, ans=0.125 2023-06-22 13:34:51,898 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-22 13:34:55,622 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.824e+02 3.361e+02 3.874e+02 4.874e+02 9.234e+02, threshold=7.748e+02, percent-clipped=0.0 2023-06-22 13:35:16,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1239342.0, ans=0.07 2023-06-22 13:35:29,855 INFO [train.py:996] (0/4) Epoch 7, batch 23600, loss[loss=0.2582, simple_loss=0.3318, pruned_loss=0.09228, over 21942.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3114, pruned_loss=0.08718, over 4276499.04 frames. ], batch size: 372, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:35:39,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1239402.0, ans=0.125 2023-06-22 13:36:23,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1239522.0, ans=0.125 2023-06-22 13:36:31,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-22 13:37:12,165 INFO [train.py:996] (0/4) Epoch 7, batch 23650, loss[loss=0.2351, simple_loss=0.3125, pruned_loss=0.07883, over 21445.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3118, pruned_loss=0.0852, over 4277413.46 frames. 
], batch size: 194, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:37:57,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1239822.0, ans=0.2 2023-06-22 13:38:29,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.732e+02 3.692e+02 5.064e+02 6.588e+02 1.428e+03, threshold=1.013e+03, percent-clipped=16.0 2023-06-22 13:38:41,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1239942.0, ans=0.125 2023-06-22 13:38:53,686 INFO [train.py:996] (0/4) Epoch 7, batch 23700, loss[loss=0.224, simple_loss=0.2985, pruned_loss=0.07473, over 21418.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3134, pruned_loss=0.0846, over 4272994.64 frames. ], batch size: 194, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:39:20,489 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=6.0 2023-06-22 13:39:26,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1240062.0, ans=0.2 2023-06-22 13:40:28,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1240242.0, ans=0.07 2023-06-22 13:40:40,729 INFO [train.py:996] (0/4) Epoch 7, batch 23750, loss[loss=0.1646, simple_loss=0.2536, pruned_loss=0.0378, over 21683.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.315, pruned_loss=0.08493, over 4272319.85 frames. ], batch size: 230, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:40:48,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.30 vs. limit=10.0 2023-06-22 13:40:59,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1240362.0, ans=0.95 2023-06-22 13:41:47,997 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.292e+02 4.228e+02 5.463e+02 1.067e+03, threshold=8.456e+02, percent-clipped=1.0 2023-06-22 13:42:23,037 INFO [train.py:996] (0/4) Epoch 7, batch 23800, loss[loss=0.2771, simple_loss=0.3727, pruned_loss=0.09076, over 21772.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3159, pruned_loss=0.08341, over 4269691.87 frames. ], batch size: 371, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:42:54,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1240662.0, ans=0.2 2023-06-22 13:43:02,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1240662.0, ans=0.0 2023-06-22 13:43:17,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1240722.0, ans=0.125 2023-06-22 13:44:04,763 INFO [train.py:996] (0/4) Epoch 7, batch 23850, loss[loss=0.2712, simple_loss=0.3365, pruned_loss=0.1029, over 21275.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3239, pruned_loss=0.08517, over 4269508.57 frames. 
], batch size: 159, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:44:18,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1240902.0, ans=0.0 2023-06-22 13:44:39,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1240962.0, ans=0.125 2023-06-22 13:45:03,039 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.47 vs. limit=10.0 2023-06-22 13:45:14,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-22 13:45:23,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.822e+02 3.704e+02 4.399e+02 5.519e+02 1.068e+03, threshold=8.797e+02, percent-clipped=5.0 2023-06-22 13:45:47,905 INFO [train.py:996] (0/4) Epoch 7, batch 23900, loss[loss=0.2341, simple_loss=0.3097, pruned_loss=0.07931, over 21624.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3303, pruned_loss=0.08747, over 4268623.41 frames. ], batch size: 298, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:46:13,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1241262.0, ans=0.125 2023-06-22 13:46:39,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1241322.0, ans=0.1 2023-06-22 13:46:49,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-22 13:47:22,333 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:47:27,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1241442.0, ans=0.02 2023-06-22 13:47:30,450 INFO [train.py:996] (0/4) Epoch 7, batch 23950, loss[loss=0.2374, simple_loss=0.2919, pruned_loss=0.09143, over 21849.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3253, pruned_loss=0.08753, over 4258321.99 frames. ], batch size: 107, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:47:45,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1241502.0, ans=0.125 2023-06-22 13:47:51,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1241502.0, ans=0.0 2023-06-22 13:47:54,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1241502.0, ans=0.0 2023-06-22 13:48:47,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.707e+02 3.506e+02 4.327e+02 5.535e+02 8.905e+02, threshold=8.653e+02, percent-clipped=1.0 2023-06-22 13:48:58,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1241742.0, ans=0.125 2023-06-22 13:49:17,857 INFO [train.py:996] (0/4) Epoch 7, batch 24000, loss[loss=0.2387, simple_loss=0.3165, pruned_loss=0.08048, over 21773.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3261, pruned_loss=0.08999, over 4260488.45 frames. 
], batch size: 247, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:49:17,858 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 13:49:27,498 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.9113, 3.2889, 1.7455, 1.9357], device='cuda:0') 2023-06-22 13:49:33,459 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2773, simple_loss=0.3696, pruned_loss=0.09254, over 1796401.00 frames. 2023-06-22 13:49:33,459 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-22 13:49:35,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1241802.0, ans=0.125 2023-06-22 13:50:23,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1241922.0, ans=0.125 2023-06-22 13:50:24,538 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-22 13:50:29,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1241922.0, ans=15.0 2023-06-22 13:50:38,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1241982.0, ans=0.125 2023-06-22 13:51:13,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1242042.0, ans=0.0 2023-06-22 13:51:16,768 INFO [train.py:996] (0/4) Epoch 7, batch 24050, loss[loss=0.2416, simple_loss=0.3142, pruned_loss=0.08455, over 20042.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3274, pruned_loss=0.09048, over 4264352.80 frames. ], batch size: 703, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:51:26,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1242102.0, ans=0.125 2023-06-22 13:51:55,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1242162.0, ans=0.1 2023-06-22 13:52:13,701 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0 2023-06-22 13:52:17,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1242282.0, ans=0.1 2023-06-22 13:52:19,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1242282.0, ans=0.125 2023-06-22 13:52:29,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1242282.0, ans=0.125 2023-06-22 13:52:35,702 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.538e+02 3.768e+02 5.092e+02 6.295e+02 1.003e+03, threshold=1.018e+03, percent-clipped=2.0 2023-06-22 13:52:58,074 INFO [train.py:996] (0/4) Epoch 7, batch 24100, loss[loss=0.3118, simple_loss=0.3905, pruned_loss=0.1166, over 21742.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3289, pruned_loss=0.09, over 4263912.69 frames. 
], batch size: 441, lr: 4.22e-03, grad_scale: 8.0 2023-06-22 13:54:06,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1242582.0, ans=0.1 2023-06-22 13:54:16,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1242582.0, ans=0.0 2023-06-22 13:54:38,983 INFO [train.py:996] (0/4) Epoch 7, batch 24150, loss[loss=0.277, simple_loss=0.3349, pruned_loss=0.1095, over 21856.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3288, pruned_loss=0.09149, over 4273304.36 frames. ], batch size: 371, lr: 4.22e-03, grad_scale: 8.0 2023-06-22 13:54:44,880 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=22.5 2023-06-22 13:55:07,044 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.23 vs. limit=12.0 2023-06-22 13:55:16,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1242762.0, ans=0.0 2023-06-22 13:55:52,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=15.0 2023-06-22 13:55:57,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1242882.0, ans=0.05 2023-06-22 13:55:57,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.873e+02 3.729e+02 4.537e+02 5.592e+02 8.815e+02, threshold=9.074e+02, percent-clipped=0.0 2023-06-22 13:56:07,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1242942.0, ans=0.1 2023-06-22 13:56:11,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1242942.0, ans=0.125 2023-06-22 13:56:18,608 INFO [train.py:996] (0/4) Epoch 7, batch 24200, loss[loss=0.2602, simple_loss=0.3452, pruned_loss=0.08758, over 21714.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3317, pruned_loss=0.09302, over 4273516.03 frames. ], batch size: 351, lr: 4.22e-03, grad_scale: 8.0 2023-06-22 13:56:53,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1243062.0, ans=0.125 2023-06-22 13:57:32,226 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-06-22 13:57:37,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1243182.0, ans=0.125 2023-06-22 13:57:41,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1243242.0, ans=0.04949747468305833 2023-06-22 13:58:03,647 INFO [train.py:996] (0/4) Epoch 7, batch 24250, loss[loss=0.1785, simple_loss=0.2705, pruned_loss=0.04328, over 21373.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3276, pruned_loss=0.08598, over 4275406.25 frames. 
], batch size: 194, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 13:59:01,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1243482.0, ans=0.125 2023-06-22 13:59:06,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1243482.0, ans=0.04949747468305833 2023-06-22 13:59:07,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 3.091e+02 3.731e+02 4.711e+02 7.099e+02, threshold=7.462e+02, percent-clipped=0.0 2023-06-22 13:59:34,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=22.5 2023-06-22 13:59:38,176 INFO [train.py:996] (0/4) Epoch 7, batch 24300, loss[loss=0.2072, simple_loss=0.3062, pruned_loss=0.05412, over 21233.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3186, pruned_loss=0.0796, over 4281320.80 frames. ], batch size: 548, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 13:59:44,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1243602.0, ans=0.5 2023-06-22 13:59:51,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1243602.0, ans=0.125 2023-06-22 14:00:01,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1243662.0, ans=0.1 2023-06-22 14:00:44,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1243782.0, ans=0.0 2023-06-22 14:01:12,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-22 14:01:17,197 INFO [train.py:996] (0/4) Epoch 7, batch 24350, loss[loss=0.2272, simple_loss=0.2949, pruned_loss=0.07977, over 21263.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3159, pruned_loss=0.08025, over 4290729.60 frames. ], batch size: 143, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:01:17,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1243902.0, ans=0.0 2023-06-22 14:02:31,000 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 3.282e+02 3.867e+02 5.054e+02 1.141e+03, threshold=7.734e+02, percent-clipped=5.0 2023-06-22 14:02:56,615 INFO [train.py:996] (0/4) Epoch 7, batch 24400, loss[loss=0.2522, simple_loss=0.3232, pruned_loss=0.09064, over 21282.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3198, pruned_loss=0.08383, over 4286964.04 frames. ], batch size: 159, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:03:42,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1244322.0, ans=0.125 2023-06-22 14:03:55,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1244382.0, ans=0.07 2023-06-22 14:04:15,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1244382.0, ans=0.09899494936611666 2023-06-22 14:04:28,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. 
limit=6.0 2023-06-22 14:04:28,882 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=12.0 2023-06-22 14:04:39,356 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:04:42,192 INFO [train.py:996] (0/4) Epoch 7, batch 24450, loss[loss=0.2883, simple_loss=0.3733, pruned_loss=0.1017, over 21766.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3218, pruned_loss=0.0848, over 4282609.41 frames. ], batch size: 351, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:05:36,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1244622.0, ans=0.0 2023-06-22 14:05:57,133 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.669e+02 3.418e+02 4.354e+02 5.626e+02 1.189e+03, threshold=8.709e+02, percent-clipped=8.0 2023-06-22 14:06:22,879 INFO [train.py:996] (0/4) Epoch 7, batch 24500, loss[loss=0.2739, simple_loss=0.3358, pruned_loss=0.106, over 21791.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3216, pruned_loss=0.08428, over 4288024.20 frames. ], batch size: 441, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:07:13,894 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:07:50,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1245042.0, ans=0.125 2023-06-22 14:07:56,100 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:08:05,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1245042.0, ans=0.125 2023-06-22 14:08:11,148 INFO [train.py:996] (0/4) Epoch 7, batch 24550, loss[loss=0.2972, simple_loss=0.3724, pruned_loss=0.111, over 21582.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3229, pruned_loss=0.08647, over 4284042.52 frames. ], batch size: 414, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:08:31,372 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-22 14:09:03,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1245222.0, ans=0.0 2023-06-22 14:09:05,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1245222.0, ans=0.125 2023-06-22 14:09:14,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1245282.0, ans=0.125 2023-06-22 14:09:25,439 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.637e+02 3.383e+02 4.024e+02 4.711e+02 8.418e+02, threshold=8.048e+02, percent-clipped=0.0 2023-06-22 14:09:44,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1245342.0, ans=15.0 2023-06-22 14:09:45,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1245342.0, ans=0.125 2023-06-22 14:09:51,583 INFO [train.py:996] (0/4) Epoch 7, batch 24600, loss[loss=0.2581, simple_loss=0.3201, pruned_loss=0.09804, over 21783.00 frames. 
], tot_loss[loss=0.2467, simple_loss=0.3192, pruned_loss=0.08716, over 4283538.63 frames. ], batch size: 352, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:10:42,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1245522.0, ans=0.2 2023-06-22 14:11:09,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1245642.0, ans=0.2 2023-06-22 14:11:32,128 INFO [train.py:996] (0/4) Epoch 7, batch 24650, loss[loss=0.2217, simple_loss=0.2786, pruned_loss=0.08236, over 21351.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3122, pruned_loss=0.08631, over 4282054.67 frames. ], batch size: 473, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:11:36,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1245702.0, ans=0.125 2023-06-22 14:11:45,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1245702.0, ans=0.125 2023-06-22 14:12:20,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1245822.0, ans=0.125 2023-06-22 14:12:30,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1245882.0, ans=22.5 2023-06-22 14:12:46,328 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.556e+02 4.332e+02 6.409e+02 1.192e+03, threshold=8.664e+02, percent-clipped=12.0 2023-06-22 14:13:12,584 INFO [train.py:996] (0/4) Epoch 7, batch 24700, loss[loss=0.22, simple_loss=0.2892, pruned_loss=0.07535, over 16987.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3094, pruned_loss=0.08421, over 4267198.70 frames. ], batch size: 67, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:13:22,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1246002.0, ans=0.0 2023-06-22 14:13:45,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1246062.0, ans=0.125 2023-06-22 14:14:00,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1246122.0, ans=0.0 2023-06-22 14:14:08,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1246122.0, ans=0.1 2023-06-22 14:14:14,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1246182.0, ans=0.125 2023-06-22 14:14:27,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1246182.0, ans=0.0 2023-06-22 14:14:30,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1246242.0, ans=0.125 2023-06-22 14:14:32,047 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:14:33,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1246242.0, ans=0.125 2023-06-22 14:14:50,428 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.81 vs. 
limit=22.5 2023-06-22 14:14:54,489 INFO [train.py:996] (0/4) Epoch 7, batch 24750, loss[loss=0.2037, simple_loss=0.2654, pruned_loss=0.07104, over 21480.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.303, pruned_loss=0.08144, over 4275960.10 frames. ], batch size: 132, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:15:08,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1246362.0, ans=0.125 2023-06-22 14:16:00,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1246482.0, ans=0.125 2023-06-22 14:16:07,200 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.368e+02 3.441e+02 4.025e+02 5.655e+02 9.587e+02, threshold=8.050e+02, percent-clipped=3.0 2023-06-22 14:16:33,218 INFO [train.py:996] (0/4) Epoch 7, batch 24800, loss[loss=0.2323, simple_loss=0.3074, pruned_loss=0.07862, over 21869.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.2984, pruned_loss=0.08144, over 4276075.02 frames. ], batch size: 124, lr: 4.21e-03, grad_scale: 32.0 2023-06-22 14:16:35,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1246602.0, ans=0.0 2023-06-22 14:17:04,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1246662.0, ans=0.1 2023-06-22 14:17:08,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1246662.0, ans=0.0 2023-06-22 14:17:15,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-22 14:17:28,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1246722.0, ans=0.125 2023-06-22 14:18:14,891 INFO [train.py:996] (0/4) Epoch 7, batch 24850, loss[loss=0.2124, simple_loss=0.2879, pruned_loss=0.06847, over 21717.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.2994, pruned_loss=0.08246, over 4278659.12 frames. ], batch size: 247, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:18:43,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1246962.0, ans=0.0 2023-06-22 14:19:14,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1247082.0, ans=0.2 2023-06-22 14:19:21,658 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-22 14:19:30,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1247082.0, ans=0.1 2023-06-22 14:19:31,875 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.875e+02 3.674e+02 4.443e+02 5.678e+02 1.334e+03, threshold=8.887e+02, percent-clipped=8.0 2023-06-22 14:19:41,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. 
limit=15.0 2023-06-22 14:19:50,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1247142.0, ans=0.125 2023-06-22 14:19:53,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1247142.0, ans=0.125 2023-06-22 14:19:55,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1247202.0, ans=0.1 2023-06-22 14:19:56,857 INFO [train.py:996] (0/4) Epoch 7, batch 24900, loss[loss=0.2391, simple_loss=0.3137, pruned_loss=0.08225, over 21545.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3032, pruned_loss=0.0842, over 4275913.44 frames. ], batch size: 194, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:19:58,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1247202.0, ans=0.125 2023-06-22 14:19:59,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.77 vs. limit=10.0 2023-06-22 14:20:15,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-22 14:20:33,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1247262.0, ans=0.125 2023-06-22 14:21:27,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1247442.0, ans=0.125 2023-06-22 14:21:43,167 INFO [train.py:996] (0/4) Epoch 7, batch 24950, loss[loss=0.2996, simple_loss=0.3696, pruned_loss=0.1148, over 21590.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.31, pruned_loss=0.08783, over 4274923.47 frames. ], batch size: 389, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:21:58,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1247562.0, ans=0.125 2023-06-22 14:22:33,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1247622.0, ans=0.0 2023-06-22 14:22:38,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1247622.0, ans=0.0 2023-06-22 14:22:46,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1247622.0, ans=0.125 2023-06-22 14:22:57,893 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-22 14:23:02,253 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 3.624e+02 4.184e+02 5.539e+02 8.778e+02, threshold=8.368e+02, percent-clipped=0.0 2023-06-22 14:23:26,312 INFO [train.py:996] (0/4) Epoch 7, batch 25000, loss[loss=0.2146, simple_loss=0.284, pruned_loss=0.07261, over 21756.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3185, pruned_loss=0.0903, over 4276388.72 frames. 
], batch size: 112, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:23:52,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1247862.0, ans=0.125 2023-06-22 14:24:09,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-22 14:24:21,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1247922.0, ans=0.125 2023-06-22 14:24:22,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1247922.0, ans=0.0 2023-06-22 14:24:36,021 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-208000.pt 2023-06-22 14:24:48,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1248042.0, ans=0.125 2023-06-22 14:25:10,078 INFO [train.py:996] (0/4) Epoch 7, batch 25050, loss[loss=0.2337, simple_loss=0.285, pruned_loss=0.09121, over 21836.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3115, pruned_loss=0.08843, over 4271507.30 frames. ], batch size: 373, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:25:44,687 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:25:51,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1248222.0, ans=0.125 2023-06-22 14:25:54,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1248222.0, ans=0.125 2023-06-22 14:26:28,136 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.523e+02 3.310e+02 3.900e+02 4.690e+02 8.099e+02, threshold=7.799e+02, percent-clipped=0.0 2023-06-22 14:26:48,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=22.5 2023-06-22 14:26:48,873 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-22 14:26:51,217 INFO [train.py:996] (0/4) Epoch 7, batch 25100, loss[loss=0.2011, simple_loss=0.2752, pruned_loss=0.06347, over 21222.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3054, pruned_loss=0.08716, over 4272961.59 frames. ], batch size: 548, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:28:08,736 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-22 14:28:14,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1248642.0, ans=0.125 2023-06-22 14:28:14,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-22 14:28:25,025 INFO [train.py:996] (0/4) Epoch 7, batch 25150, loss[loss=0.2214, simple_loss=0.2914, pruned_loss=0.0757, over 17851.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3082, pruned_loss=0.08442, over 4272543.22 frames. 
], batch size: 69, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:29:48,294 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.302e+02 3.375e+02 4.359e+02 5.711e+02 9.666e+02, threshold=8.717e+02, percent-clipped=5.0 2023-06-22 14:30:06,549 INFO [train.py:996] (0/4) Epoch 7, batch 25200, loss[loss=0.255, simple_loss=0.3596, pruned_loss=0.0752, over 20859.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3072, pruned_loss=0.08188, over 4263338.28 frames. ], batch size: 608, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:30:14,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=15.0 2023-06-22 14:30:56,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1249122.0, ans=0.125 2023-06-22 14:30:59,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1249122.0, ans=0.0 2023-06-22 14:31:02,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1249122.0, ans=0.2 2023-06-22 14:31:45,965 INFO [train.py:996] (0/4) Epoch 7, batch 25250, loss[loss=0.2177, simple_loss=0.2968, pruned_loss=0.06931, over 21640.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3058, pruned_loss=0.08029, over 4274468.49 frames. ], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:31:55,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1249302.0, ans=0.1 2023-06-22 14:32:00,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1249302.0, ans=0.0 2023-06-22 14:32:26,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-22 14:33:04,022 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.324e+02 3.268e+02 4.285e+02 6.443e+02 1.274e+03, threshold=8.570e+02, percent-clipped=9.0 2023-06-22 14:33:32,060 INFO [train.py:996] (0/4) Epoch 7, batch 25300, loss[loss=0.2482, simple_loss=0.325, pruned_loss=0.08564, over 21781.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3017, pruned_loss=0.07909, over 4262851.11 frames. ], batch size: 332, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:33:40,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1249602.0, ans=0.2 2023-06-22 14:33:55,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1249662.0, ans=0.015 2023-06-22 14:34:02,197 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-22 14:34:11,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1249662.0, ans=0.125 2023-06-22 14:34:18,738 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.25 vs. 
limit=15.0 2023-06-22 14:34:42,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1249782.0, ans=0.125 2023-06-22 14:35:12,509 INFO [train.py:996] (0/4) Epoch 7, batch 25350, loss[loss=0.1755, simple_loss=0.2381, pruned_loss=0.05641, over 16784.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3028, pruned_loss=0.07829, over 4248654.67 frames. ], batch size: 61, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:35:13,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1249902.0, ans=0.0 2023-06-22 14:35:38,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1249962.0, ans=0.0 2023-06-22 14:36:25,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 3.273e+02 3.948e+02 5.188e+02 1.185e+03, threshold=7.895e+02, percent-clipped=3.0 2023-06-22 14:36:47,864 INFO [train.py:996] (0/4) Epoch 7, batch 25400, loss[loss=0.1881, simple_loss=0.2731, pruned_loss=0.05154, over 21619.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2995, pruned_loss=0.07747, over 4241905.81 frames. ], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:36:54,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1250202.0, ans=0.2 2023-06-22 14:37:08,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1250262.0, ans=0.125 2023-06-22 14:37:38,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1250322.0, ans=0.125 2023-06-22 14:37:48,318 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-06-22 14:38:01,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1250382.0, ans=0.125 2023-06-22 14:38:33,461 INFO [train.py:996] (0/4) Epoch 7, batch 25450, loss[loss=0.2257, simple_loss=0.3092, pruned_loss=0.07109, over 21320.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3007, pruned_loss=0.08012, over 4253502.96 frames. ], batch size: 159, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:39:40,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1250682.0, ans=0.125 2023-06-22 14:39:47,984 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.452e+02 3.153e+02 3.590e+02 4.840e+02 1.017e+03, threshold=7.180e+02, percent-clipped=5.0 2023-06-22 14:39:56,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1250742.0, ans=0.2 2023-06-22 14:40:00,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1250742.0, ans=0.1 2023-06-22 14:40:16,030 INFO [train.py:996] (0/4) Epoch 7, batch 25500, loss[loss=0.1882, simple_loss=0.2825, pruned_loss=0.04692, over 21632.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3006, pruned_loss=0.07675, over 4260606.20 frames. 
], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:40:57,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1250922.0, ans=0.2 2023-06-22 14:41:12,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1250922.0, ans=0.0 2023-06-22 14:41:41,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1251042.0, ans=0.125 2023-06-22 14:41:41,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1251042.0, ans=0.1 2023-06-22 14:41:57,292 INFO [train.py:996] (0/4) Epoch 7, batch 25550, loss[loss=0.2155, simple_loss=0.3078, pruned_loss=0.06155, over 21548.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3086, pruned_loss=0.07719, over 4264520.32 frames. ], batch size: 230, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:42:17,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-22 14:42:23,299 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.37 vs. limit=22.5 2023-06-22 14:43:16,634 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.498e+02 4.719e+02 6.187e+02 1.035e+03, threshold=9.438e+02, percent-clipped=13.0 2023-06-22 14:43:39,341 INFO [train.py:996] (0/4) Epoch 7, batch 25600, loss[loss=0.2368, simple_loss=0.3083, pruned_loss=0.08262, over 21640.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3136, pruned_loss=0.07889, over 4266450.01 frames. ], batch size: 230, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:43:39,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1251402.0, ans=0.95 2023-06-22 14:43:52,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1251402.0, ans=0.1 2023-06-22 14:45:17,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1251642.0, ans=0.125 2023-06-22 14:45:20,822 INFO [train.py:996] (0/4) Epoch 7, batch 25650, loss[loss=0.245, simple_loss=0.3117, pruned_loss=0.08913, over 21761.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3132, pruned_loss=0.08096, over 4265013.58 frames. ], batch size: 124, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:45:32,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1251702.0, ans=0.0 2023-06-22 14:45:34,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1251702.0, ans=0.0 2023-06-22 14:46:37,526 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.765e+02 3.700e+02 4.314e+02 5.225e+02 1.015e+03, threshold=8.627e+02, percent-clipped=1.0 2023-06-22 14:46:54,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1251942.0, ans=0.1 2023-06-22 14:47:05,291 INFO [train.py:996] (0/4) Epoch 7, batch 25700, loss[loss=0.2789, simple_loss=0.4103, pruned_loss=0.07372, over 19832.00 frames. 
], tot_loss[loss=0.2383, simple_loss=0.3109, pruned_loss=0.08281, over 4263504.29 frames. ], batch size: 702, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:48:31,788 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.52 vs. limit=15.0 2023-06-22 14:48:39,709 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:48:47,668 INFO [train.py:996] (0/4) Epoch 7, batch 25750, loss[loss=0.2202, simple_loss=0.2903, pruned_loss=0.07503, over 19949.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3165, pruned_loss=0.08601, over 4268424.01 frames. ], batch size: 702, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:48:49,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1252302.0, ans=10.0 2023-06-22 14:50:13,048 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 3.834e+02 4.976e+02 6.309e+02 1.046e+03, threshold=9.952e+02, percent-clipped=8.0 2023-06-22 14:50:31,435 INFO [train.py:996] (0/4) Epoch 7, batch 25800, loss[loss=0.2772, simple_loss=0.3521, pruned_loss=0.1011, over 21378.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3272, pruned_loss=0.09, over 4268840.92 frames. ], batch size: 159, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:50:57,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1252662.0, ans=0.125 2023-06-22 14:51:00,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1252662.0, ans=10.0 2023-06-22 14:51:28,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1252722.0, ans=0.0 2023-06-22 14:52:13,825 INFO [train.py:996] (0/4) Epoch 7, batch 25850, loss[loss=0.2306, simple_loss=0.3041, pruned_loss=0.07858, over 21491.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3295, pruned_loss=0.08999, over 4275396.78 frames. ], batch size: 548, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:52:48,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=22.5 2023-06-22 14:53:10,117 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-22 14:53:11,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1253022.0, ans=0.0 2023-06-22 14:53:26,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1253082.0, ans=0.125 2023-06-22 14:53:33,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-22 14:53:39,303 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 3.379e+02 4.198e+02 5.392e+02 1.106e+03, threshold=8.396e+02, percent-clipped=2.0 2023-06-22 14:54:05,903 INFO [train.py:996] (0/4) Epoch 7, batch 25900, loss[loss=0.2653, simple_loss=0.3577, pruned_loss=0.08639, over 21612.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3312, pruned_loss=0.09086, over 4278929.76 frames. 
], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:54:23,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1253202.0, ans=0.125 2023-06-22 14:54:35,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1253262.0, ans=0.0 2023-06-22 14:55:19,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1253382.0, ans=0.125 2023-06-22 14:55:52,698 INFO [train.py:996] (0/4) Epoch 7, batch 25950, loss[loss=0.2548, simple_loss=0.3351, pruned_loss=0.08727, over 21850.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3372, pruned_loss=0.09363, over 4283181.78 frames. ], batch size: 118, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:56:16,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1253562.0, ans=0.1 2023-06-22 14:56:33,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1253622.0, ans=0.0 2023-06-22 14:57:11,659 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.576e+02 3.589e+02 4.463e+02 5.427e+02 9.904e+02, threshold=8.927e+02, percent-clipped=3.0 2023-06-22 14:57:37,692 INFO [train.py:996] (0/4) Epoch 7, batch 26000, loss[loss=0.3624, simple_loss=0.4176, pruned_loss=0.1536, over 21356.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3372, pruned_loss=0.09267, over 4283835.71 frames. ], batch size: 507, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:58:01,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1253862.0, ans=0.2 2023-06-22 14:58:04,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1253862.0, ans=0.125 2023-06-22 14:58:04,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1253862.0, ans=0.125 2023-06-22 14:58:13,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-22 14:58:21,487 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-22 14:59:18,565 INFO [train.py:996] (0/4) Epoch 7, batch 26050, loss[loss=0.28, simple_loss=0.3287, pruned_loss=0.1156, over 21675.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3385, pruned_loss=0.09468, over 4283887.59 frames. 
], batch size: 507, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:59:47,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1254162.0, ans=0.2 2023-06-22 15:00:06,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1254282.0, ans=0.125 2023-06-22 15:00:17,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1254282.0, ans=0.0 2023-06-22 15:00:37,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.714e+02 3.622e+02 4.374e+02 5.321e+02 8.486e+02, threshold=8.748e+02, percent-clipped=0.0 2023-06-22 15:00:56,284 INFO [train.py:996] (0/4) Epoch 7, batch 26100, loss[loss=0.2648, simple_loss=0.3219, pruned_loss=0.1039, over 21376.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3331, pruned_loss=0.09376, over 4282053.27 frames. ], batch size: 143, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 15:01:06,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1254402.0, ans=0.125 2023-06-22 15:01:14,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1254462.0, ans=0.0 2023-06-22 15:01:20,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1254462.0, ans=0.1 2023-06-22 15:01:50,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1254582.0, ans=0.0 2023-06-22 15:02:12,833 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-22 15:02:21,765 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-22 15:02:36,850 INFO [train.py:996] (0/4) Epoch 7, batch 26150, loss[loss=0.3007, simple_loss=0.357, pruned_loss=0.1222, over 21836.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3302, pruned_loss=0.09377, over 4290833.74 frames. ], batch size: 441, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 15:03:52,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1254882.0, ans=0.1 2023-06-22 15:04:03,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1254942.0, ans=0.0 2023-06-22 15:04:04,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.633e+02 3.200e+02 3.625e+02 4.544e+02 6.697e+02, threshold=7.250e+02, percent-clipped=0.0 2023-06-22 15:04:10,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1254942.0, ans=0.0 2023-06-22 15:04:19,703 INFO [train.py:996] (0/4) Epoch 7, batch 26200, loss[loss=0.2459, simple_loss=0.3388, pruned_loss=0.0765, over 21495.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3305, pruned_loss=0.09175, over 4285215.71 frames. 
], batch size: 211, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 15:05:05,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1255122.0, ans=0.125 2023-06-22 15:05:13,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1255122.0, ans=0.125 2023-06-22 15:05:21,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1255182.0, ans=0.0 2023-06-22 15:05:48,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1255242.0, ans=0.125 2023-06-22 15:05:59,788 INFO [train.py:996] (0/4) Epoch 7, batch 26250, loss[loss=0.2443, simple_loss=0.3159, pruned_loss=0.08639, over 21356.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3349, pruned_loss=0.0911, over 4284866.23 frames. ], batch size: 176, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:06:03,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1255302.0, ans=0.125 2023-06-22 15:06:05,580 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=22.5 2023-06-22 15:06:06,430 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:06:15,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1255362.0, ans=15.0 2023-06-22 15:06:21,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1255362.0, ans=0.125 2023-06-22 15:06:51,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1255422.0, ans=0.0 2023-06-22 15:06:52,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1255422.0, ans=10.0 2023-06-22 15:07:23,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1255542.0, ans=0.1 2023-06-22 15:07:24,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.543e+02 3.568e+02 4.427e+02 6.029e+02 1.421e+03, threshold=8.855e+02, percent-clipped=13.0 2023-06-22 15:07:38,973 INFO [train.py:996] (0/4) Epoch 7, batch 26300, loss[loss=0.2413, simple_loss=0.3117, pruned_loss=0.08551, over 21869.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3308, pruned_loss=0.09165, over 4295208.72 frames. 
], batch size: 107, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:07:46,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1255602.0, ans=0.2 2023-06-22 15:08:20,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1255722.0, ans=0.0 2023-06-22 15:08:41,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1255722.0, ans=0.125 2023-06-22 15:09:17,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1255842.0, ans=0.5 2023-06-22 15:09:18,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1255902.0, ans=0.1 2023-06-22 15:09:19,836 INFO [train.py:996] (0/4) Epoch 7, batch 26350, loss[loss=0.2637, simple_loss=0.3359, pruned_loss=0.09575, over 21389.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3286, pruned_loss=0.0917, over 4296312.09 frames. ], batch size: 159, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:09:28,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1255902.0, ans=0.125 2023-06-22 15:10:45,401 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.781e+02 3.544e+02 4.038e+02 5.377e+02 1.137e+03, threshold=8.075e+02, percent-clipped=4.0 2023-06-22 15:10:52,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1256142.0, ans=0.0 2023-06-22 15:11:00,708 INFO [train.py:996] (0/4) Epoch 7, batch 26400, loss[loss=0.2911, simple_loss=0.3208, pruned_loss=0.1307, over 21336.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3236, pruned_loss=0.09213, over 4294622.37 frames. ], batch size: 507, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:11:01,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1256202.0, ans=0.95 2023-06-22 15:11:03,551 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=15.0 2023-06-22 15:11:15,548 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-22 15:11:28,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1256262.0, ans=0.125 2023-06-22 15:11:37,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1256262.0, ans=0.0 2023-06-22 15:11:48,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.43 vs. limit=10.0 2023-06-22 15:12:26,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1256442.0, ans=0.125 2023-06-22 15:12:40,496 INFO [train.py:996] (0/4) Epoch 7, batch 26450, loss[loss=0.25, simple_loss=0.3382, pruned_loss=0.08089, over 21426.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3242, pruned_loss=0.09175, over 4289434.39 frames. 
], batch size: 211, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:13:23,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1256562.0, ans=0.025 2023-06-22 15:13:42,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1256622.0, ans=0.2 2023-06-22 15:14:08,659 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.783e+02 3.688e+02 4.621e+02 8.318e+02 2.033e+03, threshold=9.242e+02, percent-clipped=27.0 2023-06-22 15:14:24,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1256742.0, ans=0.0 2023-06-22 15:14:37,765 INFO [train.py:996] (0/4) Epoch 7, batch 26500, loss[loss=0.2773, simple_loss=0.356, pruned_loss=0.09934, over 21678.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3259, pruned_loss=0.09013, over 4282006.05 frames. ], batch size: 414, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:14:58,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1256862.0, ans=0.0 2023-06-22 15:15:05,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1256862.0, ans=0.0 2023-06-22 15:15:09,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1256862.0, ans=0.0 2023-06-22 15:15:37,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1256982.0, ans=0.07 2023-06-22 15:16:14,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1257042.0, ans=0.0 2023-06-22 15:16:27,022 INFO [train.py:996] (0/4) Epoch 7, batch 26550, loss[loss=0.2053, simple_loss=0.3139, pruned_loss=0.0483, over 21142.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3236, pruned_loss=0.08698, over 4275470.35 frames. ], batch size: 548, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:17:53,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1257342.0, ans=0.125 2023-06-22 15:17:55,916 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.702e+02 5.125e+02 7.307e+02 1.235e+03, threshold=1.025e+03, percent-clipped=15.0 2023-06-22 15:18:08,973 INFO [train.py:996] (0/4) Epoch 7, batch 26600, loss[loss=0.2526, simple_loss=0.3093, pruned_loss=0.09789, over 21846.00 frames. ], tot_loss[loss=0.245, simple_loss=0.322, pruned_loss=0.08396, over 4276281.26 frames. ], batch size: 107, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:18:11,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1257402.0, ans=0.125 2023-06-22 15:18:19,788 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.203e-03 2023-06-22 15:19:06,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1257522.0, ans=0.125 2023-06-22 15:19:09,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.47 vs. 
limit=15.0 2023-06-22 15:19:36,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1257642.0, ans=0.125 2023-06-22 15:19:46,595 INFO [train.py:996] (0/4) Epoch 7, batch 26650, loss[loss=0.197, simple_loss=0.2747, pruned_loss=0.05969, over 21709.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.315, pruned_loss=0.08277, over 4262789.47 frames. ], batch size: 298, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:19:55,543 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=15.0 2023-06-22 15:19:59,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1257702.0, ans=0.125 2023-06-22 15:19:59,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1257702.0, ans=0.125 2023-06-22 15:20:01,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1257762.0, ans=0.0 2023-06-22 15:20:12,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-22 15:20:23,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1257822.0, ans=0.1 2023-06-22 15:20:42,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-22 15:21:10,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2023-06-22 15:21:12,607 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.375e+02 4.059e+02 5.279e+02 9.271e+02, threshold=8.118e+02, percent-clipped=0.0 2023-06-22 15:21:19,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1257942.0, ans=0.1 2023-06-22 15:21:25,466 INFO [train.py:996] (0/4) Epoch 7, batch 26700, loss[loss=0.2061, simple_loss=0.2801, pruned_loss=0.06607, over 21842.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3072, pruned_loss=0.07912, over 4272820.33 frames. ], batch size: 282, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:22:03,287 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-22 15:22:58,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.08 vs. limit=15.0 2023-06-22 15:23:08,494 INFO [train.py:996] (0/4) Epoch 7, batch 26750, loss[loss=0.1988, simple_loss=0.2807, pruned_loss=0.05842, over 21637.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3069, pruned_loss=0.0777, over 4280859.63 frames. 
], batch size: 230, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:23:56,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1258422.0, ans=0.2 2023-06-22 15:24:06,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1258422.0, ans=0.0 2023-06-22 15:24:29,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1258482.0, ans=0.125 2023-06-22 15:24:37,081 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.345e+02 4.329e+02 5.688e+02 1.116e+03, threshold=8.657e+02, percent-clipped=5.0 2023-06-22 15:24:50,401 INFO [train.py:996] (0/4) Epoch 7, batch 26800, loss[loss=0.3033, simple_loss=0.3637, pruned_loss=0.1214, over 21461.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3151, pruned_loss=0.08285, over 4282112.35 frames. ], batch size: 510, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:26:37,064 INFO [train.py:996] (0/4) Epoch 7, batch 26850, loss[loss=0.2288, simple_loss=0.2865, pruned_loss=0.08557, over 20073.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3169, pruned_loss=0.08557, over 4276510.25 frames. ], batch size: 703, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:26:58,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1258962.0, ans=0.0 2023-06-22 15:27:00,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1258962.0, ans=0.125 2023-06-22 15:27:59,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.773e+02 3.440e+02 3.815e+02 4.750e+02 7.886e+02, threshold=7.630e+02, percent-clipped=0.0 2023-06-22 15:28:17,083 INFO [train.py:996] (0/4) Epoch 7, batch 26900, loss[loss=0.1984, simple_loss=0.2583, pruned_loss=0.06925, over 21600.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3082, pruned_loss=0.0845, over 4271612.87 frames. ], batch size: 298, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:28:19,484 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.99 vs. limit=15.0 2023-06-22 15:28:35,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1259202.0, ans=0.0 2023-06-22 15:29:19,482 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-06-22 15:29:26,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1259382.0, ans=0.125 2023-06-22 15:29:30,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1259382.0, ans=0.125 2023-06-22 15:29:52,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1259442.0, ans=0.0 2023-06-22 15:29:56,689 INFO [train.py:996] (0/4) Epoch 7, batch 26950, loss[loss=0.2709, simple_loss=0.3414, pruned_loss=0.1002, over 21495.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.307, pruned_loss=0.0844, over 4250666.99 frames. 
], batch size: 389, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:30:14,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1259502.0, ans=0.125 2023-06-22 15:30:51,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1259622.0, ans=0.025 2023-06-22 15:31:16,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5 2023-06-22 15:31:18,845 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.384e+02 3.421e+02 4.204e+02 5.403e+02 1.152e+03, threshold=8.409e+02, percent-clipped=7.0 2023-06-22 15:31:40,927 INFO [train.py:996] (0/4) Epoch 7, batch 27000, loss[loss=0.1926, simple_loss=0.2738, pruned_loss=0.05567, over 21374.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3084, pruned_loss=0.08291, over 4246998.06 frames. ], batch size: 211, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:31:40,928 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 15:31:52,737 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.5567, 3.0029, 1.4230, 1.5917], device='cuda:0') 2023-06-22 15:31:59,799 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2427, simple_loss=0.3424, pruned_loss=0.07152, over 1796401.00 frames. 2023-06-22 15:31:59,800 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-22 15:32:11,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1259802.0, ans=0.1 2023-06-22 15:32:20,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1259862.0, ans=0.0 2023-06-22 15:32:30,510 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.91 vs. limit=15.0 2023-06-22 15:32:32,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1259862.0, ans=0.0 2023-06-22 15:32:33,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=12.0 2023-06-22 15:32:34,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1259862.0, ans=0.125 2023-06-22 15:32:49,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1259922.0, ans=0.125 2023-06-22 15:33:08,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1259982.0, ans=0.125 2023-06-22 15:33:21,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1260042.0, ans=0.125 2023-06-22 15:33:38,759 INFO [train.py:996] (0/4) Epoch 7, batch 27050, loss[loss=0.2521, simple_loss=0.3168, pruned_loss=0.09368, over 21222.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3105, pruned_loss=0.07994, over 4247081.92 frames. 
], batch size: 143, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:33:48,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1260102.0, ans=0.2 2023-06-22 15:33:56,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1260102.0, ans=0.125 2023-06-22 15:35:07,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.078e+02 3.767e+02 4.545e+02 7.806e+02, threshold=7.533e+02, percent-clipped=0.0 2023-06-22 15:35:23,482 INFO [train.py:996] (0/4) Epoch 7, batch 27100, loss[loss=0.2254, simple_loss=0.2957, pruned_loss=0.07757, over 21614.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3125, pruned_loss=0.08079, over 4250122.55 frames. ], batch size: 263, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:35:35,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1260402.0, ans=0.125 2023-06-22 15:35:50,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1260462.0, ans=0.125 2023-06-22 15:36:32,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1260582.0, ans=0.125 2023-06-22 15:36:32,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1260582.0, ans=0.125 2023-06-22 15:37:02,889 INFO [train.py:996] (0/4) Epoch 7, batch 27150, loss[loss=0.2924, simple_loss=0.3796, pruned_loss=0.1026, over 21751.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.323, pruned_loss=0.08359, over 4264213.63 frames. ], batch size: 351, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:37:42,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1260822.0, ans=0.0 2023-06-22 15:38:09,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1260882.0, ans=0.125 2023-06-22 15:38:18,920 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=15.0 2023-06-22 15:38:31,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.782e+02 4.019e+02 5.172e+02 7.242e+02 1.500e+03, threshold=1.034e+03, percent-clipped=23.0 2023-06-22 15:38:32,195 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=15.0 2023-06-22 15:38:42,522 INFO [train.py:996] (0/4) Epoch 7, batch 27200, loss[loss=0.2483, simple_loss=0.3291, pruned_loss=0.08378, over 21611.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3303, pruned_loss=0.08617, over 4276488.95 frames. ], batch size: 263, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:39:53,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.10 vs. limit=12.0 2023-06-22 15:40:13,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1261242.0, ans=0.125 2023-06-22 15:40:23,231 INFO [train.py:996] (0/4) Epoch 7, batch 27250, loss[loss=0.2956, simple_loss=0.3491, pruned_loss=0.121, over 21800.00 frames. 
], tot_loss[loss=0.2568, simple_loss=0.3331, pruned_loss=0.09022, over 4278477.79 frames. ], batch size: 247, lr: 4.18e-03, grad_scale: 32.0 2023-06-22 15:41:19,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1261422.0, ans=0.025 2023-06-22 15:41:19,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1261422.0, ans=0.1 2023-06-22 15:41:36,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=22.5 2023-06-22 15:41:48,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1261542.0, ans=0.0 2023-06-22 15:41:54,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.033e+02 3.741e+02 4.375e+02 5.438e+02 9.965e+02, threshold=8.750e+02, percent-clipped=0.0 2023-06-22 15:42:08,909 INFO [train.py:996] (0/4) Epoch 7, batch 27300, loss[loss=0.2598, simple_loss=0.3355, pruned_loss=0.09208, over 21474.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3353, pruned_loss=0.09139, over 4277554.43 frames. ], batch size: 211, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:43:48,258 INFO [train.py:996] (0/4) Epoch 7, batch 27350, loss[loss=0.2682, simple_loss=0.3543, pruned_loss=0.09107, over 21629.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3384, pruned_loss=0.09282, over 4277212.22 frames. ], batch size: 414, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:44:40,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1262022.0, ans=0.2 2023-06-22 15:44:56,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1262082.0, ans=0.025 2023-06-22 15:45:15,959 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.698e+02 3.868e+02 4.861e+02 6.523e+02 1.170e+03, threshold=9.722e+02, percent-clipped=10.0 2023-06-22 15:45:25,602 INFO [train.py:996] (0/4) Epoch 7, batch 27400, loss[loss=0.2074, simple_loss=0.2795, pruned_loss=0.06767, over 21768.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3328, pruned_loss=0.0919, over 4281842.50 frames. ], batch size: 371, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:45:39,079 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-22 15:46:03,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1262262.0, ans=0.025 2023-06-22 15:46:04,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1262262.0, ans=0.0 2023-06-22 15:46:44,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1262442.0, ans=0.125 2023-06-22 15:47:09,279 INFO [train.py:996] (0/4) Epoch 7, batch 27450, loss[loss=0.2679, simple_loss=0.3465, pruned_loss=0.09465, over 21405.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3261, pruned_loss=0.08969, over 4283772.24 frames. 
], batch size: 194, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:47:09,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1262502.0, ans=0.125 2023-06-22 15:47:21,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5 2023-06-22 15:47:25,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1262562.0, ans=0.1 2023-06-22 15:47:53,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-22 15:48:28,401 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.635e+02 3.611e+02 4.155e+02 4.904e+02 8.641e+02, threshold=8.310e+02, percent-clipped=0.0 2023-06-22 15:48:29,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1262742.0, ans=0.1 2023-06-22 15:48:41,026 INFO [train.py:996] (0/4) Epoch 7, batch 27500, loss[loss=0.2384, simple_loss=0.2991, pruned_loss=0.08884, over 21183.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3239, pruned_loss=0.08944, over 4289033.73 frames. ], batch size: 608, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 15:49:21,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1262922.0, ans=0.0 2023-06-22 15:49:39,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=22.5 2023-06-22 15:50:10,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1263042.0, ans=0.0 2023-06-22 15:50:20,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1263042.0, ans=0.125 2023-06-22 15:50:23,458 INFO [train.py:996] (0/4) Epoch 7, batch 27550, loss[loss=0.2229, simple_loss=0.2961, pruned_loss=0.07479, over 21737.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3187, pruned_loss=0.08687, over 4287214.81 frames. ], batch size: 351, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 15:51:47,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.633e+02 3.463e+02 4.160e+02 5.154e+02 1.063e+03, threshold=8.319e+02, percent-clipped=3.0 2023-06-22 15:52:00,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1263402.0, ans=0.1 2023-06-22 15:52:01,061 INFO [train.py:996] (0/4) Epoch 7, batch 27600, loss[loss=0.2152, simple_loss=0.2776, pruned_loss=0.07639, over 21549.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3142, pruned_loss=0.08656, over 4274519.64 frames. ], batch size: 263, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:52:24,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1263462.0, ans=0.0 2023-06-22 15:53:05,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1263582.0, ans=0.125 2023-06-22 15:53:23,185 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.32 vs. 
limit=15.0 2023-06-22 15:53:31,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1263702.0, ans=0.125 2023-06-22 15:53:32,847 INFO [train.py:996] (0/4) Epoch 7, batch 27650, loss[loss=0.2167, simple_loss=0.2771, pruned_loss=0.07812, over 21455.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3074, pruned_loss=0.08507, over 4277609.13 frames. ], batch size: 194, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:53:36,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1263702.0, ans=0.04949747468305833 2023-06-22 15:53:56,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1263762.0, ans=0.035 2023-06-22 15:54:22,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1263822.0, ans=0.1 2023-06-22 15:54:30,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1263882.0, ans=0.1 2023-06-22 15:54:33,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=22.5 2023-06-22 15:54:58,116 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 3.224e+02 3.872e+02 5.377e+02 9.163e+02, threshold=7.744e+02, percent-clipped=1.0 2023-06-22 15:55:07,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1263942.0, ans=0.125 2023-06-22 15:55:10,445 INFO [train.py:996] (0/4) Epoch 7, batch 27700, loss[loss=0.2664, simple_loss=0.3423, pruned_loss=0.0953, over 21789.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3081, pruned_loss=0.08306, over 4277336.04 frames. ], batch size: 332, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:55:37,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1264062.0, ans=0.125 2023-06-22 15:55:39,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2023-06-22 15:55:55,409 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-22 15:56:30,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1264242.0, ans=0.95 2023-06-22 15:56:53,371 INFO [train.py:996] (0/4) Epoch 7, batch 27750, loss[loss=0.2222, simple_loss=0.3059, pruned_loss=0.06929, over 21807.00 frames. ], tot_loss[loss=0.239, simple_loss=0.312, pruned_loss=0.08297, over 4285682.51 frames. 
], batch size: 332, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:57:00,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1264302.0, ans=0.2 2023-06-22 15:57:39,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1264422.0, ans=0.125 2023-06-22 15:57:52,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1264482.0, ans=0.125 2023-06-22 15:58:17,880 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.542e+02 4.380e+02 5.826e+02 1.163e+03, threshold=8.759e+02, percent-clipped=13.0 2023-06-22 15:58:25,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1264602.0, ans=0.2 2023-06-22 15:58:26,091 INFO [train.py:996] (0/4) Epoch 7, batch 27800, loss[loss=0.2282, simple_loss=0.2955, pruned_loss=0.08044, over 21632.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3097, pruned_loss=0.08322, over 4280208.37 frames. ], batch size: 195, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:58:47,371 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.30 vs. limit=6.0 2023-06-22 15:58:56,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1264662.0, ans=0.05 2023-06-22 15:59:16,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1264722.0, ans=0.125 2023-06-22 15:59:25,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1264782.0, ans=0.1 2023-06-22 16:00:09,313 INFO [train.py:996] (0/4) Epoch 7, batch 27850, loss[loss=0.2043, simple_loss=0.2691, pruned_loss=0.06973, over 21204.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3085, pruned_loss=0.08433, over 4283182.75 frames. ], batch size: 608, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:00:53,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1265022.0, ans=0.0 2023-06-22 16:01:41,973 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-22 16:01:44,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.686e+02 3.358e+02 4.006e+02 5.079e+02 1.358e+03, threshold=8.013e+02, percent-clipped=6.0 2023-06-22 16:01:50,960 INFO [train.py:996] (0/4) Epoch 7, batch 27900, loss[loss=0.2618, simple_loss=0.3569, pruned_loss=0.08337, over 21620.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3167, pruned_loss=0.08512, over 4281845.80 frames. ], batch size: 441, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 16:01:51,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1265202.0, ans=0.125 2023-06-22 16:02:13,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1265262.0, ans=0.0 2023-06-22 16:03:04,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.73 vs. 
limit=15.0 2023-06-22 16:03:30,628 INFO [train.py:996] (0/4) Epoch 7, batch 27950, loss[loss=0.2251, simple_loss=0.3152, pruned_loss=0.06745, over 21717.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3173, pruned_loss=0.08175, over 4284185.57 frames. ], batch size: 332, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 16:03:56,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1265562.0, ans=0.125 2023-06-22 16:04:24,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1265622.0, ans=0.0 2023-06-22 16:05:01,863 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.331e+02 4.215e+02 5.224e+02 1.025e+03, threshold=8.430e+02, percent-clipped=5.0 2023-06-22 16:05:13,107 INFO [train.py:996] (0/4) Epoch 7, batch 28000, loss[loss=0.2653, simple_loss=0.3339, pruned_loss=0.09841, over 21756.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.316, pruned_loss=0.07977, over 4282719.92 frames. ], batch size: 112, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:05:18,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1265802.0, ans=0.1 2023-06-22 16:05:24,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-22 16:05:33,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1265862.0, ans=0.0 2023-06-22 16:05:59,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1265922.0, ans=0.125 2023-06-22 16:06:03,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1265922.0, ans=0.2 2023-06-22 16:06:22,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1265982.0, ans=0.125 2023-06-22 16:06:25,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1265982.0, ans=0.0 2023-06-22 16:06:25,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1265982.0, ans=0.125 2023-06-22 16:06:30,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1266042.0, ans=0.0 2023-06-22 16:06:47,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1266042.0, ans=0.05 2023-06-22 16:06:47,737 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.20 vs. limit=15.0 2023-06-22 16:06:52,783 INFO [train.py:996] (0/4) Epoch 7, batch 28050, loss[loss=0.2426, simple_loss=0.3385, pruned_loss=0.07333, over 20848.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3148, pruned_loss=0.08151, over 4281770.66 frames. 
], batch size: 608, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:07:06,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1266102.0, ans=0.09899494936611666 2023-06-22 16:07:49,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=12.0 2023-06-22 16:08:16,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1266342.0, ans=0.125 2023-06-22 16:08:25,038 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.562e+02 4.496e+02 6.296e+02 1.484e+03, threshold=8.993e+02, percent-clipped=8.0 2023-06-22 16:08:31,037 INFO [train.py:996] (0/4) Epoch 7, batch 28100, loss[loss=0.2302, simple_loss=0.2919, pruned_loss=0.0842, over 21864.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3119, pruned_loss=0.08118, over 4274556.81 frames. ], batch size: 98, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:08:46,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1266402.0, ans=0.0 2023-06-22 16:09:05,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1266462.0, ans=0.07 2023-06-22 16:09:25,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1266522.0, ans=0.0 2023-06-22 16:10:07,983 INFO [train.py:996] (0/4) Epoch 7, batch 28150, loss[loss=0.2015, simple_loss=0.2656, pruned_loss=0.06875, over 21654.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3045, pruned_loss=0.0812, over 4270517.21 frames. ], batch size: 298, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:11:24,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1266882.0, ans=0.125 2023-06-22 16:11:32,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1266942.0, ans=0.07 2023-06-22 16:11:39,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.795e+02 3.363e+02 3.938e+02 4.881e+02 1.435e+03, threshold=7.877e+02, percent-clipped=4.0 2023-06-22 16:11:40,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1266942.0, ans=0.0 2023-06-22 16:11:46,193 INFO [train.py:996] (0/4) Epoch 7, batch 28200, loss[loss=0.2745, simple_loss=0.335, pruned_loss=0.107, over 21201.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3053, pruned_loss=0.08321, over 4269500.90 frames. ], batch size: 143, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:11:59,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1267002.0, ans=0.09899494936611666 2023-06-22 16:13:07,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1267242.0, ans=0.0 2023-06-22 16:13:23,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1267302.0, ans=0.125 2023-06-22 16:13:24,189 INFO [train.py:996] (0/4) Epoch 7, batch 28250, loss[loss=0.2709, simple_loss=0.3151, pruned_loss=0.1134, over 21445.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3106, pruned_loss=0.08586, over 4269425.15 frames. 
], batch size: 475, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:14:08,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1267422.0, ans=0.0 2023-06-22 16:14:43,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1267542.0, ans=0.0 2023-06-22 16:14:56,642 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.948e+02 3.937e+02 4.825e+02 6.344e+02 1.441e+03, threshold=9.649e+02, percent-clipped=9.0 2023-06-22 16:14:59,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1267542.0, ans=0.0 2023-06-22 16:15:07,531 INFO [train.py:996] (0/4) Epoch 7, batch 28300, loss[loss=0.2234, simple_loss=0.3123, pruned_loss=0.06727, over 21707.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3078, pruned_loss=0.08399, over 4254600.43 frames. ], batch size: 298, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:15:10,763 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.65 vs. limit=8.0 2023-06-22 16:15:26,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1267602.0, ans=0.125 2023-06-22 16:15:36,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0 2023-06-22 16:16:12,523 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.84 vs. limit=22.5 2023-06-22 16:16:13,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1267782.0, ans=0.125 2023-06-22 16:16:13,904 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=22.5 2023-06-22 16:16:51,642 INFO [train.py:996] (0/4) Epoch 7, batch 28350, loss[loss=0.1955, simple_loss=0.3289, pruned_loss=0.03104, over 20790.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3043, pruned_loss=0.07843, over 4253618.45 frames. ], batch size: 607, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:16:53,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1267902.0, ans=0.0 2023-06-22 16:17:03,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0 2023-06-22 16:17:32,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1268022.0, ans=0.125 2023-06-22 16:17:42,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1268082.0, ans=0.04949747468305833 2023-06-22 16:17:58,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1268082.0, ans=0.125 2023-06-22 16:18:16,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.359e+02 3.369e+02 4.514e+02 6.821e+02 1.544e+03, threshold=9.028e+02, percent-clipped=6.0 2023-06-22 16:18:28,279 INFO [train.py:996] (0/4) Epoch 7, batch 28400, loss[loss=0.2695, simple_loss=0.3351, pruned_loss=0.102, over 21329.00 frames. 
], tot_loss[loss=0.2278, simple_loss=0.2995, pruned_loss=0.0781, over 4232540.44 frames. ], batch size: 549, lr: 4.17e-03, grad_scale: 32.0 2023-06-22 16:19:19,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1268322.0, ans=0.125 2023-06-22 16:19:48,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1268442.0, ans=0.0 2023-06-22 16:20:09,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1268502.0, ans=0.125 2023-06-22 16:20:10,455 INFO [train.py:996] (0/4) Epoch 7, batch 28450, loss[loss=0.2292, simple_loss=0.3017, pruned_loss=0.07831, over 21949.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3026, pruned_loss=0.08055, over 4239614.44 frames. ], batch size: 316, lr: 4.17e-03, grad_scale: 32.0 2023-06-22 16:20:10,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1268502.0, ans=0.125 2023-06-22 16:20:36,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1268562.0, ans=0.125 2023-06-22 16:20:38,409 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-22 16:20:50,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1268622.0, ans=0.125 2023-06-22 16:21:12,947 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0 2023-06-22 16:21:18,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1268682.0, ans=0.125 2023-06-22 16:21:29,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1268742.0, ans=0.125 2023-06-22 16:21:38,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1268742.0, ans=0.05 2023-06-22 16:21:44,200 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 3.754e+02 4.591e+02 6.046e+02 1.139e+03, threshold=9.182e+02, percent-clipped=3.0 2023-06-22 16:21:49,037 INFO [train.py:996] (0/4) Epoch 7, batch 28500, loss[loss=0.2947, simple_loss=0.3512, pruned_loss=0.1191, over 21768.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3066, pruned_loss=0.08388, over 4257747.24 frames. ], batch size: 298, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:22:07,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1268802.0, ans=0.0 2023-06-22 16:22:10,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1268862.0, ans=0.125 2023-06-22 16:22:11,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1268862.0, ans=0.2 2023-06-22 16:23:33,163 INFO [train.py:996] (0/4) Epoch 7, batch 28550, loss[loss=0.2771, simple_loss=0.363, pruned_loss=0.09556, over 21750.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3158, pruned_loss=0.08664, over 4263323.71 frames. 
], batch size: 441, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:23:35,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1269102.0, ans=0.0 2023-06-22 16:23:39,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1269102.0, ans=0.125 2023-06-22 16:24:50,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1269342.0, ans=0.2 2023-06-22 16:25:04,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1269342.0, ans=0.0 2023-06-22 16:25:07,214 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.907e+02 3.553e+02 4.352e+02 5.602e+02 1.151e+03, threshold=8.703e+02, percent-clipped=1.0 2023-06-22 16:25:10,391 INFO [train.py:996] (0/4) Epoch 7, batch 28600, loss[loss=0.2444, simple_loss=0.3172, pruned_loss=0.08578, over 21747.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3228, pruned_loss=0.08941, over 4268809.59 frames. ], batch size: 124, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:25:10,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1269402.0, ans=0.0 2023-06-22 16:25:13,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.82 vs. limit=5.0 2023-06-22 16:25:40,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1269462.0, ans=0.0 2023-06-22 16:25:59,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1269522.0, ans=0.125 2023-06-22 16:26:22,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1269582.0, ans=0.2 2023-06-22 16:26:47,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1269702.0, ans=0.0 2023-06-22 16:26:48,966 INFO [train.py:996] (0/4) Epoch 7, batch 28650, loss[loss=0.2229, simple_loss=0.2832, pruned_loss=0.08128, over 21534.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3179, pruned_loss=0.0888, over 4269371.67 frames. ], batch size: 263, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:27:00,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1269702.0, ans=0.0 2023-06-22 16:27:31,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1269822.0, ans=0.04949747468305833 2023-06-22 16:27:51,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1269882.0, ans=10.0 2023-06-22 16:28:16,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-22 16:28:24,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.670e+02 4.036e+02 5.079e+02 7.151e+02 1.408e+03, threshold=1.016e+03, percent-clipped=11.0 2023-06-22 16:28:32,064 INFO [train.py:996] (0/4) Epoch 7, batch 28700, loss[loss=0.1947, simple_loss=0.253, pruned_loss=0.06814, over 21254.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3179, pruned_loss=0.08977, over 4262110.47 frames. 
], batch size: 549, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:28:37,965 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.78 vs. limit=5.0 2023-06-22 16:28:42,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-22 16:29:31,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-22 16:30:05,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1270242.0, ans=0.125 2023-06-22 16:30:09,819 INFO [train.py:996] (0/4) Epoch 7, batch 28750, loss[loss=0.274, simple_loss=0.3307, pruned_loss=0.1086, over 21301.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3184, pruned_loss=0.08962, over 4261126.62 frames. ], batch size: 143, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:30:16,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1270302.0, ans=0.125 2023-06-22 16:30:27,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1270302.0, ans=0.1 2023-06-22 16:30:29,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1270362.0, ans=0.125 2023-06-22 16:31:45,887 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.805e+02 3.311e+02 3.906e+02 4.905e+02 1.219e+03, threshold=7.811e+02, percent-clipped=5.0 2023-06-22 16:31:46,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1270542.0, ans=0.035 2023-06-22 16:31:49,113 INFO [train.py:996] (0/4) Epoch 7, batch 28800, loss[loss=0.2756, simple_loss=0.3458, pruned_loss=0.1028, over 21780.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3214, pruned_loss=0.09039, over 4270225.05 frames. ], batch size: 332, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:32:05,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1270602.0, ans=0.125 2023-06-22 16:32:09,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1270662.0, ans=0.0 2023-06-22 16:32:23,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. 
limit=6.0 2023-06-22 16:32:55,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1270782.0, ans=0.125 2023-06-22 16:32:57,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1270782.0, ans=0.125 2023-06-22 16:33:03,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1270782.0, ans=0.125 2023-06-22 16:33:05,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1270842.0, ans=0.0 2023-06-22 16:33:09,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1270842.0, ans=0.125 2023-06-22 16:33:30,952 INFO [train.py:996] (0/4) Epoch 7, batch 28850, loss[loss=0.2638, simple_loss=0.377, pruned_loss=0.07532, over 19962.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3227, pruned_loss=0.0919, over 4275034.99 frames. ], batch size: 702, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:34:12,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1271022.0, ans=0.125 2023-06-22 16:35:03,020 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.913e+02 3.986e+02 4.811e+02 7.732e+02 1.672e+03, threshold=9.622e+02, percent-clipped=22.0 2023-06-22 16:35:06,510 INFO [train.py:996] (0/4) Epoch 7, batch 28900, loss[loss=0.269, simple_loss=0.3229, pruned_loss=0.1076, over 21614.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3264, pruned_loss=0.09383, over 4278354.53 frames. ], batch size: 548, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:35:08,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1271202.0, ans=0.2 2023-06-22 16:35:33,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1271262.0, ans=0.125 2023-06-22 16:35:55,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1271322.0, ans=0.1 2023-06-22 16:35:58,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1271322.0, ans=0.0 2023-06-22 16:36:42,232 INFO [train.py:996] (0/4) Epoch 7, batch 28950, loss[loss=0.2646, simple_loss=0.3373, pruned_loss=0.09589, over 21764.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3284, pruned_loss=0.09332, over 4272782.43 frames. ], batch size: 332, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:37:02,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1271502.0, ans=0.125 2023-06-22 16:37:54,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1271682.0, ans=0.0 2023-06-22 16:38:23,465 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.694e+02 3.738e+02 4.662e+02 6.144e+02 1.231e+03, threshold=9.324e+02, percent-clipped=1.0 2023-06-22 16:38:31,537 INFO [train.py:996] (0/4) Epoch 7, batch 29000, loss[loss=0.2578, simple_loss=0.3323, pruned_loss=0.09169, over 21340.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3299, pruned_loss=0.0913, over 4269333.06 frames. 
], batch size: 549, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:38:46,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1271862.0, ans=0.2 2023-06-22 16:38:48,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1271862.0, ans=0.0 2023-06-22 16:38:49,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1271862.0, ans=0.1 2023-06-22 16:38:54,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1271862.0, ans=0.025 2023-06-22 16:39:21,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1271922.0, ans=0.125 2023-06-22 16:39:25,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1271982.0, ans=0.2 2023-06-22 16:39:27,138 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-212000.pt 2023-06-22 16:39:58,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1272042.0, ans=0.125 2023-06-22 16:39:59,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.71 vs. limit=8.0 2023-06-22 16:40:05,726 INFO [train.py:996] (0/4) Epoch 7, batch 29050, loss[loss=0.2251, simple_loss=0.2968, pruned_loss=0.07667, over 21662.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3276, pruned_loss=0.09132, over 4272011.03 frames. ], batch size: 263, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:40:27,307 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.80 vs. limit=22.5 2023-06-22 16:40:35,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1272162.0, ans=0.125 2023-06-22 16:41:00,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1272282.0, ans=0.0 2023-06-22 16:41:06,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1272282.0, ans=0.1 2023-06-22 16:41:37,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1272342.0, ans=0.2 2023-06-22 16:41:39,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.766e+02 3.571e+02 4.314e+02 5.526e+02 9.643e+02, threshold=8.628e+02, percent-clipped=2.0 2023-06-22 16:41:42,571 INFO [train.py:996] (0/4) Epoch 7, batch 29100, loss[loss=0.1915, simple_loss=0.2556, pruned_loss=0.06366, over 21620.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3183, pruned_loss=0.08887, over 4281847.13 frames. ], batch size: 298, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:42:18,018 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. 
limit=6.0 2023-06-22 16:42:40,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1272582.0, ans=0.0 2023-06-22 16:43:07,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1272642.0, ans=0.125 2023-06-22 16:43:19,105 INFO [train.py:996] (0/4) Epoch 7, batch 29150, loss[loss=0.2337, simple_loss=0.3096, pruned_loss=0.07887, over 21675.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3158, pruned_loss=0.08686, over 4285213.14 frames. ], batch size: 247, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:43:33,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=22.5 2023-06-22 16:43:36,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1272762.0, ans=0.125 2023-06-22 16:44:27,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1272882.0, ans=0.125 2023-06-22 16:44:52,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.751e+02 3.598e+02 4.101e+02 5.513e+02 1.275e+03, threshold=8.201e+02, percent-clipped=6.0 2023-06-22 16:44:54,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1273002.0, ans=0.125 2023-06-22 16:44:55,717 INFO [train.py:996] (0/4) Epoch 7, batch 29200, loss[loss=0.2106, simple_loss=0.2761, pruned_loss=0.07254, over 20063.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3106, pruned_loss=0.08586, over 4274524.26 frames. ], batch size: 702, lr: 4.17e-03, grad_scale: 32.0 2023-06-22 16:45:04,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1273002.0, ans=0.125 2023-06-22 16:45:23,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1273062.0, ans=0.0 2023-06-22 16:45:34,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-22 16:45:41,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1273122.0, ans=0.1 2023-06-22 16:45:52,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1273182.0, ans=0.04949747468305833 2023-06-22 16:45:59,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-22 16:46:35,604 INFO [train.py:996] (0/4) Epoch 7, batch 29250, loss[loss=0.2217, simple_loss=0.2945, pruned_loss=0.07449, over 21232.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3078, pruned_loss=0.0831, over 4271689.08 frames. 
], batch size: 176, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:47:31,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1273482.0, ans=0.125 2023-06-22 16:48:10,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.544e+02 3.452e+02 4.218e+02 5.407e+02 1.140e+03, threshold=8.437e+02, percent-clipped=6.0 2023-06-22 16:48:13,965 INFO [train.py:996] (0/4) Epoch 7, batch 29300, loss[loss=0.2123, simple_loss=0.2938, pruned_loss=0.06542, over 19753.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.31, pruned_loss=0.08254, over 4274256.32 frames. ], batch size: 703, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:48:16,026 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:48:45,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1273662.0, ans=0.0 2023-06-22 16:48:47,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1273662.0, ans=0.125 2023-06-22 16:49:05,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1273722.0, ans=0.125 2023-06-22 16:49:08,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1273782.0, ans=0.125 2023-06-22 16:49:48,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.01 vs. limit=15.0 2023-06-22 16:49:52,321 INFO [train.py:996] (0/4) Epoch 7, batch 29350, loss[loss=0.2426, simple_loss=0.303, pruned_loss=0.09107, over 21365.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3069, pruned_loss=0.08259, over 4269810.56 frames. ], batch size: 131, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:50:12,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1273902.0, ans=0.125 2023-06-22 16:50:16,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1273962.0, ans=0.0 2023-06-22 16:50:22,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1273962.0, ans=0.125 2023-06-22 16:51:29,120 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.749e+02 3.690e+02 4.735e+02 5.967e+02 1.066e+03, threshold=9.470e+02, percent-clipped=8.0 2023-06-22 16:51:30,501 INFO [train.py:996] (0/4) Epoch 7, batch 29400, loss[loss=0.2113, simple_loss=0.2934, pruned_loss=0.06457, over 21754.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3063, pruned_loss=0.07965, over 4266950.45 frames. 
], batch size: 352, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:51:54,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1274262.0, ans=0.0 2023-06-22 16:51:58,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1274262.0, ans=0.0 2023-06-22 16:52:42,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1274382.0, ans=0.125 2023-06-22 16:52:45,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1274382.0, ans=0.125 2023-06-22 16:53:06,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1274442.0, ans=0.0 2023-06-22 16:53:09,804 INFO [train.py:996] (0/4) Epoch 7, batch 29450, loss[loss=0.2552, simple_loss=0.3325, pruned_loss=0.08901, over 21739.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3035, pruned_loss=0.07797, over 4273279.45 frames. ], batch size: 332, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:53:24,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1274562.0, ans=0.0 2023-06-22 16:53:44,525 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-22 16:54:06,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1274682.0, ans=0.1 2023-06-22 16:54:28,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1274742.0, ans=0.1 2023-06-22 16:54:41,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.31 vs. limit=15.0 2023-06-22 16:54:41,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.353e+02 4.158e+02 5.429e+02 7.361e+02 1.574e+03, threshold=1.086e+03, percent-clipped=7.0 2023-06-22 16:54:43,573 INFO [train.py:996] (0/4) Epoch 7, batch 29500, loss[loss=0.2577, simple_loss=0.3148, pruned_loss=0.1003, over 21568.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3105, pruned_loss=0.08272, over 4272781.04 frames. ], batch size: 548, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:56:21,930 INFO [train.py:996] (0/4) Epoch 7, batch 29550, loss[loss=0.229, simple_loss=0.298, pruned_loss=0.07994, over 21338.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3109, pruned_loss=0.08464, over 4278256.46 frames. ], batch size: 159, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:56:25,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.58 vs. 
limit=15.0 2023-06-22 16:57:00,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1275222.0, ans=0.125 2023-06-22 16:57:16,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1275222.0, ans=0.125 2023-06-22 16:57:19,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1275282.0, ans=0.125 2023-06-22 16:57:59,482 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.724e+02 4.638e+02 7.025e+02 1.809e+03, threshold=9.276e+02, percent-clipped=5.0 2023-06-22 16:58:01,085 INFO [train.py:996] (0/4) Epoch 7, batch 29600, loss[loss=0.3007, simple_loss=0.3801, pruned_loss=0.1106, over 21766.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3179, pruned_loss=0.08756, over 4286514.86 frames. ], batch size: 332, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:58:08,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-22 16:58:20,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1275462.0, ans=0.04949747468305833 2023-06-22 16:58:27,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-22 16:58:28,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1275462.0, ans=0.09899494936611666 2023-06-22 16:58:40,680 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.02 vs. limit=12.0 2023-06-22 16:59:08,979 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-22 16:59:21,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1275642.0, ans=0.0 2023-06-22 16:59:33,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-06-22 16:59:37,962 INFO [train.py:996] (0/4) Epoch 7, batch 29650, loss[loss=0.204, simple_loss=0.2757, pruned_loss=0.06618, over 21794.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3169, pruned_loss=0.08471, over 4288512.93 frames. ], batch size: 282, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:59:46,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1275702.0, ans=0.125 2023-06-22 16:59:52,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1275762.0, ans=0.125 2023-06-22 17:01:16,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.826e+02 4.944e+02 6.202e+02 1.000e+03, threshold=9.888e+02, percent-clipped=1.0 2023-06-22 17:01:16,770 INFO [train.py:996] (0/4) Epoch 7, batch 29700, loss[loss=0.3212, simple_loss=0.4188, pruned_loss=0.1118, over 21537.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3175, pruned_loss=0.08463, over 4293143.27 frames. 
], batch size: 471, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:02:52,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1276242.0, ans=0.125 2023-06-22 17:02:55,352 INFO [train.py:996] (0/4) Epoch 7, batch 29750, loss[loss=0.3077, simple_loss=0.3834, pruned_loss=0.116, over 21537.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3219, pruned_loss=0.08456, over 4281547.93 frames. ], batch size: 507, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:03:00,417 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:04:08,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1276482.0, ans=0.0 2023-06-22 17:04:32,159 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.731e+02 3.419e+02 3.988e+02 5.118e+02 1.049e+03, threshold=7.976e+02, percent-clipped=2.0 2023-06-22 17:04:32,190 INFO [train.py:996] (0/4) Epoch 7, batch 29800, loss[loss=0.265, simple_loss=0.3297, pruned_loss=0.1001, over 21883.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3225, pruned_loss=0.0849, over 4275183.84 frames. ], batch size: 351, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:04:34,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1276602.0, ans=0.0 2023-06-22 17:04:45,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2023-06-22 17:05:08,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1276662.0, ans=0.125 2023-06-22 17:06:10,689 INFO [train.py:996] (0/4) Epoch 7, batch 29850, loss[loss=0.1941, simple_loss=0.2707, pruned_loss=0.05882, over 21559.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3175, pruned_loss=0.08251, over 4283877.02 frames. ], batch size: 212, lr: 4.16e-03, grad_scale: 8.0 2023-06-22 17:06:22,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1276902.0, ans=0.125 2023-06-22 17:06:30,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1276962.0, ans=0.0 2023-06-22 17:06:59,930 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.93 vs. limit=15.0 2023-06-22 17:07:02,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1277022.0, ans=0.125 2023-06-22 17:07:05,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1277022.0, ans=0.2 2023-06-22 17:07:48,878 INFO [train.py:996] (0/4) Epoch 7, batch 29900, loss[loss=0.2554, simple_loss=0.3181, pruned_loss=0.09634, over 21654.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3167, pruned_loss=0.08345, over 4284713.10 frames. 
], batch size: 230, lr: 4.16e-03, grad_scale: 8.0 2023-06-22 17:07:50,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.619e+02 3.325e+02 3.983e+02 5.006e+02 1.426e+03, threshold=7.966e+02, percent-clipped=5.0 2023-06-22 17:08:35,801 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-22 17:09:03,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1277382.0, ans=0.125 2023-06-22 17:09:22,735 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5 2023-06-22 17:09:25,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1277442.0, ans=0.125 2023-06-22 17:09:33,573 INFO [train.py:996] (0/4) Epoch 7, batch 29950, loss[loss=0.2571, simple_loss=0.3264, pruned_loss=0.09391, over 21422.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3224, pruned_loss=0.08786, over 4280537.09 frames. ], batch size: 549, lr: 4.16e-03, grad_scale: 8.0 2023-06-22 17:09:36,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1277502.0, ans=0.1 2023-06-22 17:10:01,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1277562.0, ans=0.125 2023-06-22 17:10:05,371 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-22 17:10:49,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1277682.0, ans=0.0 2023-06-22 17:11:13,200 INFO [train.py:996] (0/4) Epoch 7, batch 30000, loss[loss=0.2314, simple_loss=0.3246, pruned_loss=0.06912, over 21705.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.324, pruned_loss=0.08815, over 4280561.73 frames. ], batch size: 298, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:11:13,201 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 17:11:34,230 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2473, simple_loss=0.3461, pruned_loss=0.0743, over 1796401.00 frames. 2023-06-22 17:11:34,230 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-22 17:11:36,057 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.688e+02 3.810e+02 4.424e+02 5.666e+02 1.321e+03, threshold=8.847e+02, percent-clipped=8.0 2023-06-22 17:11:37,375 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. 
limit=15.0 2023-06-22 17:11:41,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1277802.0, ans=0.0 2023-06-22 17:12:17,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1277922.0, ans=15.0 2023-06-22 17:12:23,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1277922.0, ans=0.125 2023-06-22 17:12:30,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1277922.0, ans=0.125 2023-06-22 17:12:47,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1277982.0, ans=0.2 2023-06-22 17:13:03,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1278042.0, ans=0.04949747468305833 2023-06-22 17:13:26,191 INFO [train.py:996] (0/4) Epoch 7, batch 30050, loss[loss=0.2951, simple_loss=0.3916, pruned_loss=0.09931, over 21861.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3267, pruned_loss=0.08485, over 4273472.42 frames. ], batch size: 372, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:14:29,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1278282.0, ans=0.125 2023-06-22 17:14:44,041 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=15.0 2023-06-22 17:15:03,333 INFO [train.py:996] (0/4) Epoch 7, batch 30100, loss[loss=0.1942, simple_loss=0.2475, pruned_loss=0.07043, over 19979.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3249, pruned_loss=0.08406, over 4261328.29 frames. ], batch size: 702, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:15:04,868 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.804e+02 3.663e+02 4.877e+02 6.196e+02 1.469e+03, threshold=9.754e+02, percent-clipped=9.0 2023-06-22 17:15:09,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-22 17:15:45,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1278522.0, ans=0.125 2023-06-22 17:15:56,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1278522.0, ans=0.125 2023-06-22 17:16:26,702 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=15.0 2023-06-22 17:16:34,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1278642.0, ans=0.125 2023-06-22 17:16:41,795 INFO [train.py:996] (0/4) Epoch 7, batch 30150, loss[loss=0.2644, simple_loss=0.3269, pruned_loss=0.1009, over 21362.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.32, pruned_loss=0.08569, over 4265152.87 frames. ], batch size: 176, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:16:52,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.70 vs. 
limit=15.0 2023-06-22 17:17:16,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1278762.0, ans=0.125 2023-06-22 17:18:24,053 INFO [train.py:996] (0/4) Epoch 7, batch 30200, loss[loss=0.248, simple_loss=0.3389, pruned_loss=0.0786, over 21599.00 frames. ], tot_loss[loss=0.247, simple_loss=0.324, pruned_loss=0.08499, over 4269360.68 frames. ], batch size: 414, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:18:25,706 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.591e+02 3.500e+02 4.327e+02 6.195e+02 1.104e+03, threshold=8.654e+02, percent-clipped=5.0 2023-06-22 17:18:44,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1279062.0, ans=0.0 2023-06-22 17:19:01,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1279062.0, ans=0.125 2023-06-22 17:19:50,751 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-06-22 17:20:08,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1279302.0, ans=0.125 2023-06-22 17:20:09,072 INFO [train.py:996] (0/4) Epoch 7, batch 30250, loss[loss=0.2488, simple_loss=0.3509, pruned_loss=0.07332, over 21469.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3298, pruned_loss=0.08721, over 4263993.53 frames. ], batch size: 211, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:20:12,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1279302.0, ans=0.0 2023-06-22 17:20:51,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1279422.0, ans=0.2 2023-06-22 17:21:23,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1279482.0, ans=0.025 2023-06-22 17:21:48,237 INFO [train.py:996] (0/4) Epoch 7, batch 30300, loss[loss=0.1907, simple_loss=0.2511, pruned_loss=0.06513, over 20698.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3264, pruned_loss=0.08701, over 4260280.05 frames. ], batch size: 607, lr: 4.15e-03, grad_scale: 16.0 2023-06-22 17:21:49,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.688e+02 4.197e+02 5.234e+02 7.319e+02 1.495e+03, threshold=1.047e+03, percent-clipped=13.0 2023-06-22 17:22:05,210 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:22:08,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1279662.0, ans=0.0 2023-06-22 17:22:22,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1279662.0, ans=0.125 2023-06-22 17:22:24,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.13 vs. limit=10.0 2023-06-22 17:22:53,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1279782.0, ans=0.2 2023-06-22 17:23:05,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.03 vs. 
limit=15.0 2023-06-22 17:23:14,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279842.0, ans=0.1 2023-06-22 17:23:33,759 INFO [train.py:996] (0/4) Epoch 7, batch 30350, loss[loss=0.2657, simple_loss=0.364, pruned_loss=0.08372, over 20778.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3274, pruned_loss=0.08837, over 4265736.19 frames. ], batch size: 607, lr: 4.15e-03, grad_scale: 16.0 2023-06-22 17:24:02,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1279962.0, ans=10.0 2023-06-22 17:24:56,728 INFO [train.py:996] (0/4) Epoch 7, batch 30400, loss[loss=0.2483, simple_loss=0.3017, pruned_loss=0.09749, over 20256.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3209, pruned_loss=0.08645, over 4250669.36 frames. ], batch size: 703, lr: 4.15e-03, grad_scale: 32.0 2023-06-22 17:24:58,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.840e+02 4.001e+02 6.030e+02 8.810e+02 1.556e+03, threshold=1.206e+03, percent-clipped=18.0 2023-06-22 17:25:18,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1280262.0, ans=0.0 2023-06-22 17:25:20,473 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. limit=10.0 2023-06-22 17:25:31,157 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-22 17:25:56,215 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:26:21,157 INFO [train.py:996] (0/4) Epoch 7, batch 30450, loss[loss=0.2636, simple_loss=0.3714, pruned_loss=0.07794, over 19917.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3213, pruned_loss=0.08559, over 4193773.52 frames. ], batch size: 702, lr: 4.15e-03, grad_scale: 32.0 2023-06-22 17:26:34,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1280502.0, ans=0.0 2023-06-22 17:26:46,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1280562.0, ans=0.125 2023-06-22 17:27:32,574 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/epoch-7.pt 2023-06-22 17:29:06,161 INFO [train.py:996] (0/4) Epoch 8, batch 0, loss[loss=0.2311, simple_loss=0.2997, pruned_loss=0.08126, over 21658.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2997, pruned_loss=0.08126, over 21658.00 frames. ], batch size: 333, lr: 3.86e-03, grad_scale: 32.0 2023-06-22 17:29:06,162 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 17:29:21,699 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2437, simple_loss=0.3524, pruned_loss=0.06749, over 1796401.00 frames. 2023-06-22 17:29:21,700 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-22 17:29:30,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.318e+02 7.236e+02 1.078e+03 1.767e+03 4.535e+03, threshold=2.157e+03, percent-clipped=44.0 2023-06-22 17:29:40,684 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.52 vs. 
limit=15.0 2023-06-22 17:29:46,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2023-06-22 17:30:23,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1280952.0, ans=0.125 2023-06-22 17:31:00,100 INFO [train.py:996] (0/4) Epoch 8, batch 50, loss[loss=0.2602, simple_loss=0.3453, pruned_loss=0.08758, over 21771.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3201, pruned_loss=0.08557, over 952497.73 frames. ], batch size: 351, lr: 3.86e-03, grad_scale: 32.0 2023-06-22 17:31:00,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-22 17:31:05,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1281072.0, ans=0.09899494936611666 2023-06-22 17:31:22,235 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:31:30,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=22.5 2023-06-22 17:32:05,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1281252.0, ans=0.1 2023-06-22 17:32:26,689 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-22 17:32:33,435 INFO [train.py:996] (0/4) Epoch 8, batch 100, loss[loss=0.2496, simple_loss=0.3409, pruned_loss=0.07918, over 21348.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3365, pruned_loss=0.0894, over 1685779.59 frames. ], batch size: 159, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:32:44,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.771e+02 3.609e+02 4.818e+02 6.662e+02 2.202e+03, threshold=9.637e+02, percent-clipped=1.0 2023-06-22 17:32:58,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1281432.0, ans=0.2 2023-06-22 17:33:14,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1281492.0, ans=0.125 2023-06-22 17:34:06,766 INFO [train.py:996] (0/4) Epoch 8, batch 150, loss[loss=0.2755, simple_loss=0.355, pruned_loss=0.09797, over 21747.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3415, pruned_loss=0.08972, over 2243688.28 frames. 
], batch size: 441, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:34:29,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1281732.0, ans=0.2 2023-06-22 17:35:16,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1281852.0, ans=0.125 2023-06-22 17:35:25,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1281912.0, ans=0.125 2023-06-22 17:35:36,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1281912.0, ans=0.2 2023-06-22 17:35:39,239 INFO [train.py:996] (0/4) Epoch 8, batch 200, loss[loss=0.2077, simple_loss=0.2869, pruned_loss=0.06424, over 21156.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3414, pruned_loss=0.089, over 2689140.07 frames. ], batch size: 159, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:35:49,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.926e+02 4.076e+02 5.203e+02 6.716e+02 1.490e+03, threshold=1.041e+03, percent-clipped=7.0 2023-06-22 17:35:55,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1282032.0, ans=0.125 2023-06-22 17:36:32,165 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.55 vs. limit=15.0 2023-06-22 17:36:39,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-22 17:37:14,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1282212.0, ans=0.0 2023-06-22 17:37:18,896 INFO [train.py:996] (0/4) Epoch 8, batch 250, loss[loss=0.2549, simple_loss=0.3161, pruned_loss=0.0969, over 21938.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3357, pruned_loss=0.08857, over 3043157.09 frames. ], batch size: 316, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:37:43,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=12.0 2023-06-22 17:38:21,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1282452.0, ans=0.125 2023-06-22 17:38:31,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1282452.0, ans=0.1 2023-06-22 17:38:45,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1282512.0, ans=0.125 2023-06-22 17:38:54,780 INFO [train.py:996] (0/4) Epoch 8, batch 300, loss[loss=0.2669, simple_loss=0.356, pruned_loss=0.08893, over 21663.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3293, pruned_loss=0.08747, over 3321391.81 frames. 
], batch size: 389, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:39:06,257 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.679e+02 4.058e+02 5.447e+02 7.307e+02 1.512e+03, threshold=1.089e+03, percent-clipped=7.0 2023-06-22 17:39:09,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1282632.0, ans=0.0 2023-06-22 17:39:12,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=11.80 vs. limit=15.0 2023-06-22 17:39:59,285 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-22 17:40:35,187 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=22.5 2023-06-22 17:40:35,728 INFO [train.py:996] (0/4) Epoch 8, batch 350, loss[loss=0.2045, simple_loss=0.2712, pruned_loss=0.06892, over 21647.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3214, pruned_loss=0.08495, over 3532193.29 frames. ], batch size: 282, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:40:59,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1282932.0, ans=0.125 2023-06-22 17:41:32,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-22 17:42:06,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1283112.0, ans=0.125 2023-06-22 17:42:14,821 INFO [train.py:996] (0/4) Epoch 8, batch 400, loss[loss=0.2151, simple_loss=0.2792, pruned_loss=0.0755, over 21830.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3153, pruned_loss=0.0837, over 3706450.88 frames. ], batch size: 352, lr: 3.86e-03, grad_scale: 32.0 2023-06-22 17:42:21,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1283172.0, ans=0.125 2023-06-22 17:42:21,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1283172.0, ans=0.1 2023-06-22 17:42:26,000 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.922e+02 3.737e+02 4.920e+02 6.486e+02 1.177e+03, threshold=9.840e+02, percent-clipped=3.0 2023-06-22 17:42:26,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1283172.0, ans=0.125 2023-06-22 17:43:22,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1283352.0, ans=0.1 2023-06-22 17:43:27,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1283352.0, ans=0.1 2023-06-22 17:43:39,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-06-22 17:43:54,190 INFO [train.py:996] (0/4) Epoch 8, batch 450, loss[loss=0.2215, simple_loss=0.3192, pruned_loss=0.06187, over 21501.00 frames. 
], tot_loss[loss=0.2373, simple_loss=0.3115, pruned_loss=0.08155, over 3837998.94 frames. ], batch size: 471, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:44:07,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1283472.0, ans=0.05 2023-06-22 17:44:10,815 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:45:17,352 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-22 17:45:29,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1283712.0, ans=0.0 2023-06-22 17:45:32,293 INFO [train.py:996] (0/4) Epoch 8, batch 500, loss[loss=0.2388, simple_loss=0.3158, pruned_loss=0.0809, over 21247.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3124, pruned_loss=0.08134, over 3939318.50 frames. ], batch size: 159, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:45:59,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.804e+02 3.923e+02 5.554e+02 7.720e+02 1.831e+03, threshold=1.111e+03, percent-clipped=13.0 2023-06-22 17:46:12,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1283832.0, ans=0.07 2023-06-22 17:46:28,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0 2023-06-22 17:46:30,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-22 17:46:39,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1283952.0, ans=0.0 2023-06-22 17:47:15,027 INFO [train.py:996] (0/4) Epoch 8, batch 550, loss[loss=0.3705, simple_loss=0.4544, pruned_loss=0.1433, over 21468.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3158, pruned_loss=0.08097, over 4020796.25 frames. ], batch size: 507, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:47:18,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1284072.0, ans=0.125 2023-06-22 17:47:56,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1284192.0, ans=0.0 2023-06-22 17:48:23,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1284252.0, ans=0.0 2023-06-22 17:48:28,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2023-06-22 17:48:34,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1284312.0, ans=0.0 2023-06-22 17:48:46,270 INFO [train.py:996] (0/4) Epoch 8, batch 600, loss[loss=0.2893, simple_loss=0.3926, pruned_loss=0.09301, over 21606.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3194, pruned_loss=0.08237, over 4083987.82 frames. 
], batch size: 441, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:49:08,682 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.731e+02 3.844e+02 4.934e+02 7.871e+02 2.167e+03, threshold=9.868e+02, percent-clipped=19.0 2023-06-22 17:49:19,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.66 vs. limit=15.0 2023-06-22 17:49:32,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1284492.0, ans=0.125 2023-06-22 17:49:35,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1284492.0, ans=0.0 2023-06-22 17:49:49,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1284552.0, ans=0.0 2023-06-22 17:50:13,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-22 17:50:23,365 INFO [train.py:996] (0/4) Epoch 8, batch 650, loss[loss=0.2583, simple_loss=0.3866, pruned_loss=0.06495, over 20803.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3206, pruned_loss=0.0822, over 4131060.97 frames. ], batch size: 607, lr: 3.85e-03, grad_scale: 8.0 2023-06-22 17:50:25,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1284672.0, ans=0.1 2023-06-22 17:50:29,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1284672.0, ans=0.125 2023-06-22 17:50:43,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1284732.0, ans=0.2 2023-06-22 17:51:19,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1284852.0, ans=0.1 2023-06-22 17:51:35,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1284912.0, ans=0.1 2023-06-22 17:51:56,385 INFO [train.py:996] (0/4) Epoch 8, batch 700, loss[loss=0.2736, simple_loss=0.3378, pruned_loss=0.1047, over 21890.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3224, pruned_loss=0.08283, over 4167525.14 frames. ], batch size: 118, lr: 3.85e-03, grad_scale: 8.0 2023-06-22 17:52:20,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.941e+02 4.186e+02 5.348e+02 7.319e+02 1.415e+03, threshold=1.070e+03, percent-clipped=6.0 2023-06-22 17:52:40,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1285092.0, ans=0.0 2023-06-22 17:53:14,305 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:53:29,838 INFO [train.py:996] (0/4) Epoch 8, batch 750, loss[loss=0.2213, simple_loss=0.2792, pruned_loss=0.08171, over 15453.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3199, pruned_loss=0.0835, over 4188386.64 frames. 
], batch size: 61, lr: 3.85e-03, grad_scale: 8.0 2023-06-22 17:53:33,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1285272.0, ans=0.125 2023-06-22 17:53:41,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1285272.0, ans=0.1 2023-06-22 17:54:03,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1285332.0, ans=0.0 2023-06-22 17:54:06,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1285332.0, ans=0.125 2023-06-22 17:55:07,721 INFO [train.py:996] (0/4) Epoch 8, batch 800, loss[loss=0.2301, simple_loss=0.2966, pruned_loss=0.08182, over 21849.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3171, pruned_loss=0.08373, over 4214222.29 frames. ], batch size: 107, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:55:35,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.791e+02 4.052e+02 4.658e+02 6.687e+02 1.387e+03, threshold=9.317e+02, percent-clipped=3.0 2023-06-22 17:55:50,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1285632.0, ans=0.2 2023-06-22 17:56:54,349 INFO [train.py:996] (0/4) Epoch 8, batch 850, loss[loss=0.2347, simple_loss=0.3058, pruned_loss=0.08179, over 21880.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3158, pruned_loss=0.08379, over 4231722.25 frames. ], batch size: 414, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:57:06,389 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-22 17:57:27,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1285932.0, ans=0.125 2023-06-22 17:57:58,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-22 17:58:33,078 INFO [train.py:996] (0/4) Epoch 8, batch 900, loss[loss=0.2589, simple_loss=0.3021, pruned_loss=0.1079, over 21453.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3118, pruned_loss=0.08379, over 4245084.77 frames. ], batch size: 508, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:58:47,803 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.900e+02 3.949e+02 5.086e+02 6.787e+02 1.769e+03, threshold=1.017e+03, percent-clipped=9.0 2023-06-22 17:59:01,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1286232.0, ans=0.0 2023-06-22 17:59:06,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1286292.0, ans=0.2 2023-06-22 18:00:12,921 INFO [train.py:996] (0/4) Epoch 8, batch 950, loss[loss=0.2257, simple_loss=0.3024, pruned_loss=0.07449, over 21861.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3097, pruned_loss=0.08362, over 4258190.66 frames. 
], batch size: 298, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:00:22,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1286472.0, ans=0.125 2023-06-22 18:00:23,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1286472.0, ans=0.1 2023-06-22 18:00:35,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1286532.0, ans=0.125 2023-06-22 18:00:44,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.15 vs. limit=12.0 2023-06-22 18:01:51,053 INFO [train.py:996] (0/4) Epoch 8, batch 1000, loss[loss=0.2322, simple_loss=0.3105, pruned_loss=0.07696, over 21678.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3115, pruned_loss=0.08412, over 4268252.31 frames. ], batch size: 389, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:02:05,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.829e+02 3.565e+02 4.375e+02 6.228e+02 1.305e+03, threshold=8.750e+02, percent-clipped=2.0 2023-06-22 18:03:32,524 INFO [train.py:996] (0/4) Epoch 8, batch 1050, loss[loss=0.2071, simple_loss=0.2861, pruned_loss=0.06409, over 21762.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3101, pruned_loss=0.08251, over 4265878.75 frames. ], batch size: 124, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:03:50,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1287132.0, ans=0.125 2023-06-22 18:04:55,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1287312.0, ans=0.0 2023-06-22 18:05:03,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1287312.0, ans=0.0 2023-06-22 18:05:07,894 INFO [train.py:996] (0/4) Epoch 8, batch 1100, loss[loss=0.1913, simple_loss=0.2901, pruned_loss=0.04629, over 21721.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3101, pruned_loss=0.08266, over 4271892.91 frames. ], batch size: 351, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:05:08,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1287372.0, ans=0.125 2023-06-22 18:05:21,944 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.840e+02 4.476e+02 5.815e+02 7.362e+02 1.371e+03, threshold=1.163e+03, percent-clipped=15.0 2023-06-22 18:05:22,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1287432.0, ans=0.0 2023-06-22 18:05:27,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1287432.0, ans=0.95 2023-06-22 18:05:40,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1287432.0, ans=15.0 2023-06-22 18:06:43,972 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:06:48,170 INFO [train.py:996] (0/4) Epoch 8, batch 1150, loss[loss=0.2428, simple_loss=0.306, pruned_loss=0.08982, over 21836.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3108, pruned_loss=0.08319, over 4269174.15 frames. 
], batch size: 107, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:07:00,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1287672.0, ans=0.125 2023-06-22 18:07:24,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1287792.0, ans=0.035 2023-06-22 18:07:40,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=15.0 2023-06-22 18:08:09,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1287912.0, ans=0.2 2023-06-22 18:08:12,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1287912.0, ans=0.125 2023-06-22 18:08:14,321 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:08:24,627 INFO [train.py:996] (0/4) Epoch 8, batch 1200, loss[loss=0.2686, simple_loss=0.3496, pruned_loss=0.09384, over 21583.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3132, pruned_loss=0.08292, over 4272566.08 frames. ], batch size: 471, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:08:43,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 3.897e+02 4.987e+02 7.014e+02 1.089e+03, threshold=9.974e+02, percent-clipped=0.0 2023-06-22 18:08:54,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1288032.0, ans=0.125 2023-06-22 18:09:27,737 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.03 vs. limit=10.0 2023-06-22 18:10:03,645 INFO [train.py:996] (0/4) Epoch 8, batch 1250, loss[loss=0.2109, simple_loss=0.2508, pruned_loss=0.08552, over 20120.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3131, pruned_loss=0.08237, over 4273255.14 frames. ], batch size: 703, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:10:30,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1288332.0, ans=0.125 2023-06-22 18:11:01,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-22 18:11:16,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1288452.0, ans=0.125 2023-06-22 18:11:22,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1288452.0, ans=0.0 2023-06-22 18:11:41,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1288512.0, ans=0.0 2023-06-22 18:11:44,063 INFO [train.py:996] (0/4) Epoch 8, batch 1300, loss[loss=0.3056, simple_loss=0.3609, pruned_loss=0.1251, over 21802.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3163, pruned_loss=0.08349, over 4272921.43 frames. 
], batch size: 441, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:12:00,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1288572.0, ans=0.0 2023-06-22 18:12:04,933 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 4.208e+02 5.615e+02 7.044e+02 1.517e+03, threshold=1.123e+03, percent-clipped=9.0 2023-06-22 18:12:43,203 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.87 vs. limit=8.0 2023-06-22 18:12:43,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1288692.0, ans=0.125 2023-06-22 18:12:57,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1288752.0, ans=0.125 2023-06-22 18:13:12,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1288812.0, ans=0.125 2023-06-22 18:13:24,651 INFO [train.py:996] (0/4) Epoch 8, batch 1350, loss[loss=0.3114, simple_loss=0.366, pruned_loss=0.1284, over 21406.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3182, pruned_loss=0.08463, over 4281267.33 frames. ], batch size: 509, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:14:39,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1289052.0, ans=0.1 2023-06-22 18:14:55,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1289112.0, ans=0.125 2023-06-22 18:15:01,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1289112.0, ans=0.125 2023-06-22 18:15:01,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1289112.0, ans=0.0 2023-06-22 18:15:03,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1289112.0, ans=0.125 2023-06-22 18:15:05,861 INFO [train.py:996] (0/4) Epoch 8, batch 1400, loss[loss=0.23, simple_loss=0.2955, pruned_loss=0.08223, over 21717.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3159, pruned_loss=0.08476, over 4282863.56 frames. ], batch size: 332, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:15:26,930 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.742e+02 3.782e+02 4.959e+02 6.793e+02 1.586e+03, threshold=9.917e+02, percent-clipped=6.0 2023-06-22 18:16:35,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1289412.0, ans=0.125 2023-06-22 18:16:45,948 INFO [train.py:996] (0/4) Epoch 8, batch 1450, loss[loss=0.2816, simple_loss=0.3398, pruned_loss=0.1117, over 21415.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3168, pruned_loss=0.08586, over 4276452.13 frames. ], batch size: 131, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:18:04,051 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-06-22 18:18:25,453 INFO [train.py:996] (0/4) Epoch 8, batch 1500, loss[loss=0.2404, simple_loss=0.3038, pruned_loss=0.08845, over 21942.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3192, pruned_loss=0.08732, over 4283076.54 frames. 
], batch size: 118, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:18:37,229 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:18:45,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1289832.0, ans=0.1 2023-06-22 18:18:46,219 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 3.712e+02 4.836e+02 6.899e+02 1.421e+03, threshold=9.672e+02, percent-clipped=7.0 2023-06-22 18:18:52,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0 2023-06-22 18:19:16,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1289892.0, ans=0.05 2023-06-22 18:19:30,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1289952.0, ans=0.125 2023-06-22 18:19:38,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-22 18:19:41,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1289952.0, ans=0.125 2023-06-22 18:19:41,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1289952.0, ans=0.125 2023-06-22 18:20:04,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1290012.0, ans=0.0 2023-06-22 18:20:06,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1290072.0, ans=0.125 2023-06-22 18:20:07,046 INFO [train.py:996] (0/4) Epoch 8, batch 1550, loss[loss=0.1851, simple_loss=0.2741, pruned_loss=0.04802, over 21580.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3153, pruned_loss=0.0839, over 4286121.21 frames. ], batch size: 389, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:20:10,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1290072.0, ans=0.2 2023-06-22 18:21:37,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1290312.0, ans=0.0 2023-06-22 18:21:48,303 INFO [train.py:996] (0/4) Epoch 8, batch 1600, loss[loss=0.2133, simple_loss=0.2853, pruned_loss=0.0707, over 21246.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3142, pruned_loss=0.08313, over 4280393.18 frames. ], batch size: 176, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:22:02,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1290372.0, ans=0.0 2023-06-22 18:22:16,351 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.959e+02 3.911e+02 5.598e+02 7.259e+02 1.641e+03, threshold=1.120e+03, percent-clipped=8.0 2023-06-22 18:22:17,400 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.93 vs. 
limit=15.0 2023-06-22 18:22:50,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.14 vs. limit=12.0 2023-06-22 18:23:26,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1290612.0, ans=0.0 2023-06-22 18:23:36,637 INFO [train.py:996] (0/4) Epoch 8, batch 1650, loss[loss=0.2999, simple_loss=0.3689, pruned_loss=0.1155, over 21834.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3148, pruned_loss=0.08293, over 4287663.12 frames. ], batch size: 118, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:24:48,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-22 18:24:52,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1290852.0, ans=0.0 2023-06-22 18:25:17,570 INFO [train.py:996] (0/4) Epoch 8, batch 1700, loss[loss=0.2472, simple_loss=0.3346, pruned_loss=0.07988, over 21780.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3204, pruned_loss=0.08491, over 4288150.07 frames. ], batch size: 282, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:25:36,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1290972.0, ans=0.125 2023-06-22 18:25:45,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.906e+02 3.944e+02 4.852e+02 6.481e+02 1.409e+03, threshold=9.704e+02, percent-clipped=2.0 2023-06-22 18:25:53,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1291032.0, ans=0.0 2023-06-22 18:26:19,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=12.0 2023-06-22 18:26:27,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1291152.0, ans=0.125 2023-06-22 18:26:50,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1291212.0, ans=22.5 2023-06-22 18:27:04,050 INFO [train.py:996] (0/4) Epoch 8, batch 1750, loss[loss=0.156, simple_loss=0.2262, pruned_loss=0.04288, over 21387.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3164, pruned_loss=0.08165, over 4284496.32 frames. ], batch size: 131, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:28:32,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1291512.0, ans=0.05 2023-06-22 18:28:46,012 INFO [train.py:996] (0/4) Epoch 8, batch 1800, loss[loss=0.2351, simple_loss=0.3375, pruned_loss=0.06636, over 21645.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3176, pruned_loss=0.08029, over 4276246.40 frames. ], batch size: 414, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:29:04,670 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.938e+02 3.886e+02 4.989e+02 8.763e+02 2.376e+03, threshold=9.977e+02, percent-clipped=20.0 2023-06-22 18:29:05,135 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:29:20,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.54 vs. 
limit=22.5 2023-06-22 18:29:29,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1291692.0, ans=0.2 2023-06-22 18:29:33,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1291692.0, ans=0.2 2023-06-22 18:30:06,014 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-22 18:30:14,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1291812.0, ans=0.125 2023-06-22 18:30:26,422 INFO [train.py:996] (0/4) Epoch 8, batch 1850, loss[loss=0.2836, simple_loss=0.3702, pruned_loss=0.0985, over 21537.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3177, pruned_loss=0.07937, over 4273008.25 frames. ], batch size: 473, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:31:00,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1291932.0, ans=0.2 2023-06-22 18:31:13,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1291992.0, ans=0.025 2023-06-22 18:31:24,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1292052.0, ans=0.125 2023-06-22 18:32:06,609 INFO [train.py:996] (0/4) Epoch 8, batch 1900, loss[loss=0.2525, simple_loss=0.3307, pruned_loss=0.08713, over 21763.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3174, pruned_loss=0.08021, over 4277194.52 frames. ], batch size: 298, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:32:16,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1292172.0, ans=0.95 2023-06-22 18:32:24,586 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:32:25,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.699e+02 3.836e+02 4.981e+02 6.397e+02 1.530e+03, threshold=9.962e+02, percent-clipped=6.0 2023-06-22 18:32:37,380 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=15.0 2023-06-22 18:32:44,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1292292.0, ans=0.125 2023-06-22 18:32:57,112 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=15.0 2023-06-22 18:33:00,287 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.51 vs. limit=15.0 2023-06-22 18:33:46,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-22 18:33:49,934 INFO [train.py:996] (0/4) Epoch 8, batch 1950, loss[loss=0.2044, simple_loss=0.3009, pruned_loss=0.05391, over 21634.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3149, pruned_loss=0.08105, over 4283129.59 frames. 
], batch size: 263, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:33:59,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1292472.0, ans=0.1 2023-06-22 18:34:13,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1292532.0, ans=0.125 2023-06-22 18:35:01,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1292652.0, ans=0.2 2023-06-22 18:35:17,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1292712.0, ans=0.125 2023-06-22 18:35:31,203 INFO [train.py:996] (0/4) Epoch 8, batch 2000, loss[loss=0.1563, simple_loss=0.2233, pruned_loss=0.04462, over 21282.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3109, pruned_loss=0.07912, over 4285784.49 frames. ], batch size: 131, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:35:40,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-22 18:35:54,810 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 4.321e+02 6.258e+02 9.587e+02 1.701e+03, threshold=1.252e+03, percent-clipped=22.0 2023-06-22 18:35:56,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1292832.0, ans=0.0 2023-06-22 18:36:37,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1292892.0, ans=0.125 2023-06-22 18:37:13,664 INFO [train.py:996] (0/4) Epoch 8, batch 2050, loss[loss=0.2327, simple_loss=0.3002, pruned_loss=0.08261, over 21436.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3134, pruned_loss=0.07878, over 4281073.09 frames. ], batch size: 144, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:37:21,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1293072.0, ans=0.0 2023-06-22 18:37:44,062 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:38:43,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1293312.0, ans=0.125 2023-06-22 18:38:45,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1293312.0, ans=0.0 2023-06-22 18:38:48,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2023-06-22 18:38:53,106 INFO [train.py:996] (0/4) Epoch 8, batch 2100, loss[loss=0.3332, simple_loss=0.3653, pruned_loss=0.1506, over 21389.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3182, pruned_loss=0.08211, over 4276322.24 frames. 
], batch size: 507, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:38:56,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1293372.0, ans=0.02 2023-06-22 18:39:17,078 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.801e+02 4.056e+02 5.317e+02 7.512e+02 1.644e+03, threshold=1.063e+03, percent-clipped=5.0 2023-06-22 18:39:33,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1293492.0, ans=0.0 2023-06-22 18:40:33,834 INFO [train.py:996] (0/4) Epoch 8, batch 2150, loss[loss=0.236, simple_loss=0.2966, pruned_loss=0.08768, over 21242.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3155, pruned_loss=0.08335, over 4271417.32 frames. ], batch size: 143, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:40:44,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1293672.0, ans=0.0 2023-06-22 18:41:06,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1293732.0, ans=0.0 2023-06-22 18:41:15,540 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-06-22 18:41:45,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1293852.0, ans=0.125 2023-06-22 18:41:58,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1293912.0, ans=0.125 2023-06-22 18:42:09,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1293972.0, ans=0.0 2023-06-22 18:42:10,366 INFO [train.py:996] (0/4) Epoch 8, batch 2200, loss[loss=0.1924, simple_loss=0.2637, pruned_loss=0.0606, over 21152.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.318, pruned_loss=0.08424, over 4279978.48 frames. ], batch size: 143, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:42:33,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.009e+02 3.898e+02 4.994e+02 6.578e+02 1.550e+03, threshold=9.987e+02, percent-clipped=10.0 2023-06-22 18:43:27,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1294152.0, ans=0.025 2023-06-22 18:43:39,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1294212.0, ans=0.1 2023-06-22 18:43:40,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1294212.0, ans=0.125 2023-06-22 18:43:50,434 INFO [train.py:996] (0/4) Epoch 8, batch 2250, loss[loss=0.2524, simple_loss=0.3708, pruned_loss=0.06704, over 21214.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3156, pruned_loss=0.08296, over 4284766.20 frames. ], batch size: 549, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:44:25,194 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.00 vs. 
limit=15.0 2023-06-22 18:44:26,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=1294332.0, ans=0.2 2023-06-22 18:45:09,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1294452.0, ans=0.05 2023-06-22 18:45:29,779 INFO [train.py:996] (0/4) Epoch 8, batch 2300, loss[loss=0.2227, simple_loss=0.2809, pruned_loss=0.08218, over 21116.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3109, pruned_loss=0.08231, over 4275566.05 frames. ], batch size: 159, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:45:53,604 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.836e+02 4.015e+02 5.277e+02 7.353e+02 1.540e+03, threshold=1.055e+03, percent-clipped=5.0 2023-06-22 18:46:09,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-22 18:46:10,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1294692.0, ans=0.125 2023-06-22 18:46:36,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1294692.0, ans=0.0 2023-06-22 18:46:36,686 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=22.5 2023-06-22 18:46:47,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=15.0 2023-06-22 18:47:11,808 INFO [train.py:996] (0/4) Epoch 8, batch 2350, loss[loss=0.308, simple_loss=0.3622, pruned_loss=0.1269, over 21404.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3098, pruned_loss=0.0826, over 4275041.19 frames. ], batch size: 471, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:47:47,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-22 18:48:40,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-22 18:48:53,778 INFO [train.py:996] (0/4) Epoch 8, batch 2400, loss[loss=0.269, simple_loss=0.3429, pruned_loss=0.09751, over 21608.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3136, pruned_loss=0.08407, over 4274107.36 frames. ], batch size: 415, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:48:59,419 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:49:07,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1295172.0, ans=0.125 2023-06-22 18:49:18,696 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.51 vs. 
limit=15.0 2023-06-22 18:49:18,981 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.938e+02 4.939e+02 6.884e+02 8.991e+02 1.831e+03, threshold=1.377e+03, percent-clipped=16.0 2023-06-22 18:49:41,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1295292.0, ans=0.125 2023-06-22 18:49:45,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1295292.0, ans=0.2 2023-06-22 18:50:14,309 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=12.0 2023-06-22 18:50:35,319 INFO [train.py:996] (0/4) Epoch 8, batch 2450, loss[loss=0.2231, simple_loss=0.2973, pruned_loss=0.0744, over 21611.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3177, pruned_loss=0.0857, over 4277395.05 frames. ], batch size: 212, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:50:55,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1295532.0, ans=0.07 2023-06-22 18:51:13,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1295532.0, ans=0.125 2023-06-22 18:51:23,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-06-22 18:52:00,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1295712.0, ans=10.0 2023-06-22 18:52:15,547 INFO [train.py:996] (0/4) Epoch 8, batch 2500, loss[loss=0.2412, simple_loss=0.3046, pruned_loss=0.08894, over 22017.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3157, pruned_loss=0.08506, over 4278272.93 frames. ], batch size: 103, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:52:16,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5 2023-06-22 18:52:34,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.052e+02 4.409e+02 5.815e+02 8.522e+02 2.143e+03, threshold=1.163e+03, percent-clipped=4.0 2023-06-22 18:53:11,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1295892.0, ans=0.0 2023-06-22 18:53:13,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1295892.0, ans=0.0 2023-06-22 18:53:17,268 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.57 vs. 
limit=22.5 2023-06-22 18:53:18,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1295952.0, ans=0.125 2023-06-22 18:53:30,068 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-216000.pt 2023-06-22 18:53:43,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1296012.0, ans=0.125 2023-06-22 18:53:54,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1296012.0, ans=0.2 2023-06-22 18:53:57,213 INFO [train.py:996] (0/4) Epoch 8, batch 2550, loss[loss=0.2659, simple_loss=0.3201, pruned_loss=0.1059, over 21450.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3139, pruned_loss=0.08442, over 4271205.01 frames. ], batch size: 389, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:54:59,299 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=12.0 2023-06-22 18:55:04,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1296252.0, ans=0.02 2023-06-22 18:55:37,976 INFO [train.py:996] (0/4) Epoch 8, batch 2600, loss[loss=0.2422, simple_loss=0.3315, pruned_loss=0.07644, over 21449.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3139, pruned_loss=0.08588, over 4275146.44 frames. ], batch size: 211, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:55:48,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1296372.0, ans=0.125 2023-06-22 18:55:52,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1296432.0, ans=0.1 2023-06-22 18:55:57,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.832e+02 4.000e+02 4.999e+02 6.903e+02 1.017e+03, threshold=9.998e+02, percent-clipped=0.0 2023-06-22 18:56:13,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=15.0 2023-06-22 18:56:17,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1296492.0, ans=0.0 2023-06-22 18:56:24,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1296492.0, ans=0.125 2023-06-22 18:56:55,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0 2023-06-22 18:57:14,611 INFO [train.py:996] (0/4) Epoch 8, batch 2650, loss[loss=0.2717, simple_loss=0.3514, pruned_loss=0.09599, over 21689.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3147, pruned_loss=0.08597, over 4274242.32 frames. ], batch size: 389, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:57:31,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1296732.0, ans=0.125 2023-06-22 18:57:33,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. 
limit=15.0 2023-06-22 18:57:34,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1296732.0, ans=0.125 2023-06-22 18:57:39,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1296732.0, ans=0.1 2023-06-22 18:58:04,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1296792.0, ans=0.0 2023-06-22 18:58:12,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1296792.0, ans=0.125 2023-06-22 18:58:55,302 INFO [train.py:996] (0/4) Epoch 8, batch 2700, loss[loss=0.2197, simple_loss=0.2911, pruned_loss=0.07415, over 21676.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.313, pruned_loss=0.08511, over 4280691.55 frames. ], batch size: 298, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:58:58,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1296972.0, ans=0.125 2023-06-22 18:58:58,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1296972.0, ans=0.125 2023-06-22 18:59:14,443 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.148e+02 4.384e+02 5.256e+02 7.143e+02 1.333e+03, threshold=1.051e+03, percent-clipped=8.0 2023-06-22 18:59:18,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1297032.0, ans=0.0 2023-06-22 18:59:20,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1297032.0, ans=0.125 2023-06-22 19:00:37,160 INFO [train.py:996] (0/4) Epoch 8, batch 2750, loss[loss=0.2844, simple_loss=0.3481, pruned_loss=0.1103, over 21341.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3138, pruned_loss=0.08621, over 4277081.32 frames. ], batch size: 143, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 19:00:50,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1297272.0, ans=0.2 2023-06-22 19:02:18,177 INFO [train.py:996] (0/4) Epoch 8, batch 2800, loss[loss=0.2171, simple_loss=0.3441, pruned_loss=0.04506, over 19764.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3168, pruned_loss=0.08605, over 4278908.61 frames. ], batch size: 702, lr: 3.83e-03, grad_scale: 32.0 2023-06-22 19:02:56,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.162e+02 4.531e+02 5.975e+02 9.124e+02 1.757e+03, threshold=1.195e+03, percent-clipped=17.0 2023-06-22 19:03:36,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1297752.0, ans=0.2 2023-06-22 19:03:49,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1297812.0, ans=0.125 2023-06-22 19:04:01,708 INFO [train.py:996] (0/4) Epoch 8, batch 2850, loss[loss=0.2876, simple_loss=0.3504, pruned_loss=0.1124, over 21755.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3202, pruned_loss=0.08787, over 4283813.75 frames. ], batch size: 441, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:04:48,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.83 vs. 
limit=15.0 2023-06-22 19:04:52,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1297992.0, ans=0.125 2023-06-22 19:05:19,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1298112.0, ans=0.125 2023-06-22 19:05:29,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1298112.0, ans=0.125 2023-06-22 19:05:36,702 INFO [train.py:996] (0/4) Epoch 8, batch 2900, loss[loss=0.1817, simple_loss=0.2393, pruned_loss=0.0621, over 21194.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3157, pruned_loss=0.0865, over 4281768.99 frames. ], batch size: 176, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:06:04,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1298172.0, ans=0.1 2023-06-22 19:06:12,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.773e+02 4.241e+02 6.077e+02 8.455e+02 1.821e+03, threshold=1.215e+03, percent-clipped=6.0 2023-06-22 19:07:14,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1298472.0, ans=0.125 2023-06-22 19:07:15,536 INFO [train.py:996] (0/4) Epoch 8, batch 2950, loss[loss=0.2539, simple_loss=0.3568, pruned_loss=0.07546, over 20847.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3183, pruned_loss=0.08645, over 4283041.06 frames. ], batch size: 607, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:07:17,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1298472.0, ans=0.1 2023-06-22 19:08:03,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1298592.0, ans=0.0 2023-06-22 19:08:52,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1298712.0, ans=0.125 2023-06-22 19:08:56,490 INFO [train.py:996] (0/4) Epoch 8, batch 3000, loss[loss=0.2597, simple_loss=0.316, pruned_loss=0.1017, over 21419.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3242, pruned_loss=0.088, over 4283898.26 frames. ], batch size: 211, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:08:56,491 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 19:09:09,047 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.2607, 2.3712, 2.2395, 3.2849], device='cuda:0') 2023-06-22 19:09:17,907 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2518, simple_loss=0.3464, pruned_loss=0.0786, over 1796401.00 frames. 
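The recurring `optim.py:471` entries in this section (`Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=...`) report five summary statistics of recent gradient norms, apparently the minimum, lower quartile, median, upper quartile, and maximum. In every such entry here the logged threshold equals `Clipping_scale` times the middle value, e.g. 2.0 × 4.852e+02 = 9.704e+02 and 2.0 × 6.077e+02 ≈ 1.215e+03, so the clipping threshold tracks the median gradient norm. The sketch below reproduces these statistics from a window of per-batch norms; the window length, the exact percentile convention, and the reading of percent-clipped as the share of batches in the window whose norm exceeded the threshold are assumptions for illustration, not the icefall implementation.

```python
import torch

def grad_clip_stats(grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    """Summarize a window of per-batch gradient norms like the log entries above.

    grad_norms: 1-D float tensor of recent per-batch gradient norms (assumed window).
    Returns (quartiles, threshold, percent_clipped).
    """
    # min / 25% / 50% / 75% / max of the recent gradient norms
    quartiles = torch.quantile(grad_norms,
                               torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    # threshold = clipping_scale * median, matching the logged values
    threshold = clipping_scale * quartiles[2]
    # assumed meaning of "percent-clipped": share of batches above the threshold
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return quartiles, threshold, percent_clipped

# Hypothetical norms in the same range as the entries above:
norms = torch.tensor([290.6, 394.4, 485.2, 648.1, 1409.0] * 10)
q, thr, pct = grad_clip_stats(norms)
print(q.tolist(), float(thr), float(pct))
```

Tying the threshold to a running median rather than a fixed constant keeps clipping adaptive: as typical gradient norms fall over training, the threshold falls with them, and only outlier batches (the small percent-clipped figures above) are actually clipped.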
2023-06-22 19:09:17,909 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-22 19:09:40,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.798e+02 4.528e+02 5.625e+02 8.163e+02 1.642e+03, threshold=1.125e+03, percent-clipped=6.0 2023-06-22 19:09:44,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1298832.0, ans=0.2 2023-06-22 19:10:01,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1298892.0, ans=0.0 2023-06-22 19:10:07,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-22 19:10:25,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1299012.0, ans=0.125 2023-06-22 19:10:38,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1299012.0, ans=0.04949747468305833 2023-06-22 19:10:58,784 INFO [train.py:996] (0/4) Epoch 8, batch 3050, loss[loss=0.1872, simple_loss=0.2634, pruned_loss=0.05548, over 21410.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.321, pruned_loss=0.08579, over 4283048.96 frames. ], batch size: 194, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:10:59,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1299072.0, ans=0.1 2023-06-22 19:11:36,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-22 19:11:43,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1299192.0, ans=0.0 2023-06-22 19:12:29,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1299312.0, ans=0.125 2023-06-22 19:12:38,550 INFO [train.py:996] (0/4) Epoch 8, batch 3100, loss[loss=0.218, simple_loss=0.3076, pruned_loss=0.06423, over 21826.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3185, pruned_loss=0.08317, over 4284540.60 frames. ], batch size: 282, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:13:05,475 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.778e+02 3.934e+02 5.608e+02 7.913e+02 1.726e+03, threshold=1.122e+03, percent-clipped=9.0 2023-06-22 19:13:53,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1299552.0, ans=0.2 2023-06-22 19:14:08,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-22 19:14:18,783 INFO [train.py:996] (0/4) Epoch 8, batch 3150, loss[loss=0.3508, simple_loss=0.4005, pruned_loss=0.1506, over 21386.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3228, pruned_loss=0.08545, over 4284500.89 frames. 
], batch size: 509, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:15:31,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1299852.0, ans=0.125 2023-06-22 19:15:46,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1299912.0, ans=0.1 2023-06-22 19:16:06,916 INFO [train.py:996] (0/4) Epoch 8, batch 3200, loss[loss=0.2522, simple_loss=0.3303, pruned_loss=0.0871, over 21778.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3229, pruned_loss=0.08512, over 4284003.33 frames. ], batch size: 124, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:16:19,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-06-22 19:16:24,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.68 vs. limit=15.0 2023-06-22 19:16:28,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1300032.0, ans=0.0 2023-06-22 19:16:29,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.108e+02 3.882e+02 4.322e+02 5.833e+02 1.816e+03, threshold=8.643e+02, percent-clipped=1.0 2023-06-22 19:17:13,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1300152.0, ans=0.0 2023-06-22 19:17:46,241 INFO [train.py:996] (0/4) Epoch 8, batch 3250, loss[loss=0.2447, simple_loss=0.3089, pruned_loss=0.0902, over 21585.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3235, pruned_loss=0.08678, over 4284633.71 frames. ], batch size: 230, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:18:07,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1300332.0, ans=0.1 2023-06-22 19:18:38,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1300392.0, ans=0.0 2023-06-22 19:19:25,822 INFO [train.py:996] (0/4) Epoch 8, batch 3300, loss[loss=0.2148, simple_loss=0.2949, pruned_loss=0.06735, over 21570.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3189, pruned_loss=0.08631, over 4287307.75 frames. 
], batch size: 230, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:19:30,933 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:19:34,568 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:19:42,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1300632.0, ans=0.125 2023-06-22 19:19:48,437 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 4.529e+02 6.033e+02 9.657e+02 1.783e+03, threshold=1.207e+03, percent-clipped=28.0 2023-06-22 19:20:06,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1300692.0, ans=0.125 2023-06-22 19:20:37,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1300752.0, ans=0.125 2023-06-22 19:21:01,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1300812.0, ans=0.125 2023-06-22 19:21:04,783 INFO [train.py:996] (0/4) Epoch 8, batch 3350, loss[loss=0.2399, simple_loss=0.3187, pruned_loss=0.08061, over 21483.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3204, pruned_loss=0.08739, over 4284039.03 frames. ], batch size: 131, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:21:05,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1300872.0, ans=0.2 2023-06-22 19:21:10,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1300872.0, ans=0.0 2023-06-22 19:21:41,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1300932.0, ans=0.125 2023-06-22 19:21:52,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1300992.0, ans=0.0 2023-06-22 19:22:11,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1301052.0, ans=0.5 2023-06-22 19:22:39,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1301112.0, ans=0.125 2023-06-22 19:22:43,293 INFO [train.py:996] (0/4) Epoch 8, batch 3400, loss[loss=0.2008, simple_loss=0.2834, pruned_loss=0.05904, over 21443.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3192, pruned_loss=0.08733, over 4288773.63 frames. ], batch size: 211, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:22:50,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1301172.0, ans=0.0 2023-06-22 19:23:16,076 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.036e+02 4.257e+02 5.465e+02 6.871e+02 1.586e+03, threshold=1.093e+03, percent-clipped=5.0 2023-06-22 19:24:10,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1301412.0, ans=0.0 2023-06-22 19:24:24,330 INFO [train.py:996] (0/4) Epoch 8, batch 3450, loss[loss=0.2741, simple_loss=0.3298, pruned_loss=0.1092, over 21576.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3153, pruned_loss=0.0869, over 4275001.20 frames. 
], batch size: 548, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:24:49,681 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-22 19:25:00,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1301532.0, ans=0.0 2023-06-22 19:25:02,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1301532.0, ans=0.1 2023-06-22 19:25:55,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.00 vs. limit=22.5 2023-06-22 19:26:09,215 INFO [train.py:996] (0/4) Epoch 8, batch 3500, loss[loss=0.2988, simple_loss=0.3625, pruned_loss=0.1176, over 21233.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3224, pruned_loss=0.0894, over 4269826.61 frames. ], batch size: 143, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:26:28,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-06-22 19:26:30,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1301832.0, ans=0.0 2023-06-22 19:26:36,801 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.220e+02 4.832e+02 6.635e+02 8.517e+02 1.814e+03, threshold=1.327e+03, percent-clipped=16.0 2023-06-22 19:27:37,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1302012.0, ans=0.125 2023-06-22 19:27:39,429 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2023-06-22 19:27:42,651 INFO [train.py:996] (0/4) Epoch 8, batch 3550, loss[loss=0.227, simple_loss=0.2986, pruned_loss=0.07764, over 21847.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3257, pruned_loss=0.09092, over 4274271.83 frames. ], batch size: 372, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:28:04,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1302132.0, ans=0.125 2023-06-22 19:28:14,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1302132.0, ans=0.04949747468305833 2023-06-22 19:28:42,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1302252.0, ans=0.125 2023-06-22 19:28:53,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1302252.0, ans=0.125 2023-06-22 19:29:21,130 INFO [train.py:996] (0/4) Epoch 8, batch 3600, loss[loss=0.2243, simple_loss=0.33, pruned_loss=0.05933, over 20799.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3216, pruned_loss=0.09077, over 4271968.50 frames. 
], batch size: 607, lr: 3.83e-03, grad_scale: 32.0 2023-06-22 19:29:30,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1302372.0, ans=0.035 2023-06-22 19:29:35,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1302372.0, ans=0.125 2023-06-22 19:29:44,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1302432.0, ans=0.0 2023-06-22 19:29:48,395 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.074e+02 4.388e+02 6.270e+02 8.797e+02 1.377e+03, threshold=1.254e+03, percent-clipped=1.0 2023-06-22 19:30:14,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1302492.0, ans=0.125 2023-06-22 19:30:26,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1302552.0, ans=0.025 2023-06-22 19:30:54,571 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.37 vs. limit=15.0 2023-06-22 19:30:59,847 INFO [train.py:996] (0/4) Epoch 8, batch 3650, loss[loss=0.2362, simple_loss=0.3071, pruned_loss=0.08268, over 21434.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3215, pruned_loss=0.09014, over 4273475.96 frames. ], batch size: 131, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:31:15,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1302672.0, ans=0.0 2023-06-22 19:31:26,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0 2023-06-22 19:31:56,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1302852.0, ans=22.5 2023-06-22 19:32:14,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-22 19:32:37,367 INFO [train.py:996] (0/4) Epoch 8, batch 3700, loss[loss=0.2622, simple_loss=0.3278, pruned_loss=0.09828, over 22034.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3203, pruned_loss=0.08993, over 4279918.45 frames. ], batch size: 119, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:33:03,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1303032.0, ans=0.125 2023-06-22 19:33:05,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.810e+02 4.107e+02 5.215e+02 7.517e+02 1.439e+03, threshold=1.043e+03, percent-clipped=3.0 2023-06-22 19:33:16,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1303032.0, ans=0.2 2023-06-22 19:33:30,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1303092.0, ans=0.125 2023-06-22 19:33:42,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1303152.0, ans=0.0 2023-06-22 19:34:16,538 INFO [train.py:996] (0/4) Epoch 8, batch 3750, loss[loss=0.2698, simple_loss=0.3347, pruned_loss=0.1025, over 21750.00 frames. 
], tot_loss[loss=0.2484, simple_loss=0.3184, pruned_loss=0.08923, over 4289066.97 frames. ], batch size: 441, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:35:04,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1303392.0, ans=0.0 2023-06-22 19:35:08,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1303392.0, ans=0.125 2023-06-22 19:35:29,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1303452.0, ans=0.125 2023-06-22 19:36:00,343 INFO [train.py:996] (0/4) Epoch 8, batch 3800, loss[loss=0.2869, simple_loss=0.3496, pruned_loss=0.1121, over 21536.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3175, pruned_loss=0.0879, over 4288682.73 frames. ], batch size: 473, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:36:17,205 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.90 vs. limit=15.0 2023-06-22 19:36:28,078 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.955e+02 4.734e+02 6.125e+02 7.875e+02 1.546e+03, threshold=1.225e+03, percent-clipped=6.0 2023-06-22 19:36:38,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1303692.0, ans=0.2 2023-06-22 19:37:01,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1303752.0, ans=0.2 2023-06-22 19:37:07,865 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-22 19:37:35,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1303812.0, ans=15.0 2023-06-22 19:37:37,343 INFO [train.py:996] (0/4) Epoch 8, batch 3850, loss[loss=0.2569, simple_loss=0.328, pruned_loss=0.09293, over 21403.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3153, pruned_loss=0.0883, over 4294229.30 frames. ], batch size: 549, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:37:42,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1303872.0, ans=0.2 2023-06-22 19:37:42,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. limit=10.0 2023-06-22 19:37:55,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1303872.0, ans=0.0 2023-06-22 19:38:09,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=12.0 2023-06-22 19:38:10,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1303932.0, ans=0.125 2023-06-22 19:38:24,286 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. 
limit=15.0 2023-06-22 19:38:41,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1304052.0, ans=0.2 2023-06-22 19:38:49,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1304112.0, ans=0.125 2023-06-22 19:39:16,047 INFO [train.py:996] (0/4) Epoch 8, batch 3900, loss[loss=0.2965, simple_loss=0.3901, pruned_loss=0.1015, over 19740.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3116, pruned_loss=0.08745, over 4288107.69 frames. ], batch size: 702, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:39:22,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1304172.0, ans=0.125 2023-06-22 19:39:45,205 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.973e+02 4.645e+02 5.915e+02 7.788e+02 1.896e+03, threshold=1.183e+03, percent-clipped=6.0 2023-06-22 19:40:56,655 INFO [train.py:996] (0/4) Epoch 8, batch 3950, loss[loss=0.2664, simple_loss=0.3598, pruned_loss=0.08648, over 21623.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.314, pruned_loss=0.08624, over 4286615.46 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:41:14,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1304472.0, ans=0.0 2023-06-22 19:41:53,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1304652.0, ans=0.125 2023-06-22 19:42:25,731 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:42:36,370 INFO [train.py:996] (0/4) Epoch 8, batch 4000, loss[loss=0.2054, simple_loss=0.2741, pruned_loss=0.06833, over 21827.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3087, pruned_loss=0.08318, over 4275407.51 frames. ], batch size: 98, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 19:43:02,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1304832.0, ans=0.0 2023-06-22 19:43:05,272 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.962e+02 4.102e+02 5.710e+02 7.605e+02 1.219e+03, threshold=1.142e+03, percent-clipped=1.0 2023-06-22 19:44:21,122 INFO [train.py:996] (0/4) Epoch 8, batch 4050, loss[loss=0.1952, simple_loss=0.2957, pruned_loss=0.04739, over 21755.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3082, pruned_loss=0.08146, over 4279652.81 frames. ], batch size: 332, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 19:44:36,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1305132.0, ans=10.0 2023-06-22 19:45:04,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1305192.0, ans=0.2 2023-06-22 19:45:05,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1305192.0, ans=0.0 2023-06-22 19:45:23,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1305252.0, ans=0.2 2023-06-22 19:46:00,755 INFO [train.py:996] (0/4) Epoch 8, batch 4100, loss[loss=0.2402, simple_loss=0.3125, pruned_loss=0.08393, over 21392.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3072, pruned_loss=0.08093, over 4280878.85 frames. 
], batch size: 131, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 19:46:26,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.689e+02 3.586e+02 4.759e+02 6.058e+02 1.628e+03, threshold=9.517e+02, percent-clipped=6.0 2023-06-22 19:46:32,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-22 19:46:44,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1305492.0, ans=0.125 2023-06-22 19:47:16,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1305552.0, ans=0.1 2023-06-22 19:47:39,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1305672.0, ans=0.125 2023-06-22 19:47:40,291 INFO [train.py:996] (0/4) Epoch 8, batch 4150, loss[loss=0.2575, simple_loss=0.3224, pruned_loss=0.09625, over 21632.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3077, pruned_loss=0.07793, over 4275334.18 frames. ], batch size: 263, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:47:47,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1305672.0, ans=0.125 2023-06-22 19:48:11,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1305732.0, ans=0.125 2023-06-22 19:48:39,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1305852.0, ans=0.125 2023-06-22 19:49:23,592 INFO [train.py:996] (0/4) Epoch 8, batch 4200, loss[loss=0.2728, simple_loss=0.3649, pruned_loss=0.0903, over 21849.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3087, pruned_loss=0.07837, over 4259467.74 frames. ], batch size: 372, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:49:57,475 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.597e+02 4.423e+02 6.282e+02 9.309e+02 2.210e+03, threshold=1.256e+03, percent-clipped=22.0 2023-06-22 19:50:16,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.56 vs. limit=22.5 2023-06-22 19:51:06,578 INFO [train.py:996] (0/4) Epoch 8, batch 4250, loss[loss=0.2669, simple_loss=0.3882, pruned_loss=0.0728, over 21210.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3173, pruned_loss=0.08066, over 4257737.33 frames. ], batch size: 549, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:51:29,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1306332.0, ans=0.125 2023-06-22 19:52:06,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1306392.0, ans=0.0 2023-06-22 19:52:39,609 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-06-22 19:52:55,190 INFO [train.py:996] (0/4) Epoch 8, batch 4300, loss[loss=0.2987, simple_loss=0.3802, pruned_loss=0.1086, over 20739.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3231, pruned_loss=0.08278, over 4252248.88 frames. 
], batch size: 607, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:53:38,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.109e+02 4.490e+02 6.473e+02 1.024e+03 2.368e+03, threshold=1.295e+03, percent-clipped=12.0 2023-06-22 19:53:47,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1306692.0, ans=0.025 2023-06-22 19:54:02,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1306752.0, ans=0.95 2023-06-22 19:54:08,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1306752.0, ans=0.0 2023-06-22 19:54:35,552 INFO [train.py:996] (0/4) Epoch 8, batch 4350, loss[loss=0.2235, simple_loss=0.2796, pruned_loss=0.08372, over 21404.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3206, pruned_loss=0.08173, over 4248247.35 frames. ], batch size: 131, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:55:06,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1306932.0, ans=0.0 2023-06-22 19:55:15,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1306932.0, ans=0.0 2023-06-22 19:56:01,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1307112.0, ans=0.125 2023-06-22 19:56:15,891 INFO [train.py:996] (0/4) Epoch 8, batch 4400, loss[loss=0.272, simple_loss=0.3445, pruned_loss=0.09969, over 20727.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3153, pruned_loss=0.08195, over 4247402.94 frames. ], batch size: 608, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:56:26,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1307172.0, ans=0.0 2023-06-22 19:56:33,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1307172.0, ans=0.125 2023-06-22 19:56:33,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1307172.0, ans=0.05 2023-06-22 19:56:43,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1307232.0, ans=0.125 2023-06-22 19:56:53,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.962e+02 4.391e+02 5.978e+02 7.745e+02 1.639e+03, threshold=1.196e+03, percent-clipped=7.0 2023-06-22 19:56:56,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1307232.0, ans=0.125 2023-06-22 19:57:07,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1307292.0, ans=0.125 2023-06-22 19:57:35,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1307412.0, ans=0.125 2023-06-22 19:57:37,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1307412.0, ans=0.125 2023-06-22 19:58:02,009 INFO [train.py:996] (0/4) Epoch 8, batch 4450, loss[loss=0.3318, simple_loss=0.4119, pruned_loss=0.1258, over 21630.00 frames. 
], tot_loss[loss=0.2482, simple_loss=0.327, pruned_loss=0.08471, over 4258970.20 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:59:06,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1307652.0, ans=0.125 2023-06-22 19:59:14,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1307652.0, ans=0.0 2023-06-22 19:59:41,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1307712.0, ans=0.1 2023-06-22 19:59:48,284 INFO [train.py:996] (0/4) Epoch 8, batch 4500, loss[loss=0.2547, simple_loss=0.3119, pruned_loss=0.09878, over 20220.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3267, pruned_loss=0.08619, over 4266121.33 frames. ], batch size: 702, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:00:06,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-22 20:00:14,656 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.895e+02 4.151e+02 5.436e+02 7.450e+02 1.876e+03, threshold=1.087e+03, percent-clipped=7.0 2023-06-22 20:00:40,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1307952.0, ans=0.125 2023-06-22 20:01:21,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1308012.0, ans=0.125 2023-06-22 20:01:28,224 INFO [train.py:996] (0/4) Epoch 8, batch 4550, loss[loss=0.2518, simple_loss=0.334, pruned_loss=0.08487, over 21757.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3282, pruned_loss=0.086, over 4269752.69 frames. ], batch size: 332, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:01:37,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1308072.0, ans=0.04949747468305833 2023-06-22 20:02:14,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=15.0 2023-06-22 20:02:34,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-22 20:02:53,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1308312.0, ans=0.125 2023-06-22 20:03:08,600 INFO [train.py:996] (0/4) Epoch 8, batch 4600, loss[loss=0.2218, simple_loss=0.2955, pruned_loss=0.07407, over 21796.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3311, pruned_loss=0.08827, over 4272372.29 frames. 
], batch size: 247, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:03:28,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1308432.0, ans=0.2 2023-06-22 20:03:40,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.710e+02 4.165e+02 5.279e+02 6.740e+02 1.716e+03, threshold=1.056e+03, percent-clipped=6.0 2023-06-22 20:03:57,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1308492.0, ans=0.125 2023-06-22 20:04:05,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1308552.0, ans=0.0 2023-06-22 20:04:47,312 INFO [train.py:996] (0/4) Epoch 8, batch 4650, loss[loss=0.1738, simple_loss=0.2436, pruned_loss=0.05204, over 21244.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3244, pruned_loss=0.08591, over 4277315.21 frames. ], batch size: 176, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:04:59,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1308672.0, ans=0.125 2023-06-22 20:05:30,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1308792.0, ans=0.2 2023-06-22 20:06:21,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1308972.0, ans=0.2 2023-06-22 20:06:22,489 INFO [train.py:996] (0/4) Epoch 8, batch 4700, loss[loss=0.2013, simple_loss=0.2687, pruned_loss=0.06696, over 21824.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3129, pruned_loss=0.08259, over 4265229.88 frames. ], batch size: 107, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:06:36,942 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:06:44,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1309032.0, ans=0.125 2023-06-22 20:06:50,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1309032.0, ans=0.125 2023-06-22 20:06:54,233 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.434e+02 3.523e+02 4.183e+02 5.885e+02 1.412e+03, threshold=8.365e+02, percent-clipped=3.0 2023-06-22 20:08:02,207 INFO [train.py:996] (0/4) Epoch 8, batch 4750, loss[loss=0.2333, simple_loss=0.3032, pruned_loss=0.08175, over 21356.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3072, pruned_loss=0.08237, over 4261003.80 frames. 
], batch size: 131, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:08:11,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1309272.0, ans=0.0 2023-06-22 20:08:17,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1309332.0, ans=0.125 2023-06-22 20:09:23,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1309452.0, ans=0.125 2023-06-22 20:09:39,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1309512.0, ans=0.125 2023-06-22 20:09:42,410 INFO [train.py:996] (0/4) Epoch 8, batch 4800, loss[loss=0.2726, simple_loss=0.3696, pruned_loss=0.0878, over 21686.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3092, pruned_loss=0.08357, over 4273060.42 frames. ], batch size: 414, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:09:49,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1309572.0, ans=0.1 2023-06-22 20:10:14,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.048e+02 4.179e+02 5.198e+02 6.996e+02 1.429e+03, threshold=1.040e+03, percent-clipped=10.0 2023-06-22 20:10:18,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1309692.0, ans=0.2 2023-06-22 20:10:25,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1309692.0, ans=0.1 2023-06-22 20:10:32,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1309692.0, ans=0.025 2023-06-22 20:10:34,755 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=15.0 2023-06-22 20:11:13,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1309812.0, ans=0.0 2023-06-22 20:11:21,066 INFO [train.py:996] (0/4) Epoch 8, batch 4850, loss[loss=0.2422, simple_loss=0.3143, pruned_loss=0.08504, over 21858.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3093, pruned_loss=0.08332, over 4275424.83 frames. ], batch size: 371, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:11:39,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1309872.0, ans=0.125 2023-06-22 20:11:52,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1309932.0, ans=0.125 2023-06-22 20:12:33,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1310052.0, ans=0.125 2023-06-22 20:12:54,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1310112.0, ans=0.0 2023-06-22 20:12:59,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1310172.0, ans=0.125 2023-06-22 20:13:00,640 INFO [train.py:996] (0/4) Epoch 8, batch 4900, loss[loss=0.2414, simple_loss=0.3259, pruned_loss=0.07848, over 21400.00 frames. 
], tot_loss[loss=0.2393, simple_loss=0.3106, pruned_loss=0.08398, over 4281654.53 frames. ], batch size: 548, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:13:32,434 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.010e+02 3.969e+02 5.003e+02 6.919e+02 1.603e+03, threshold=1.001e+03, percent-clipped=6.0 2023-06-22 20:13:46,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1310292.0, ans=0.035 2023-06-22 20:14:10,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1310352.0, ans=0.2 2023-06-22 20:14:14,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1310352.0, ans=0.0 2023-06-22 20:14:40,875 INFO [train.py:996] (0/4) Epoch 8, batch 4950, loss[loss=0.202, simple_loss=0.3033, pruned_loss=0.05034, over 21565.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3158, pruned_loss=0.08221, over 4280506.54 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:14:41,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1310472.0, ans=0.125 2023-06-22 20:15:04,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1310532.0, ans=0.125 2023-06-22 20:15:04,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1310532.0, ans=0.125 2023-06-22 20:15:23,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1310592.0, ans=0.125 2023-06-22 20:16:25,023 INFO [train.py:996] (0/4) Epoch 8, batch 5000, loss[loss=0.2435, simple_loss=0.3598, pruned_loss=0.06363, over 20688.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3151, pruned_loss=0.07873, over 4278873.55 frames. ], batch size: 607, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:16:30,988 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0 2023-06-22 20:16:51,894 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.498e+02 3.608e+02 4.622e+02 7.271e+02 1.664e+03, threshold=9.243e+02, percent-clipped=6.0 2023-06-22 20:17:30,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1310952.0, ans=0.2 2023-06-22 20:17:33,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1310952.0, ans=0.2 2023-06-22 20:17:53,511 INFO [train.py:996] (0/4) Epoch 8, batch 5050, loss[loss=0.2783, simple_loss=0.3369, pruned_loss=0.1099, over 21872.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3135, pruned_loss=0.08031, over 4281232.04 frames. 
], batch size: 118, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:18:11,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1311072.0, ans=0.125 2023-06-22 20:18:13,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1311132.0, ans=6.0 2023-06-22 20:19:09,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1311252.0, ans=0.5 2023-06-22 20:19:28,965 INFO [train.py:996] (0/4) Epoch 8, batch 5100, loss[loss=0.2099, simple_loss=0.2796, pruned_loss=0.07006, over 21874.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3116, pruned_loss=0.08099, over 4289654.85 frames. ], batch size: 107, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:20:02,315 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.036e+02 3.868e+02 4.783e+02 6.520e+02 1.021e+03, threshold=9.567e+02, percent-clipped=2.0 2023-06-22 20:20:25,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1311552.0, ans=0.0 2023-06-22 20:20:39,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1311552.0, ans=0.125 2023-06-22 20:20:49,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1311552.0, ans=0.125 2023-06-22 20:21:04,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1311612.0, ans=0.2 2023-06-22 20:21:08,417 INFO [train.py:996] (0/4) Epoch 8, batch 5150, loss[loss=0.1969, simple_loss=0.2862, pruned_loss=0.05378, over 19848.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3102, pruned_loss=0.08087, over 4288166.98 frames. ], batch size: 703, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:21:09,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1311672.0, ans=0.125 2023-06-22 20:21:44,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1311792.0, ans=0.1 2023-06-22 20:22:11,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=12.0 2023-06-22 20:22:34,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1311912.0, ans=0.07 2023-06-22 20:22:38,220 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-06-22 20:22:44,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-22 20:22:52,483 INFO [train.py:996] (0/4) Epoch 8, batch 5200, loss[loss=0.2285, simple_loss=0.3104, pruned_loss=0.07325, over 21253.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.313, pruned_loss=0.08218, over 4288521.39 frames. 
], batch size: 176, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:23:05,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1311972.0, ans=0.125 2023-06-22 20:23:09,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1312032.0, ans=0.125 2023-06-22 20:23:26,974 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.134e+02 4.423e+02 5.674e+02 8.806e+02 1.736e+03, threshold=1.135e+03, percent-clipped=18.0 2023-06-22 20:23:56,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.70 vs. limit=15.0 2023-06-22 20:24:16,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1312212.0, ans=0.0 2023-06-22 20:24:29,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1312212.0, ans=0.0 2023-06-22 20:24:32,954 INFO [train.py:996] (0/4) Epoch 8, batch 5250, loss[loss=0.2207, simple_loss=0.2965, pruned_loss=0.07243, over 21226.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3181, pruned_loss=0.0817, over 4287707.38 frames. ], batch size: 159, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:24:40,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1312272.0, ans=0.1 2023-06-22 20:25:03,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1312332.0, ans=0.125 2023-06-22 20:25:04,418 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=15.0 2023-06-22 20:25:52,120 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:26:11,380 INFO [train.py:996] (0/4) Epoch 8, batch 5300, loss[loss=0.2535, simple_loss=0.3186, pruned_loss=0.09413, over 21893.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3178, pruned_loss=0.08325, over 4291860.19 frames. ], batch size: 371, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:26:36,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=22.5 2023-06-22 20:26:44,653 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.970e+02 3.721e+02 4.525e+02 6.404e+02 1.262e+03, threshold=9.050e+02, percent-clipped=2.0 2023-06-22 20:26:46,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1312692.0, ans=0.125 2023-06-22 20:27:49,175 INFO [train.py:996] (0/4) Epoch 8, batch 5350, loss[loss=0.2497, simple_loss=0.3191, pruned_loss=0.09017, over 21356.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3167, pruned_loss=0.08469, over 4289871.97 frames. 
], batch size: 159, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:28:45,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1313052.0, ans=0.125 2023-06-22 20:28:57,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1313052.0, ans=0.125 2023-06-22 20:29:02,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1313112.0, ans=0.0 2023-06-22 20:29:10,077 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=22.5 2023-06-22 20:29:22,763 INFO [train.py:996] (0/4) Epoch 8, batch 5400, loss[loss=0.2432, simple_loss=0.302, pruned_loss=0.09217, over 21247.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3147, pruned_loss=0.0854, over 4291480.13 frames. ], batch size: 143, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:29:34,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1313172.0, ans=0.0 2023-06-22 20:29:43,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1313232.0, ans=0.1 2023-06-22 20:30:01,572 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.001e+02 4.337e+02 6.656e+02 9.891e+02 1.935e+03, threshold=1.331e+03, percent-clipped=29.0 2023-06-22 20:30:04,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=12.0 2023-06-22 20:30:06,261 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=8.0 2023-06-22 20:30:20,756 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.64 vs. limit=6.0 2023-06-22 20:30:21,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1313292.0, ans=0.04949747468305833 2023-06-22 20:30:29,304 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-22 20:30:34,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1313352.0, ans=0.0 2023-06-22 20:31:03,456 INFO [train.py:996] (0/4) Epoch 8, batch 5450, loss[loss=0.2943, simple_loss=0.3927, pruned_loss=0.09797, over 21711.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3169, pruned_loss=0.08321, over 4293038.25 frames. ], batch size: 441, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:31:31,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. 
limit=15.0 2023-06-22 20:31:34,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1313532.0, ans=0.2 2023-06-22 20:31:54,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1313592.0, ans=0.125 2023-06-22 20:32:34,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1313712.0, ans=0.125 2023-06-22 20:32:49,373 INFO [train.py:996] (0/4) Epoch 8, batch 5500, loss[loss=0.2213, simple_loss=0.3249, pruned_loss=0.05884, over 21583.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3201, pruned_loss=0.07987, over 4281797.47 frames. ], batch size: 441, lr: 3.81e-03, grad_scale: 8.0 2023-06-22 20:32:59,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1313772.0, ans=0.1 2023-06-22 20:33:22,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1313832.0, ans=0.04949747468305833 2023-06-22 20:33:24,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1313832.0, ans=0.2 2023-06-22 20:33:30,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1313892.0, ans=0.0 2023-06-22 20:33:31,692 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.855e+02 4.320e+02 6.103e+02 1.036e+03 2.497e+03, threshold=1.221e+03, percent-clipped=15.0 2023-06-22 20:34:03,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1313952.0, ans=0.125 2023-06-22 20:34:29,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1314012.0, ans=0.0 2023-06-22 20:34:40,447 INFO [train.py:996] (0/4) Epoch 8, batch 5550, loss[loss=0.1826, simple_loss=0.2808, pruned_loss=0.04224, over 21684.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3186, pruned_loss=0.07633, over 4276356.26 frames. ], batch size: 298, lr: 3.81e-03, grad_scale: 8.0 2023-06-22 20:34:42,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1314072.0, ans=0.125 2023-06-22 20:35:56,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1314252.0, ans=0.0 2023-06-22 20:36:20,979 INFO [train.py:996] (0/4) Epoch 8, batch 5600, loss[loss=0.1457, simple_loss=0.2114, pruned_loss=0.04004, over 21904.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3166, pruned_loss=0.07383, over 4275922.68 frames. 
], batch size: 98, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:36:25,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1314372.0, ans=0.125 2023-06-22 20:36:29,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1314372.0, ans=0.1 2023-06-22 20:36:40,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1314432.0, ans=0.04949747468305833 2023-06-22 20:36:58,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.586e+02 4.144e+02 6.431e+02 9.394e+02 1.823e+03, threshold=1.286e+03, percent-clipped=11.0 2023-06-22 20:37:09,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1314492.0, ans=0.125 2023-06-22 20:37:15,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1314552.0, ans=0.125 2023-06-22 20:37:55,153 INFO [train.py:996] (0/4) Epoch 8, batch 5650, loss[loss=0.2639, simple_loss=0.332, pruned_loss=0.09792, over 21817.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3209, pruned_loss=0.0766, over 4282228.41 frames. ], batch size: 107, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:38:03,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1314672.0, ans=0.125 2023-06-22 20:38:24,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1314732.0, ans=0.0 2023-06-22 20:39:07,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1314852.0, ans=0.1 2023-06-22 20:39:34,670 INFO [train.py:996] (0/4) Epoch 8, batch 5700, loss[loss=0.2199, simple_loss=0.2877, pruned_loss=0.07606, over 21249.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3211, pruned_loss=0.07888, over 4275061.89 frames. ], batch size: 607, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:40:06,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1315032.0, ans=0.125 2023-06-22 20:40:12,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.709e+02 4.552e+02 6.267e+02 8.726e+02 1.736e+03, threshold=1.253e+03, percent-clipped=4.0 2023-06-22 20:41:19,549 INFO [train.py:996] (0/4) Epoch 8, batch 5750, loss[loss=0.1796, simple_loss=0.2714, pruned_loss=0.04392, over 21430.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3172, pruned_loss=0.07591, over 4276185.53 frames. ], batch size: 194, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:41:34,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.35 vs. 
limit=10.0 2023-06-22 20:41:35,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1315332.0, ans=0.04949747468305833 2023-06-22 20:42:02,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1315332.0, ans=0.1 2023-06-22 20:42:10,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1315392.0, ans=0.125 2023-06-22 20:42:12,022 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:42:54,526 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-22 20:42:59,384 INFO [train.py:996] (0/4) Epoch 8, batch 5800, loss[loss=0.2768, simple_loss=0.368, pruned_loss=0.09282, over 21658.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3138, pruned_loss=0.07407, over 4270549.12 frames. ], batch size: 414, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:43:42,648 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.530e+02 3.870e+02 5.356e+02 7.893e+02 2.349e+03, threshold=1.071e+03, percent-clipped=9.0 2023-06-22 20:44:02,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1315752.0, ans=0.0 2023-06-22 20:44:19,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1315812.0, ans=0.04949747468305833 2023-06-22 20:44:20,174 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-22 20:44:26,516 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-22 20:44:35,118 INFO [train.py:996] (0/4) Epoch 8, batch 5850, loss[loss=0.1911, simple_loss=0.2934, pruned_loss=0.04439, over 21623.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3107, pruned_loss=0.06973, over 4276764.67 frames. ], batch size: 263, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:44:51,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1315872.0, ans=0.2 2023-06-22 20:45:09,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1315932.0, ans=0.0 2023-06-22 20:45:47,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1316052.0, ans=0.0 2023-06-22 20:45:55,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1316112.0, ans=0.2 2023-06-22 20:46:07,390 INFO [train.py:996] (0/4) Epoch 8, batch 5900, loss[loss=0.2157, simple_loss=0.2896, pruned_loss=0.07091, over 21681.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.3045, pruned_loss=0.06488, over 4276708.25 frames. ], batch size: 230, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:46:12,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.60 vs. 
limit=15.0 2023-06-22 20:46:14,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1316172.0, ans=0.125 2023-06-22 20:46:14,877 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.25 vs. limit=10.0 2023-06-22 20:46:23,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1316172.0, ans=0.2 2023-06-22 20:46:48,292 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.361e+02 4.544e+02 6.338e+02 1.644e+03, threshold=9.088e+02, percent-clipped=4.0 2023-06-22 20:46:58,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1316292.0, ans=0.2 2023-06-22 20:47:16,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1316352.0, ans=10.0 2023-06-22 20:47:28,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1316412.0, ans=15.0 2023-06-22 20:47:28,633 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-22 20:47:29,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1316412.0, ans=0.0 2023-06-22 20:47:42,266 INFO [train.py:996] (0/4) Epoch 8, batch 5950, loss[loss=0.2613, simple_loss=0.3152, pruned_loss=0.1037, over 21858.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3033, pruned_loss=0.06887, over 4285703.60 frames. ], batch size: 107, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:47:51,711 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:48:01,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1316472.0, ans=0.125 2023-06-22 20:48:09,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1316532.0, ans=0.125 2023-06-22 20:48:14,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1316532.0, ans=0.125 2023-06-22 20:48:31,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1316592.0, ans=0.09899494936611666 2023-06-22 20:48:47,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1316652.0, ans=0.0 2023-06-22 20:49:19,245 INFO [train.py:996] (0/4) Epoch 8, batch 6000, loss[loss=0.2237, simple_loss=0.2828, pruned_loss=0.0823, over 21812.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2992, pruned_loss=0.07219, over 4285005.83 frames. ], batch size: 112, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:49:19,246 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 20:49:40,897 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2636, simple_loss=0.3606, pruned_loss=0.08334, over 1796401.00 frames. 
2023-06-22 20:49:40,898 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-22 20:50:16,506 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.164e+02 5.189e+02 7.580e+02 1.356e+03, threshold=1.038e+03, percent-clipped=17.0 2023-06-22 20:51:14,228 INFO [train.py:996] (0/4) Epoch 8, batch 6050, loss[loss=0.2835, simple_loss=0.3221, pruned_loss=0.1225, over 21296.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2949, pruned_loss=0.07399, over 4284609.41 frames. ], batch size: 473, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:51:29,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.15 vs. limit=22.5 2023-06-22 20:51:32,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1317132.0, ans=0.1 2023-06-22 20:51:33,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1317132.0, ans=0.125 2023-06-22 20:51:36,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1317132.0, ans=0.125 2023-06-22 20:51:41,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1317132.0, ans=0.125 2023-06-22 20:52:16,839 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-22 20:52:22,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1317252.0, ans=0.0 2023-06-22 20:52:25,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1317312.0, ans=0.1 2023-06-22 20:52:25,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1317312.0, ans=0.125 2023-06-22 20:52:51,306 INFO [train.py:996] (0/4) Epoch 8, batch 6100, loss[loss=0.2344, simple_loss=0.3489, pruned_loss=0.05996, over 19794.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2946, pruned_loss=0.0727, over 4284583.90 frames. ], batch size: 702, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:53:24,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1317432.0, ans=0.1 2023-06-22 20:53:29,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.740e+02 3.450e+02 4.266e+02 5.544e+02 1.374e+03, threshold=8.532e+02, percent-clipped=4.0 2023-06-22 20:53:32,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1317492.0, ans=0.125 2023-06-22 20:53:41,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1317552.0, ans=0.1 2023-06-22 20:53:49,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1317552.0, ans=0.1 2023-06-22 20:54:22,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1317612.0, ans=0.125 2023-06-22 20:54:28,770 INFO [train.py:996] (0/4) Epoch 8, batch 6150, loss[loss=0.2296, simple_loss=0.2977, pruned_loss=0.08077, over 21111.00 frames. 
], tot_loss[loss=0.225, simple_loss=0.2982, pruned_loss=0.07587, over 4287515.36 frames. ], batch size: 159, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:55:05,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1317792.0, ans=0.125 2023-06-22 20:56:06,759 INFO [train.py:996] (0/4) Epoch 8, batch 6200, loss[loss=0.2639, simple_loss=0.3414, pruned_loss=0.09319, over 21849.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3022, pruned_loss=0.07617, over 4279969.61 frames. ], batch size: 351, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:56:20,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1317972.0, ans=0.125 2023-06-22 20:56:44,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.020e+02 4.357e+02 5.420e+02 8.092e+02 2.121e+03, threshold=1.084e+03, percent-clipped=22.0 2023-06-22 20:57:29,565 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=22.5 2023-06-22 20:57:47,422 INFO [train.py:996] (0/4) Epoch 8, batch 6250, loss[loss=0.2364, simple_loss=0.3385, pruned_loss=0.0672, over 21765.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3079, pruned_loss=0.07591, over 4280213.68 frames. ], batch size: 332, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 20:57:48,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1318272.0, ans=0.125 2023-06-22 20:57:49,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1318272.0, ans=0.1 2023-06-22 20:58:01,380 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=12.0 2023-06-22 20:58:16,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1318332.0, ans=0.125 2023-06-22 20:58:52,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1318452.0, ans=0.125 2023-06-22 20:59:00,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1318452.0, ans=0.1 2023-06-22 20:59:22,310 INFO [train.py:996] (0/4) Epoch 8, batch 6300, loss[loss=0.2651, simple_loss=0.3284, pruned_loss=0.1009, over 21894.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3111, pruned_loss=0.07469, over 4277850.39 frames. ], batch size: 118, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 20:59:44,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1318632.0, ans=0.1 2023-06-22 20:59:59,991 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 4.288e+02 6.355e+02 8.462e+02 1.476e+03, threshold=1.271e+03, percent-clipped=15.0 2023-06-22 21:00:02,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1318692.0, ans=0.125 2023-06-22 21:00:47,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1318812.0, ans=0.2 2023-06-22 21:01:04,437 INFO [train.py:996] (0/4) Epoch 8, batch 6350, loss[loss=0.3236, simple_loss=0.379, pruned_loss=0.1341, over 21603.00 frames. 
], tot_loss[loss=0.2356, simple_loss=0.3131, pruned_loss=0.07905, over 4286493.63 frames. ], batch size: 414, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:01:14,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1318872.0, ans=0.125 2023-06-22 21:01:39,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1318992.0, ans=0.125 2023-06-22 21:02:23,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1319112.0, ans=0.0 2023-06-22 21:02:37,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1319112.0, ans=0.125 2023-06-22 21:02:43,844 INFO [train.py:996] (0/4) Epoch 8, batch 6400, loss[loss=0.2941, simple_loss=0.3578, pruned_loss=0.1152, over 21438.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3186, pruned_loss=0.083, over 4290298.31 frames. ], batch size: 471, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:03:14,891 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:03:19,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1319232.0, ans=0.125 2023-06-22 21:03:31,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.536e+02 4.404e+02 5.449e+02 7.418e+02 1.410e+03, threshold=1.090e+03, percent-clipped=1.0 2023-06-22 21:04:22,000 INFO [train.py:996] (0/4) Epoch 8, batch 6450, loss[loss=0.2553, simple_loss=0.3418, pruned_loss=0.08444, over 21882.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3213, pruned_loss=0.08217, over 4280379.87 frames. ], batch size: 317, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:05:05,994 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=22.5 2023-06-22 21:05:18,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1319592.0, ans=0.125 2023-06-22 21:05:25,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1319652.0, ans=0.2 2023-06-22 21:05:51,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1319712.0, ans=0.0 2023-06-22 21:05:59,765 INFO [train.py:996] (0/4) Epoch 8, batch 6500, loss[loss=0.2428, simple_loss=0.2932, pruned_loss=0.09615, over 21256.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3155, pruned_loss=0.08166, over 4274413.30 frames. 
], batch size: 471, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:06:24,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1319832.0, ans=0.2 2023-06-22 21:06:48,031 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.971e+02 4.929e+02 6.597e+02 9.364e+02 1.745e+03, threshold=1.319e+03, percent-clipped=16.0 2023-06-22 21:06:58,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1319892.0, ans=0.0 2023-06-22 21:07:18,201 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-220000.pt 2023-06-22 21:07:44,857 INFO [train.py:996] (0/4) Epoch 8, batch 6550, loss[loss=0.1828, simple_loss=0.3127, pruned_loss=0.02647, over 20840.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3142, pruned_loss=0.08042, over 4278662.62 frames. ], batch size: 607, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:07:48,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-22 21:08:04,051 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-22 21:08:59,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1320312.0, ans=0.125 2023-06-22 21:09:16,814 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:09:16,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1320372.0, ans=0.2 2023-06-22 21:09:17,843 INFO [train.py:996] (0/4) Epoch 8, batch 6600, loss[loss=0.1899, simple_loss=0.2533, pruned_loss=0.0632, over 21390.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3088, pruned_loss=0.07997, over 4272079.04 frames. ], batch size: 211, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:09:26,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1320372.0, ans=0.125 2023-06-22 21:09:56,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1320492.0, ans=10.0 2023-06-22 21:09:56,843 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.89 vs. limit=22.5 2023-06-22 21:09:57,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.721e+02 4.019e+02 5.750e+02 8.785e+02 1.668e+03, threshold=1.150e+03, percent-clipped=5.0 2023-06-22 21:10:55,805 INFO [train.py:996] (0/4) Epoch 8, batch 6650, loss[loss=0.1775, simple_loss=0.2575, pruned_loss=0.04878, over 21555.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3017, pruned_loss=0.07715, over 4279021.55 frames. ], batch size: 230, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:10:57,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1320672.0, ans=0.2 2023-06-22 21:11:02,906 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. 
limit=15.0 2023-06-22 21:11:06,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-22 21:11:54,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1320852.0, ans=0.125 2023-06-22 21:11:56,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.79 vs. limit=15.0 2023-06-22 21:12:02,891 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=15.0 2023-06-22 21:12:20,976 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-06-22 21:12:29,167 INFO [train.py:996] (0/4) Epoch 8, batch 6700, loss[loss=0.2243, simple_loss=0.2837, pruned_loss=0.08247, over 21520.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2954, pruned_loss=0.07634, over 4279200.62 frames. ], batch size: 230, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:12:54,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1321032.0, ans=0.125 2023-06-22 21:13:08,241 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.668e+02 3.822e+02 4.568e+02 6.367e+02 1.164e+03, threshold=9.137e+02, percent-clipped=1.0 2023-06-22 21:13:26,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1321152.0, ans=0.0 2023-06-22 21:14:02,872 INFO [train.py:996] (0/4) Epoch 8, batch 6750, loss[loss=0.2548, simple_loss=0.3617, pruned_loss=0.07398, over 19825.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2944, pruned_loss=0.07764, over 4281021.51 frames. ], batch size: 703, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:14:05,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1321272.0, ans=0.0 2023-06-22 21:14:28,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1321332.0, ans=0.0 2023-06-22 21:14:33,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1321332.0, ans=0.0 2023-06-22 21:14:54,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1321392.0, ans=0.09899494936611666 2023-06-22 21:14:57,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1321392.0, ans=0.125 2023-06-22 21:15:37,274 INFO [train.py:996] (0/4) Epoch 8, batch 6800, loss[loss=0.2759, simple_loss=0.3207, pruned_loss=0.1155, over 21822.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2974, pruned_loss=0.07984, over 4271448.77 frames. 
], batch size: 118, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:15:39,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1321572.0, ans=0.1 2023-06-22 21:15:48,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1321572.0, ans=0.0 2023-06-22 21:15:59,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1321632.0, ans=0.0 2023-06-22 21:16:04,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1321632.0, ans=0.125 2023-06-22 21:16:16,533 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.952e+02 4.656e+02 6.234e+02 8.844e+02 1.935e+03, threshold=1.247e+03, percent-clipped=22.0 2023-06-22 21:16:29,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1321692.0, ans=0.0 2023-06-22 21:16:31,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1321692.0, ans=0.2 2023-06-22 21:16:39,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-22 21:17:04,209 INFO [train.py:996] (0/4) Epoch 8, batch 6850, loss[loss=0.225, simple_loss=0.2896, pruned_loss=0.08017, over 21546.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.2981, pruned_loss=0.08045, over 4263465.58 frames. ], batch size: 389, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:17:39,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1321932.0, ans=0.2 2023-06-22 21:17:44,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1321992.0, ans=0.2 2023-06-22 21:18:05,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1322052.0, ans=0.0 2023-06-22 21:18:12,048 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:18:14,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-22 21:18:23,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1322112.0, ans=0.125 2023-06-22 21:18:33,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1322112.0, ans=0.1 2023-06-22 21:18:48,721 INFO [train.py:996] (0/4) Epoch 8, batch 6900, loss[loss=0.2159, simple_loss=0.2967, pruned_loss=0.06757, over 21898.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2997, pruned_loss=0.0812, over 4277321.56 frames. 
], batch size: 316, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:18:55,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1322172.0, ans=0.05 2023-06-22 21:19:08,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1322232.0, ans=0.2 2023-06-22 21:19:26,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1322292.0, ans=0.125 2023-06-22 21:19:34,854 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.637e+02 4.516e+02 6.241e+02 9.290e+02 1.863e+03, threshold=1.248e+03, percent-clipped=14.0 2023-06-22 21:19:48,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1322352.0, ans=0.0 2023-06-22 21:20:33,619 INFO [train.py:996] (0/4) Epoch 8, batch 6950, loss[loss=0.3187, simple_loss=0.3721, pruned_loss=0.1326, over 21337.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3034, pruned_loss=0.07887, over 4278265.58 frames. ], batch size: 507, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:21:12,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1322592.0, ans=0.125 2023-06-22 21:22:06,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1322712.0, ans=0.125 2023-06-22 21:22:07,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-22 21:22:12,759 INFO [train.py:996] (0/4) Epoch 8, batch 7000, loss[loss=0.2684, simple_loss=0.3121, pruned_loss=0.1123, over 21125.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3071, pruned_loss=0.08115, over 4271105.95 frames. ], batch size: 143, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:22:26,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1322772.0, ans=0.0 2023-06-22 21:22:54,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.120e+02 4.125e+02 4.702e+02 7.118e+02 1.401e+03, threshold=9.403e+02, percent-clipped=1.0 2023-06-22 21:23:08,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1322952.0, ans=0.2 2023-06-22 21:23:51,498 INFO [train.py:996] (0/4) Epoch 8, batch 7050, loss[loss=0.1701, simple_loss=0.2423, pruned_loss=0.04896, over 21492.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3032, pruned_loss=0.08064, over 4268691.24 frames. ], batch size: 131, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:24:04,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.77 vs. limit=5.0 2023-06-22 21:24:22,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1323132.0, ans=0.035 2023-06-22 21:24:53,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1323252.0, ans=0.07 2023-06-22 21:25:31,406 INFO [train.py:996] (0/4) Epoch 8, batch 7100, loss[loss=0.2084, simple_loss=0.286, pruned_loss=0.06535, over 21681.00 frames. ], tot_loss[loss=0.233, simple_loss=0.306, pruned_loss=0.08006, over 4258915.44 frames. 
], batch size: 298, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:26:12,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 4.059e+02 5.159e+02 6.413e+02 1.166e+03, threshold=1.032e+03, percent-clipped=5.0 2023-06-22 21:26:26,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1323492.0, ans=0.025 2023-06-22 21:26:43,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1323552.0, ans=0.2 2023-06-22 21:27:15,337 INFO [train.py:996] (0/4) Epoch 8, batch 7150, loss[loss=0.2299, simple_loss=0.2878, pruned_loss=0.08595, over 20700.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3047, pruned_loss=0.07803, over 4260967.42 frames. ], batch size: 607, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:27:19,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1323672.0, ans=0.0 2023-06-22 21:28:54,628 INFO [train.py:996] (0/4) Epoch 8, batch 7200, loss[loss=0.2403, simple_loss=0.3553, pruned_loss=0.06265, over 19770.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3071, pruned_loss=0.08021, over 4260090.95 frames. ], batch size: 703, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:29:28,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1324032.0, ans=0.0 2023-06-22 21:29:31,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1324092.0, ans=0.1 2023-06-22 21:29:35,436 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.797e+02 4.643e+02 6.367e+02 8.701e+02 1.653e+03, threshold=1.273e+03, percent-clipped=12.0 2023-06-22 21:29:42,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1324092.0, ans=0.125 2023-06-22 21:30:27,776 INFO [train.py:996] (0/4) Epoch 8, batch 7250, loss[loss=0.1824, simple_loss=0.2466, pruned_loss=0.05907, over 21432.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3037, pruned_loss=0.0807, over 4267848.53 frames. ], batch size: 212, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:30:51,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1324332.0, ans=0.0 2023-06-22 21:31:19,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1324392.0, ans=0.125 2023-06-22 21:31:21,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1324392.0, ans=0.125 2023-06-22 21:31:31,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1324452.0, ans=0.0 2023-06-22 21:32:00,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=1324512.0, ans=0.2 2023-06-22 21:32:03,432 INFO [train.py:996] (0/4) Epoch 8, batch 7300, loss[loss=0.2078, simple_loss=0.2709, pruned_loss=0.07231, over 21976.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2966, pruned_loss=0.07895, over 4272399.23 frames. ], batch size: 103, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:32:20,489 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.46 vs. 
limit=10.0 2023-06-22 21:32:49,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 4.041e+02 5.063e+02 7.441e+02 1.428e+03, threshold=1.013e+03, percent-clipped=2.0 2023-06-22 21:33:43,008 INFO [train.py:996] (0/4) Epoch 8, batch 7350, loss[loss=0.2513, simple_loss=0.3227, pruned_loss=0.08992, over 21485.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2953, pruned_loss=0.07949, over 4267351.24 frames. ], batch size: 131, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:33:49,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-22 21:33:52,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1324872.0, ans=15.0 2023-06-22 21:34:24,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1324992.0, ans=0.0 2023-06-22 21:34:52,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1325052.0, ans=0.0 2023-06-22 21:35:10,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1325112.0, ans=0.0 2023-06-22 21:35:24,115 INFO [train.py:996] (0/4) Epoch 8, batch 7400, loss[loss=0.2422, simple_loss=0.3357, pruned_loss=0.07431, over 21704.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3006, pruned_loss=0.08181, over 4268643.29 frames. ], batch size: 415, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:36:10,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.225e+02 4.086e+02 4.943e+02 6.567e+02 1.302e+03, threshold=9.886e+02, percent-clipped=5.0 2023-06-22 21:36:51,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1325412.0, ans=0.0 2023-06-22 21:36:58,176 INFO [train.py:996] (0/4) Epoch 8, batch 7450, loss[loss=0.1846, simple_loss=0.2563, pruned_loss=0.05641, over 21397.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.298, pruned_loss=0.08047, over 4255811.96 frames. ], batch size: 131, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:38:09,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-22 21:38:10,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1325652.0, ans=0.125 2023-06-22 21:38:31,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1325712.0, ans=0.1 2023-06-22 21:38:38,918 INFO [train.py:996] (0/4) Epoch 8, batch 7500, loss[loss=0.2474, simple_loss=0.3524, pruned_loss=0.07123, over 21625.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3033, pruned_loss=0.08145, over 4263363.50 frames. ], batch size: 263, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:39:35,931 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.043e+02 4.495e+02 6.774e+02 8.927e+02 1.705e+03, threshold=1.355e+03, percent-clipped=18.0 2023-06-22 21:39:40,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. 
limit=10.0 2023-06-22 21:39:46,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1325952.0, ans=0.125 2023-06-22 21:40:24,013 INFO [train.py:996] (0/4) Epoch 8, batch 7550, loss[loss=0.2399, simple_loss=0.313, pruned_loss=0.08341, over 21795.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3115, pruned_loss=0.08097, over 4271694.12 frames. ], batch size: 118, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:40:42,709 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=22.5 2023-06-22 21:41:14,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1326192.0, ans=0.0 2023-06-22 21:41:38,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1326252.0, ans=0.125 2023-06-22 21:41:43,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-22 21:41:46,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1326312.0, ans=0.125 2023-06-22 21:41:51,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1326312.0, ans=0.1 2023-06-22 21:41:54,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.34 vs. limit=10.0 2023-06-22 21:42:00,789 INFO [train.py:996] (0/4) Epoch 8, batch 7600, loss[loss=0.2234, simple_loss=0.2998, pruned_loss=0.07347, over 21744.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3106, pruned_loss=0.08016, over 4274801.62 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:42:13,196 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=22.5 2023-06-22 21:42:50,588 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.183e+02 4.607e+02 6.581e+02 9.726e+02 1.530e+03, threshold=1.316e+03, percent-clipped=7.0 2023-06-22 21:43:15,321 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:43:28,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1326612.0, ans=0.125 2023-06-22 21:43:38,632 INFO [train.py:996] (0/4) Epoch 8, batch 7650, loss[loss=0.2535, simple_loss=0.3175, pruned_loss=0.09476, over 21768.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3113, pruned_loss=0.0812, over 4283540.86 frames. 
], batch size: 389, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:44:00,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1326672.0, ans=0.0 2023-06-22 21:44:20,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1326732.0, ans=0.0 2023-06-22 21:44:23,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1326792.0, ans=0.125 2023-06-22 21:44:33,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1326792.0, ans=0.125 2023-06-22 21:44:36,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1326792.0, ans=0.125 2023-06-22 21:45:01,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-22 21:45:18,114 INFO [train.py:996] (0/4) Epoch 8, batch 7700, loss[loss=0.2681, simple_loss=0.3301, pruned_loss=0.103, over 21403.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3128, pruned_loss=0.0838, over 4286353.41 frames. ], batch size: 548, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:45:39,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1327032.0, ans=0.0 2023-06-22 21:46:03,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1327092.0, ans=0.125 2023-06-22 21:46:04,656 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.953e+02 3.847e+02 4.781e+02 6.112e+02 1.345e+03, threshold=9.563e+02, percent-clipped=1.0 2023-06-22 21:46:27,110 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.67 vs. limit=10.0 2023-06-22 21:46:34,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1327152.0, ans=0.1 2023-06-22 21:46:44,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1327212.0, ans=0.125 2023-06-22 21:46:58,554 INFO [train.py:996] (0/4) Epoch 8, batch 7750, loss[loss=0.2565, simple_loss=0.3482, pruned_loss=0.08239, over 21591.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3167, pruned_loss=0.08442, over 4280214.84 frames. ], batch size: 230, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:47:38,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1327392.0, ans=0.0 2023-06-22 21:47:41,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1327392.0, ans=0.0 2023-06-22 21:48:10,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1327452.0, ans=0.125 2023-06-22 21:48:18,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1327512.0, ans=0.1 2023-06-22 21:48:42,274 INFO [train.py:996] (0/4) Epoch 8, batch 7800, loss[loss=0.2484, simple_loss=0.3179, pruned_loss=0.08952, over 21856.00 frames. 
], tot_loss[loss=0.2441, simple_loss=0.3185, pruned_loss=0.08486, over 4277544.57 frames. ], batch size: 373, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:48:48,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1327572.0, ans=0.125 2023-06-22 21:49:19,174 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.166e+02 4.676e+02 6.316e+02 9.000e+02 2.015e+03, threshold=1.263e+03, percent-clipped=20.0 2023-06-22 21:49:29,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1327752.0, ans=0.125 2023-06-22 21:49:53,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1327812.0, ans=0.0 2023-06-22 21:49:56,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-22 21:50:02,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1327812.0, ans=0.125 2023-06-22 21:50:09,862 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=22.5 2023-06-22 21:50:14,995 INFO [train.py:996] (0/4) Epoch 8, batch 7850, loss[loss=0.2567, simple_loss=0.327, pruned_loss=0.09318, over 20673.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3131, pruned_loss=0.08466, over 4263464.73 frames. ], batch size: 607, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:50:32,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1327872.0, ans=0.2 2023-06-22 21:50:36,464 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0 2023-06-22 21:51:15,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1328052.0, ans=0.0 2023-06-22 21:51:57,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=22.5 2023-06-22 21:52:00,794 INFO [train.py:996] (0/4) Epoch 8, batch 7900, loss[loss=0.1883, simple_loss=0.2659, pruned_loss=0.05537, over 21232.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3102, pruned_loss=0.08497, over 4261399.64 frames. ], batch size: 176, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:52:01,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.94 vs. 
limit=12.0 2023-06-22 21:52:16,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1328232.0, ans=0.0 2023-06-22 21:52:44,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 3.830e+02 4.891e+02 7.312e+02 1.897e+03, threshold=9.781e+02, percent-clipped=5.0 2023-06-22 21:53:06,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1328352.0, ans=0.0 2023-06-22 21:53:32,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1328412.0, ans=0.125 2023-06-22 21:53:41,633 INFO [train.py:996] (0/4) Epoch 8, batch 7950, loss[loss=0.2384, simple_loss=0.3259, pruned_loss=0.07549, over 21892.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3138, pruned_loss=0.08391, over 4255776.33 frames. ], batch size: 316, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:53:45,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-22 21:53:51,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1328472.0, ans=0.125 2023-06-22 21:54:17,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.85 vs. limit=15.0 2023-06-22 21:54:45,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1328652.0, ans=0.05 2023-06-22 21:55:24,233 INFO [train.py:996] (0/4) Epoch 8, batch 8000, loss[loss=0.2609, simple_loss=0.3476, pruned_loss=0.08711, over 21600.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3171, pruned_loss=0.08588, over 4257994.45 frames. ], batch size: 414, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:55:35,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1328772.0, ans=0.125 2023-06-22 21:55:36,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1328772.0, ans=0.0 2023-06-22 21:56:24,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1328892.0, ans=0.0 2023-06-22 21:56:25,235 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.345e+02 4.453e+02 5.766e+02 9.258e+02 3.143e+03, threshold=1.153e+03, percent-clipped=22.0 2023-06-22 21:56:34,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1328952.0, ans=0.07 2023-06-22 21:56:47,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1328952.0, ans=0.0 2023-06-22 21:57:10,678 INFO [train.py:996] (0/4) Epoch 8, batch 8050, loss[loss=0.2325, simple_loss=0.337, pruned_loss=0.06395, over 20787.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3177, pruned_loss=0.0859, over 4248190.59 frames. ], batch size: 609, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:57:47,735 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.37 vs. 
limit=10.0 2023-06-22 21:58:43,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1329312.0, ans=0.0 2023-06-22 21:58:56,257 INFO [train.py:996] (0/4) Epoch 8, batch 8100, loss[loss=0.2667, simple_loss=0.3379, pruned_loss=0.09775, over 21733.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3187, pruned_loss=0.08623, over 4253620.62 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:59:15,668 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-22 21:59:47,645 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.065e+02 4.714e+02 7.153e+02 1.179e+03 2.402e+03, threshold=1.431e+03, percent-clipped=27.0 2023-06-22 22:00:43,226 INFO [train.py:996] (0/4) Epoch 8, batch 8150, loss[loss=0.2259, simple_loss=0.2994, pruned_loss=0.07625, over 21494.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3276, pruned_loss=0.08765, over 4261332.87 frames. ], batch size: 195, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:00:59,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1329732.0, ans=0.0 2023-06-22 22:00:59,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1329732.0, ans=0.1 2023-06-22 22:01:06,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1329732.0, ans=0.0 2023-06-22 22:01:18,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1329792.0, ans=0.125 2023-06-22 22:01:57,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1329912.0, ans=0.0 2023-06-22 22:02:22,166 INFO [train.py:996] (0/4) Epoch 8, batch 8200, loss[loss=0.2405, simple_loss=0.299, pruned_loss=0.091, over 21560.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3203, pruned_loss=0.08601, over 4251334.33 frames. ], batch size: 391, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:03:02,699 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.206e+02 4.983e+02 6.789e+02 1.065e+03 2.564e+03, threshold=1.358e+03, percent-clipped=14.0 2023-06-22 22:03:49,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1330212.0, ans=0.125 2023-06-22 22:03:56,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1330212.0, ans=0.1 2023-06-22 22:04:02,077 INFO [train.py:996] (0/4) Epoch 8, batch 8250, loss[loss=0.2742, simple_loss=0.3662, pruned_loss=0.09108, over 21644.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3197, pruned_loss=0.08556, over 4249489.31 frames. ], batch size: 414, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:04:25,631 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.53 vs. 
limit=12.0 2023-06-22 22:04:35,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1330392.0, ans=0.125 2023-06-22 22:05:00,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1330452.0, ans=0.2 2023-06-22 22:05:07,293 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=22.5 2023-06-22 22:05:29,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1330512.0, ans=0.0 2023-06-22 22:05:41,305 INFO [train.py:996] (0/4) Epoch 8, batch 8300, loss[loss=0.2609, simple_loss=0.3432, pruned_loss=0.08933, over 21622.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3169, pruned_loss=0.08271, over 4252854.23 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:05:48,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1330572.0, ans=0.125 2023-06-22 22:06:00,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1330632.0, ans=0.2 2023-06-22 22:06:27,258 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.942e+02 4.371e+02 5.179e+02 7.811e+02 1.980e+03, threshold=1.036e+03, percent-clipped=4.0 2023-06-22 22:06:38,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1330752.0, ans=10.0 2023-06-22 22:06:39,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-06-22 22:07:14,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1330812.0, ans=0.125 2023-06-22 22:07:17,087 INFO [train.py:996] (0/4) Epoch 8, batch 8350, loss[loss=0.234, simple_loss=0.3169, pruned_loss=0.07552, over 21753.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3172, pruned_loss=0.08106, over 4260351.98 frames. ], batch size: 282, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:07:19,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.87 vs. limit=22.5 2023-06-22 22:07:34,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1330932.0, ans=0.125 2023-06-22 22:07:49,661 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-22 22:07:57,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1330992.0, ans=0.125 2023-06-22 22:07:58,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1330992.0, ans=0.0 2023-06-22 22:08:01,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1330992.0, ans=0.0 2023-06-22 22:08:56,615 INFO [train.py:996] (0/4) Epoch 8, batch 8400, loss[loss=0.187, simple_loss=0.2781, pruned_loss=0.04798, over 21484.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.314, pruned_loss=0.0785, over 4262222.33 frames. 
], batch size: 212, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 22:09:09,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1331172.0, ans=0.125 2023-06-22 22:09:32,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1331292.0, ans=0.0 2023-06-22 22:09:35,095 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 3.911e+02 4.503e+02 6.126e+02 1.860e+03, threshold=9.006e+02, percent-clipped=8.0 2023-06-22 22:09:46,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1331352.0, ans=0.5 2023-06-22 22:10:20,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1331412.0, ans=0.125 2023-06-22 22:10:23,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1331412.0, ans=0.0 2023-06-22 22:10:34,280 INFO [train.py:996] (0/4) Epoch 8, batch 8450, loss[loss=0.1991, simple_loss=0.2379, pruned_loss=0.08012, over 20057.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3108, pruned_loss=0.07765, over 4273797.64 frames. ], batch size: 704, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 22:11:25,456 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=15.0 2023-06-22 22:11:34,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1331652.0, ans=0.125 2023-06-22 22:12:12,549 INFO [train.py:996] (0/4) Epoch 8, batch 8500, loss[loss=0.2398, simple_loss=0.2947, pruned_loss=0.09246, over 21365.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3073, pruned_loss=0.07872, over 4275767.29 frames. 
], batch size: 473, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 22:12:21,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1331772.0, ans=0.1 2023-06-22 22:12:24,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1331772.0, ans=0.125 2023-06-22 22:12:29,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1331832.0, ans=0.125 2023-06-22 22:12:57,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1331892.0, ans=0.1 2023-06-22 22:12:58,445 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.125e+02 4.200e+02 5.679e+02 8.112e+02 1.673e+03, threshold=1.136e+03, percent-clipped=13.0 2023-06-22 22:13:10,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1331952.0, ans=0.125 2023-06-22 22:13:30,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1331952.0, ans=0.1 2023-06-22 22:13:39,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1332012.0, ans=0.125 2023-06-22 22:13:53,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1332072.0, ans=0.0 2023-06-22 22:13:54,930 INFO [train.py:996] (0/4) Epoch 8, batch 8550, loss[loss=0.2449, simple_loss=0.327, pruned_loss=0.08136, over 21410.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3133, pruned_loss=0.08184, over 4276061.44 frames. ], batch size: 211, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:14:09,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1332132.0, ans=0.125 2023-06-22 22:14:35,596 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.77 vs. limit=12.0 2023-06-22 22:14:59,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1332252.0, ans=0.0 2023-06-22 22:15:31,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.87 vs. limit=15.0 2023-06-22 22:15:35,657 INFO [train.py:996] (0/4) Epoch 8, batch 8600, loss[loss=0.296, simple_loss=0.3639, pruned_loss=0.114, over 21533.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3193, pruned_loss=0.08322, over 4274012.71 frames. ], batch size: 131, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:16:22,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1332492.0, ans=0.1 2023-06-22 22:16:30,667 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.60 vs. 
limit=8.0 2023-06-22 22:16:31,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1332492.0, ans=0.125 2023-06-22 22:16:32,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.201e+02 4.143e+02 4.841e+02 5.659e+02 1.807e+03, threshold=9.683e+02, percent-clipped=7.0 2023-06-22 22:17:15,236 INFO [train.py:996] (0/4) Epoch 8, batch 8650, loss[loss=0.2287, simple_loss=0.3283, pruned_loss=0.0646, over 21647.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3264, pruned_loss=0.08438, over 4272513.20 frames. ], batch size: 441, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:17:24,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1332672.0, ans=0.0 2023-06-22 22:18:33,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1332852.0, ans=0.0 2023-06-22 22:18:38,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1332912.0, ans=0.0 2023-06-22 22:18:38,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1332912.0, ans=0.05 2023-06-22 22:18:43,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1332912.0, ans=0.125 2023-06-22 22:18:53,781 INFO [train.py:996] (0/4) Epoch 8, batch 8700, loss[loss=0.2566, simple_loss=0.3096, pruned_loss=0.1018, over 21458.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3173, pruned_loss=0.08054, over 4277685.70 frames. ], batch size: 441, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:19:08,427 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-22 22:19:14,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1333032.0, ans=0.125 2023-06-22 22:19:48,718 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.793e+02 4.063e+02 5.741e+02 9.934e+02 1.995e+03, threshold=1.148e+03, percent-clipped=26.0 2023-06-22 22:19:49,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1333092.0, ans=0.0 2023-06-22 22:20:18,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1333212.0, ans=0.2 2023-06-22 22:20:32,244 INFO [train.py:996] (0/4) Epoch 8, batch 8750, loss[loss=0.2357, simple_loss=0.2986, pruned_loss=0.08634, over 21865.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3133, pruned_loss=0.08135, over 4286830.38 frames. ], batch size: 351, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:20:44,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1333272.0, ans=0.0 2023-06-22 22:20:52,986 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=12.0 2023-06-22 22:21:11,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1333332.0, ans=0.0 2023-06-22 22:22:16,516 INFO [train.py:996] (0/4) Epoch 8, batch 8800, loss[loss=0.2503, simple_loss=0.3261, pruned_loss=0.0872, over 21626.00 frames. 
], tot_loss[loss=0.245, simple_loss=0.3206, pruned_loss=0.08472, over 4288396.38 frames. ], batch size: 263, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:22:42,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1333632.0, ans=0.125 2023-06-22 22:22:56,211 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.53 vs. limit=10.0 2023-06-22 22:23:07,921 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.331e+02 4.564e+02 6.230e+02 9.935e+02 2.348e+03, threshold=1.246e+03, percent-clipped=15.0 2023-06-22 22:23:14,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1333752.0, ans=0.0 2023-06-22 22:23:19,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1333752.0, ans=0.0 2023-06-22 22:23:50,302 INFO [train.py:996] (0/4) Epoch 8, batch 8850, loss[loss=0.2274, simple_loss=0.2876, pruned_loss=0.08358, over 20119.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3279, pruned_loss=0.08677, over 4275260.18 frames. ], batch size: 702, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:23:50,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1333872.0, ans=0.1 2023-06-22 22:24:26,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1333932.0, ans=0.2 2023-06-22 22:24:56,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1334052.0, ans=22.5 2023-06-22 22:25:26,325 INFO [train.py:996] (0/4) Epoch 8, batch 8900, loss[loss=0.2367, simple_loss=0.2953, pruned_loss=0.08909, over 21778.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3208, pruned_loss=0.08514, over 4275771.66 frames. ], batch size: 317, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:26:20,780 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.244e+02 4.551e+02 5.330e+02 7.944e+02 2.391e+03, threshold=1.066e+03, percent-clipped=3.0 2023-06-22 22:27:02,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1334412.0, ans=0.125 2023-06-22 22:27:11,652 INFO [train.py:996] (0/4) Epoch 8, batch 8950, loss[loss=0.2079, simple_loss=0.2746, pruned_loss=0.07066, over 21359.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3211, pruned_loss=0.084, over 4272349.86 frames. ], batch size: 194, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:27:24,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1334472.0, ans=0.0 2023-06-22 22:27:38,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1334532.0, ans=0.125 2023-06-22 22:27:56,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1334592.0, ans=0.0 2023-06-22 22:28:50,988 INFO [train.py:996] (0/4) Epoch 8, batch 9000, loss[loss=0.2503, simple_loss=0.3188, pruned_loss=0.09086, over 21603.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.316, pruned_loss=0.08361, over 4262732.02 frames. 
], batch size: 391, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:28:50,989 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-22 22:29:12,147 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2658, simple_loss=0.3603, pruned_loss=0.0856, over 1796401.00 frames. 2023-06-22 22:29:12,148 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-22 22:29:22,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1334772.0, ans=0.0 2023-06-22 22:29:35,496 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.29 vs. limit=15.0 2023-06-22 22:29:51,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1334892.0, ans=0.1 2023-06-22 22:30:00,393 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 4.159e+02 6.404e+02 9.275e+02 1.956e+03, threshold=1.281e+03, percent-clipped=15.0 2023-06-22 22:30:03,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1334952.0, ans=0.125 2023-06-22 22:30:51,427 INFO [train.py:996] (0/4) Epoch 8, batch 9050, loss[loss=0.2183, simple_loss=0.3005, pruned_loss=0.06805, over 21775.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.312, pruned_loss=0.08052, over 4258201.88 frames. ], batch size: 282, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:31:05,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-22 22:31:35,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1335192.0, ans=0.125 2023-06-22 22:32:20,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1335312.0, ans=0.125 2023-06-22 22:32:34,027 INFO [train.py:996] (0/4) Epoch 8, batch 9100, loss[loss=0.2706, simple_loss=0.3533, pruned_loss=0.09397, over 21319.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3158, pruned_loss=0.08271, over 4257489.80 frames. ], batch size: 549, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:32:40,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1335372.0, ans=0.2 2023-06-22 22:33:32,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.829e+02 4.374e+02 5.511e+02 8.272e+02 1.713e+03, threshold=1.102e+03, percent-clipped=4.0 2023-06-22 22:33:46,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1335552.0, ans=0.125 2023-06-22 22:33:57,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1335552.0, ans=0.125 2023-06-22 22:34:01,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1335612.0, ans=0.0 2023-06-22 22:34:07,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1335612.0, ans=0.125 2023-06-22 22:34:15,748 INFO [train.py:996] (0/4) Epoch 8, batch 9150, loss[loss=0.3107, simple_loss=0.3937, pruned_loss=0.1139, over 21633.00 frames. 
], tot_loss[loss=0.2417, simple_loss=0.321, pruned_loss=0.08119, over 4254214.22 frames. ], batch size: 441, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:34:16,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-22 22:34:32,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1335672.0, ans=0.0 2023-06-22 22:34:39,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1335732.0, ans=0.125 2023-06-22 22:35:38,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1335852.0, ans=0.0 2023-06-22 22:36:01,123 INFO [train.py:996] (0/4) Epoch 8, batch 9200, loss[loss=0.2917, simple_loss=0.3756, pruned_loss=0.1039, over 21610.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3251, pruned_loss=0.08091, over 4262033.19 frames. ], batch size: 414, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:36:59,569 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.939e+02 4.377e+02 5.436e+02 8.538e+02 1.737e+03, threshold=1.087e+03, percent-clipped=12.0 2023-06-22 22:37:11,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1336152.0, ans=0.125 2023-06-22 22:37:39,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1336272.0, ans=0.2 2023-06-22 22:37:41,000 INFO [train.py:996] (0/4) Epoch 8, batch 9250, loss[loss=0.2285, simple_loss=0.3023, pruned_loss=0.07731, over 21195.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3275, pruned_loss=0.08466, over 4256339.24 frames. ], batch size: 143, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:38:51,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1336452.0, ans=0.0 2023-06-22 22:39:15,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1336572.0, ans=0.125 2023-06-22 22:39:16,221 INFO [train.py:996] (0/4) Epoch 8, batch 9300, loss[loss=0.2045, simple_loss=0.2681, pruned_loss=0.07046, over 21498.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3208, pruned_loss=0.08412, over 4257224.60 frames. 
], batch size: 230, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:39:49,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1336632.0, ans=0.125 2023-06-22 22:40:02,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1336692.0, ans=0.0 2023-06-22 22:40:03,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1336692.0, ans=0.125 2023-06-22 22:40:11,047 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.867e+02 5.198e+02 7.448e+02 1.175e+03 2.635e+03, threshold=1.490e+03, percent-clipped=31.0 2023-06-22 22:40:48,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1336812.0, ans=0.125 2023-06-22 22:40:50,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-22 22:40:51,251 INFO [train.py:996] (0/4) Epoch 8, batch 9350, loss[loss=0.2498, simple_loss=0.3307, pruned_loss=0.08446, over 21412.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.326, pruned_loss=0.08507, over 4262296.87 frames. ], batch size: 131, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:41:41,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1336992.0, ans=0.125 2023-06-22 22:41:41,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1336992.0, ans=0.125 2023-06-22 22:42:13,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.40 vs. limit=10.0 2023-06-22 22:42:36,766 INFO [train.py:996] (0/4) Epoch 8, batch 9400, loss[loss=0.2459, simple_loss=0.3187, pruned_loss=0.08658, over 20158.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3277, pruned_loss=0.0863, over 4262502.62 frames. ], batch size: 702, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:43:00,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1337232.0, ans=0.0 2023-06-22 22:43:04,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1337232.0, ans=0.125 2023-06-22 22:43:30,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1337292.0, ans=0.2 2023-06-22 22:43:32,554 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.205e+02 4.546e+02 6.111e+02 8.751e+02 2.078e+03, threshold=1.222e+03, percent-clipped=3.0 2023-06-22 22:43:56,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1337412.0, ans=0.0 2023-06-22 22:44:16,702 INFO [train.py:996] (0/4) Epoch 8, batch 9450, loss[loss=0.2123, simple_loss=0.2737, pruned_loss=0.07539, over 21868.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3181, pruned_loss=0.08456, over 4264757.86 frames. ], batch size: 373, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:44:33,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.28 vs. 
limit=15.0 2023-06-22 22:44:57,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1337592.0, ans=0.1 2023-06-22 22:45:10,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1337592.0, ans=0.125 2023-06-22 22:45:12,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1337652.0, ans=0.1 2023-06-22 22:45:44,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1337712.0, ans=0.1 2023-06-22 22:45:54,752 INFO [train.py:996] (0/4) Epoch 8, batch 9500, loss[loss=0.2464, simple_loss=0.3016, pruned_loss=0.09557, over 21850.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3113, pruned_loss=0.08212, over 4259118.07 frames. ], batch size: 107, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:46:50,871 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.196e+02 5.640e+02 7.713e+02 1.096e+03 2.487e+03, threshold=1.543e+03, percent-clipped=16.0 2023-06-22 22:46:55,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-22 22:47:12,747 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 22:47:26,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-22 22:47:34,347 INFO [train.py:996] (0/4) Epoch 8, batch 9550, loss[loss=0.2547, simple_loss=0.3341, pruned_loss=0.08761, over 21748.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3155, pruned_loss=0.08428, over 4255955.19 frames. ], batch size: 332, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:48:53,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1338312.0, ans=0.125 2023-06-22 22:49:00,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.63 vs. limit=6.0 2023-06-22 22:49:14,079 INFO [train.py:996] (0/4) Epoch 8, batch 9600, loss[loss=0.2614, simple_loss=0.3156, pruned_loss=0.1036, over 21598.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3172, pruned_loss=0.08532, over 4263109.39 frames. ], batch size: 548, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:49:26,027 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-22 22:49:30,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.06 vs. 
limit=15.0 2023-06-22 22:49:31,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1338432.0, ans=0.125 2023-06-22 22:49:33,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1338432.0, ans=0.125 2023-06-22 22:49:35,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1338432.0, ans=0.0 2023-06-22 22:49:37,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1338432.0, ans=0.125 2023-06-22 22:50:02,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1338492.0, ans=0.125 2023-06-22 22:50:03,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.129e+02 4.133e+02 5.747e+02 7.464e+02 1.666e+03, threshold=1.149e+03, percent-clipped=1.0 2023-06-22 22:50:04,279 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-22 22:50:49,772 INFO [train.py:996] (0/4) Epoch 8, batch 9650, loss[loss=0.2651, simple_loss=0.3306, pruned_loss=0.09982, over 21733.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3176, pruned_loss=0.08514, over 4267683.62 frames. ], batch size: 298, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:50:52,239 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=12.0 2023-06-22 22:51:07,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1338732.0, ans=0.04949747468305833 2023-06-22 22:52:28,544 INFO [train.py:996] (0/4) Epoch 8, batch 9700, loss[loss=0.238, simple_loss=0.3084, pruned_loss=0.08385, over 21791.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3221, pruned_loss=0.08633, over 4271546.39 frames. ], batch size: 124, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:52:55,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1339032.0, ans=0.0 2023-06-22 22:53:18,054 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.146e+02 4.568e+02 6.321e+02 8.796e+02 1.656e+03, threshold=1.264e+03, percent-clipped=3.0 2023-06-22 22:53:20,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1339152.0, ans=0.04949747468305833 2023-06-22 22:54:05,590 INFO [train.py:996] (0/4) Epoch 8, batch 9750, loss[loss=0.2913, simple_loss=0.3812, pruned_loss=0.1006, over 21833.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3166, pruned_loss=0.08495, over 4273225.60 frames. ], batch size: 118, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:54:47,233 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.04 vs. 
limit=15.0 2023-06-22 22:54:48,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1339392.0, ans=0.125 2023-06-22 22:54:56,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1339452.0, ans=0.0 2023-06-22 22:55:01,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.40 vs. limit=22.5 2023-06-22 22:55:31,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1339512.0, ans=0.125 2023-06-22 22:55:42,080 INFO [train.py:996] (0/4) Epoch 8, batch 9800, loss[loss=0.2242, simple_loss=0.2954, pruned_loss=0.07647, over 21656.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3158, pruned_loss=0.08489, over 4276799.96 frames. ], batch size: 230, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:56:02,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1339632.0, ans=0.1 2023-06-22 22:56:04,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1339632.0, ans=0.2 2023-06-22 22:56:09,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-22 22:56:21,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1339692.0, ans=0.2 2023-06-22 22:56:31,484 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.189e+02 3.646e+02 4.309e+02 6.187e+02 1.699e+03, threshold=8.618e+02, percent-clipped=3.0 2023-06-22 22:57:19,962 INFO [train.py:996] (0/4) Epoch 8, batch 9850, loss[loss=0.2276, simple_loss=0.3207, pruned_loss=0.06726, over 16029.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.312, pruned_loss=0.08491, over 4277515.71 frames. ], batch size: 60, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:57:27,396 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.64 vs. limit=15.0 2023-06-22 22:58:12,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.77 vs. limit=15.0 2023-06-22 22:58:54,092 INFO [train.py:996] (0/4) Epoch 8, batch 9900, loss[loss=0.2739, simple_loss=0.3427, pruned_loss=0.1025, over 21376.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3084, pruned_loss=0.08439, over 4269546.53 frames. ], batch size: 471, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:59:45,634 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.205e+02 4.510e+02 5.793e+02 9.115e+02 1.830e+03, threshold=1.159e+03, percent-clipped=29.0 2023-06-22 23:00:13,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1340412.0, ans=0.2 2023-06-22 23:00:33,433 INFO [train.py:996] (0/4) Epoch 8, batch 9950, loss[loss=0.2662, simple_loss=0.3161, pruned_loss=0.1082, over 21705.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3131, pruned_loss=0.08699, over 4256730.20 frames. 
], batch size: 112, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:00:45,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1340472.0, ans=0.125 2023-06-22 23:00:50,847 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.95 vs. limit=22.5 2023-06-22 23:00:56,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1340532.0, ans=0.0 2023-06-22 23:01:09,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1340592.0, ans=0.1 2023-06-22 23:01:15,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1340592.0, ans=0.05 2023-06-22 23:01:41,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1340652.0, ans=0.0 2023-06-22 23:02:13,507 INFO [train.py:996] (0/4) Epoch 8, batch 10000, loss[loss=0.2236, simple_loss=0.2931, pruned_loss=0.07699, over 20003.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3098, pruned_loss=0.08606, over 4259867.42 frames. ], batch size: 703, lr: 3.77e-03, grad_scale: 32.0 2023-06-22 23:02:25,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1340772.0, ans=0.0 2023-06-22 23:02:45,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1340832.0, ans=0.0 2023-06-22 23:02:48,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1340832.0, ans=0.0 2023-06-22 23:03:04,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1340892.0, ans=0.0 2023-06-22 23:03:05,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.683e+02 4.495e+02 6.092e+02 8.521e+02 2.124e+03, threshold=1.218e+03, percent-clipped=12.0 2023-06-22 23:03:54,455 INFO [train.py:996] (0/4) Epoch 8, batch 10050, loss[loss=0.1926, simple_loss=0.2731, pruned_loss=0.05605, over 21745.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.311, pruned_loss=0.08591, over 4260424.29 frames. ], batch size: 282, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:03:58,613 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.69 vs. 
limit=15.0 2023-06-22 23:04:01,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1341072.0, ans=0.125 2023-06-22 23:04:18,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1341132.0, ans=0.125 2023-06-22 23:04:20,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1341132.0, ans=0.09899494936611666 2023-06-22 23:04:21,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1341132.0, ans=0.1 2023-06-22 23:04:45,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1341192.0, ans=0.125 2023-06-22 23:05:33,513 INFO [train.py:996] (0/4) Epoch 8, batch 10100, loss[loss=0.2445, simple_loss=0.3266, pruned_loss=0.08117, over 21651.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3053, pruned_loss=0.08245, over 4261075.27 frames. ], batch size: 414, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:06:36,631 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-06-22 23:06:40,193 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.964e+02 4.513e+02 5.773e+02 8.039e+02 1.456e+03, threshold=1.155e+03, percent-clipped=7.0 2023-06-22 23:07:16,832 INFO [train.py:996] (0/4) Epoch 8, batch 10150, loss[loss=0.2436, simple_loss=0.3182, pruned_loss=0.08451, over 21634.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3116, pruned_loss=0.08471, over 4261156.06 frames. ], batch size: 441, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:08:29,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1341852.0, ans=0.125 2023-06-22 23:08:30,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1341852.0, ans=0.0 2023-06-22 23:08:55,277 INFO [train.py:996] (0/4) Epoch 8, batch 10200, loss[loss=0.2091, simple_loss=0.2973, pruned_loss=0.06045, over 21724.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3112, pruned_loss=0.08336, over 4266940.15 frames. ], batch size: 351, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:09:30,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1342032.0, ans=0.1 2023-06-22 23:09:59,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.693e+02 4.053e+02 5.323e+02 7.160e+02 1.292e+03, threshold=1.065e+03, percent-clipped=4.0 2023-06-22 23:10:00,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1342152.0, ans=0.2 2023-06-22 23:10:28,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1342212.0, ans=0.125 2023-06-22 23:10:35,055 INFO [train.py:996] (0/4) Epoch 8, batch 10250, loss[loss=0.1999, simple_loss=0.2798, pruned_loss=0.06004, over 21621.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3066, pruned_loss=0.07791, over 4275871.83 frames. 
], batch size: 263, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:10:54,355 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:11:20,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1342392.0, ans=0.125 2023-06-22 23:12:14,726 INFO [train.py:996] (0/4) Epoch 8, batch 10300, loss[loss=0.2746, simple_loss=0.341, pruned_loss=0.1041, over 21259.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3089, pruned_loss=0.07858, over 4272426.80 frames. ], batch size: 176, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:12:21,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1342572.0, ans=0.025 2023-06-22 23:12:39,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-22 23:12:46,530 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:12:46,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1342632.0, ans=0.2 2023-06-22 23:12:51,727 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-22 23:13:04,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1342692.0, ans=0.0 2023-06-22 23:13:15,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1342692.0, ans=0.0 2023-06-22 23:13:20,052 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.911e+02 4.156e+02 6.277e+02 8.296e+02 2.131e+03, threshold=1.255e+03, percent-clipped=15.0 2023-06-22 23:14:00,724 INFO [train.py:996] (0/4) Epoch 8, batch 10350, loss[loss=0.2417, simple_loss=0.324, pruned_loss=0.07967, over 21598.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3104, pruned_loss=0.07865, over 4270818.18 frames. ], batch size: 389, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:14:08,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1342872.0, ans=0.125 2023-06-22 23:14:19,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1342872.0, ans=0.125 2023-06-22 23:14:30,965 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:15:10,577 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.12 vs. limit=5.0 2023-06-22 23:15:17,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1343052.0, ans=0.2 2023-06-22 23:15:51,362 INFO [train.py:996] (0/4) Epoch 8, batch 10400, loss[loss=0.2472, simple_loss=0.3205, pruned_loss=0.08689, over 21728.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3052, pruned_loss=0.07804, over 4276646.78 frames. ], batch size: 391, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:15:57,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.08 vs. 
limit=15.0 2023-06-22 23:16:22,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0 2023-06-22 23:16:24,261 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=12.0 2023-06-22 23:16:32,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-22 23:16:45,513 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.450e+02 4.781e+02 6.358e+02 9.315e+02 2.129e+03, threshold=1.272e+03, percent-clipped=10.0 2023-06-22 23:16:48,254 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-22 23:17:31,417 INFO [train.py:996] (0/4) Epoch 8, batch 10450, loss[loss=0.2449, simple_loss=0.3078, pruned_loss=0.09097, over 21409.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3091, pruned_loss=0.08072, over 4270898.35 frames. ], batch size: 131, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:17:38,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1343472.0, ans=0.0 2023-06-22 23:19:09,680 INFO [train.py:996] (0/4) Epoch 8, batch 10500, loss[loss=0.2503, simple_loss=0.3004, pruned_loss=0.1, over 21179.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.309, pruned_loss=0.08031, over 4270237.61 frames. ], batch size: 143, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:19:15,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1343772.0, ans=0.04949747468305833 2023-06-22 23:19:17,266 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.73 vs. limit=12.0 2023-06-22 23:19:21,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1343772.0, ans=0.0 2023-06-22 23:19:59,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.74 vs. limit=22.5 2023-06-22 23:20:03,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.164e+02 4.877e+02 7.502e+02 1.116e+03 2.000e+03, threshold=1.500e+03, percent-clipped=17.0 2023-06-22 23:20:18,023 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-224000.pt 2023-06-22 23:20:27,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1344012.0, ans=10.0 2023-06-22 23:20:51,452 INFO [train.py:996] (0/4) Epoch 8, batch 10550, loss[loss=0.2169, simple_loss=0.2757, pruned_loss=0.07904, over 21760.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3037, pruned_loss=0.07998, over 4273098.53 frames. ], batch size: 124, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:20:52,793 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-06-22 23:21:25,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.59 vs. 
limit=15.0 2023-06-22 23:21:31,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1344192.0, ans=0.05 2023-06-22 23:21:40,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1344192.0, ans=0.125 2023-06-22 23:22:29,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1344312.0, ans=0.125 2023-06-22 23:22:43,178 INFO [train.py:996] (0/4) Epoch 8, batch 10600, loss[loss=0.2303, simple_loss=0.3265, pruned_loss=0.06706, over 21477.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2995, pruned_loss=0.07803, over 4268876.98 frames. ], batch size: 471, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:23:44,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1344492.0, ans=0.5 2023-06-22 23:23:49,146 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.068e+02 4.057e+02 5.623e+02 8.035e+02 1.796e+03, threshold=1.125e+03, percent-clipped=5.0 2023-06-22 23:24:11,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1344552.0, ans=0.125 2023-06-22 23:24:16,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=15.0 2023-06-22 23:24:20,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1344612.0, ans=0.09899494936611666 2023-06-22 23:24:27,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1344612.0, ans=0.04949747468305833 2023-06-22 23:24:32,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1344612.0, ans=0.1 2023-06-22 23:24:35,514 INFO [train.py:996] (0/4) Epoch 8, batch 10650, loss[loss=0.1881, simple_loss=0.2715, pruned_loss=0.05241, over 21747.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3023, pruned_loss=0.07657, over 4272154.74 frames. ], batch size: 332, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:25:04,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=1344732.0, ans=12.0 2023-06-22 23:25:12,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1344732.0, ans=0.0 2023-06-22 23:25:30,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1344792.0, ans=0.125 2023-06-22 23:25:33,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-22 23:26:28,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1344912.0, ans=0.125 2023-06-22 23:26:30,734 INFO [train.py:996] (0/4) Epoch 8, batch 10700, loss[loss=0.3139, simple_loss=0.3715, pruned_loss=0.1282, over 21402.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3005, pruned_loss=0.07642, over 4266746.19 frames. 
], batch size: 471, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:27:20,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1345092.0, ans=0.125 2023-06-22 23:27:36,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.264e+02 5.189e+02 6.885e+02 9.178e+02 1.741e+03, threshold=1.377e+03, percent-clipped=11.0 2023-06-22 23:27:40,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1345152.0, ans=0.0 2023-06-22 23:28:13,112 INFO [train.py:996] (0/4) Epoch 8, batch 10750, loss[loss=0.2736, simple_loss=0.3417, pruned_loss=0.1027, over 21415.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3111, pruned_loss=0.08111, over 4269514.70 frames. ], batch size: 131, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:28:15,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1345272.0, ans=0.125 2023-06-22 23:28:17,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1345272.0, ans=0.0 2023-06-22 23:28:23,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1345272.0, ans=0.0 2023-06-22 23:28:39,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1345332.0, ans=0.1 2023-06-22 23:28:41,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1345332.0, ans=0.2 2023-06-22 23:28:51,722 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:29:09,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1345392.0, ans=0.2 2023-06-22 23:29:15,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1345452.0, ans=0.2 2023-06-22 23:29:50,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1345512.0, ans=0.0 2023-06-22 23:29:55,300 INFO [train.py:996] (0/4) Epoch 8, batch 10800, loss[loss=0.2676, simple_loss=0.3434, pruned_loss=0.09592, over 21680.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3174, pruned_loss=0.08214, over 4270432.86 frames. ], batch size: 351, lr: 3.77e-03, grad_scale: 32.0 2023-06-22 23:30:09,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1345572.0, ans=0.2 2023-06-22 23:30:15,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1345632.0, ans=0.125 2023-06-22 23:31:04,402 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.247e+02 4.824e+02 6.509e+02 9.810e+02 2.428e+03, threshold=1.302e+03, percent-clipped=4.0 2023-06-22 23:31:08,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1345752.0, ans=10.0 2023-06-22 23:31:39,598 INFO [train.py:996] (0/4) Epoch 8, batch 10850, loss[loss=0.2268, simple_loss=0.288, pruned_loss=0.08278, over 21306.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3182, pruned_loss=0.08257, over 4272437.39 frames. 
], batch size: 144, lr: 3.77e-03, grad_scale: 32.0 2023-06-22 23:33:16,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1346112.0, ans=0.1 2023-06-22 23:33:19,391 INFO [train.py:996] (0/4) Epoch 8, batch 10900, loss[loss=0.2124, simple_loss=0.3008, pruned_loss=0.06198, over 21703.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3101, pruned_loss=0.08056, over 4272582.81 frames. ], batch size: 247, lr: 3.77e-03, grad_scale: 32.0 2023-06-22 23:34:15,196 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-22 23:34:23,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.825e+02 3.987e+02 5.547e+02 7.924e+02 1.642e+03, threshold=1.109e+03, percent-clipped=4.0 2023-06-22 23:34:27,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1346352.0, ans=0.0 2023-06-22 23:34:41,642 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-22 23:34:56,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1346412.0, ans=10.0 2023-06-22 23:35:00,092 INFO [train.py:996] (0/4) Epoch 8, batch 10950, loss[loss=0.2529, simple_loss=0.3043, pruned_loss=0.1007, over 21278.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3066, pruned_loss=0.07897, over 4264575.82 frames. ], batch size: 507, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:35:31,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1346532.0, ans=0.125 2023-06-22 23:35:45,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1346592.0, ans=0.1 2023-06-22 23:36:02,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1346652.0, ans=0.0 2023-06-22 23:36:13,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1346652.0, ans=0.0 2023-06-22 23:36:38,548 INFO [train.py:996] (0/4) Epoch 8, batch 11000, loss[loss=0.2585, simple_loss=0.3207, pruned_loss=0.0981, over 21884.00 frames. ], tot_loss[loss=0.232, simple_loss=0.305, pruned_loss=0.07948, over 4266088.50 frames. 
], batch size: 414, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:37:36,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1346892.0, ans=0.1 2023-06-22 23:37:43,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.878e+02 3.827e+02 4.499e+02 6.468e+02 1.217e+03, threshold=8.999e+02, percent-clipped=2.0 2023-06-22 23:37:49,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1346952.0, ans=0.125 2023-06-22 23:37:52,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1346952.0, ans=0.2 2023-06-22 23:38:08,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1347012.0, ans=0.0 2023-06-22 23:38:15,784 INFO [train.py:996] (0/4) Epoch 8, batch 11050, loss[loss=0.2107, simple_loss=0.2725, pruned_loss=0.07449, over 21489.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3018, pruned_loss=0.08027, over 4274936.21 frames. ], batch size: 195, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:38:57,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1347192.0, ans=10.0 2023-06-22 23:39:23,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1347252.0, ans=0.1 2023-06-22 23:39:26,303 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:39:54,418 INFO [train.py:996] (0/4) Epoch 8, batch 11100, loss[loss=0.2468, simple_loss=0.3026, pruned_loss=0.09552, over 20062.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3022, pruned_loss=0.0811, over 4279434.41 frames. ], batch size: 703, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:39:57,042 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2023-06-22 23:40:02,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1347372.0, ans=0.0 2023-06-22 23:40:59,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=15.0 2023-06-22 23:41:00,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.278e+02 4.372e+02 5.317e+02 7.818e+02 1.562e+03, threshold=1.063e+03, percent-clipped=13.0 2023-06-22 23:41:21,581 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=15.0 2023-06-22 23:41:32,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1347612.0, ans=0.5 2023-06-22 23:41:34,691 INFO [train.py:996] (0/4) Epoch 8, batch 11150, loss[loss=0.2535, simple_loss=0.337, pruned_loss=0.08505, over 21802.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.301, pruned_loss=0.08125, over 4272104.94 frames. 
], batch size: 317, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:42:27,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1347792.0, ans=0.125 2023-06-22 23:42:31,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1347792.0, ans=0.125 2023-06-22 23:42:34,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1347792.0, ans=0.125 2023-06-22 23:42:39,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1347852.0, ans=0.1 2023-06-22 23:43:02,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1347912.0, ans=0.2 2023-06-22 23:43:06,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.54 vs. limit=15.0 2023-06-22 23:43:15,320 INFO [train.py:996] (0/4) Epoch 8, batch 11200, loss[loss=0.2218, simple_loss=0.2783, pruned_loss=0.08263, over 21384.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.2991, pruned_loss=0.08056, over 4262680.56 frames. ], batch size: 212, lr: 3.76e-03, grad_scale: 32.0 2023-06-22 23:43:27,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1347972.0, ans=0.2 2023-06-22 23:44:19,562 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=15.0 2023-06-22 23:44:19,999 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.309e+02 4.266e+02 5.477e+02 7.611e+02 1.407e+03, threshold=1.095e+03, percent-clipped=4.0 2023-06-22 23:44:53,127 INFO [train.py:996] (0/4) Epoch 8, batch 11250, loss[loss=0.2733, simple_loss=0.3213, pruned_loss=0.1127, over 21433.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.299, pruned_loss=0.08035, over 4265180.28 frames. ], batch size: 509, lr: 3.76e-03, grad_scale: 32.0 2023-06-22 23:46:31,408 INFO [train.py:996] (0/4) Epoch 8, batch 11300, loss[loss=0.1873, simple_loss=0.2673, pruned_loss=0.05362, over 21453.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3001, pruned_loss=0.08003, over 4272712.38 frames. ], batch size: 211, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:47:15,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1348692.0, ans=0.0 2023-06-22 23:47:16,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-22 23:47:26,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1348692.0, ans=0.125 2023-06-22 23:47:39,342 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.094e+02 3.884e+02 4.784e+02 6.961e+02 1.768e+03, threshold=9.568e+02, percent-clipped=7.0 2023-06-22 23:47:57,988 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:48:11,961 INFO [train.py:996] (0/4) Epoch 8, batch 11350, loss[loss=0.2729, simple_loss=0.3428, pruned_loss=0.1015, over 21746.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3026, pruned_loss=0.08003, over 4269197.06 frames. 
], batch size: 124, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:48:22,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1348872.0, ans=0.125 2023-06-22 23:48:34,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1348932.0, ans=0.125 2023-06-22 23:48:50,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=22.5 2023-06-22 23:48:55,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-22 23:49:01,769 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-22 23:49:06,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=15.0 2023-06-22 23:49:08,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1348992.0, ans=0.1 2023-06-22 23:49:10,372 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=12.0 2023-06-22 23:49:14,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1349052.0, ans=0.035 2023-06-22 23:49:54,107 INFO [train.py:996] (0/4) Epoch 8, batch 11400, loss[loss=0.2913, simple_loss=0.3604, pruned_loss=0.1111, over 21703.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3091, pruned_loss=0.08256, over 4271042.69 frames. ], batch size: 441, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:49:58,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1349172.0, ans=0.125 2023-06-22 23:50:37,753 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.42 vs. limit=15.0 2023-06-22 23:50:51,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1349292.0, ans=0.125 2023-06-22 23:51:07,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.908e+02 4.448e+02 6.055e+02 8.360e+02 1.667e+03, threshold=1.211e+03, percent-clipped=10.0 2023-06-22 23:51:39,992 INFO [train.py:996] (0/4) Epoch 8, batch 11450, loss[loss=0.2847, simple_loss=0.3628, pruned_loss=0.1033, over 21600.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3087, pruned_loss=0.08054, over 4272900.64 frames. ], batch size: 414, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:52:17,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1349592.0, ans=0.125 2023-06-22 23:53:17,207 INFO [train.py:996] (0/4) Epoch 8, batch 11500, loss[loss=0.2188, simple_loss=0.2998, pruned_loss=0.06893, over 21289.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.313, pruned_loss=0.08204, over 4274474.99 frames. 
], batch size: 176, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:54:22,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.283e+02 4.405e+02 5.880e+02 8.915e+02 1.909e+03, threshold=1.176e+03, percent-clipped=7.0 2023-06-22 23:54:23,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.64 vs. limit=10.0 2023-06-22 23:55:00,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1350012.0, ans=0.0 2023-06-22 23:55:04,817 INFO [train.py:996] (0/4) Epoch 8, batch 11550, loss[loss=0.2354, simple_loss=0.3435, pruned_loss=0.06369, over 21220.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3177, pruned_loss=0.08156, over 4270913.47 frames. ], batch size: 548, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:55:15,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1350072.0, ans=0.0 2023-06-22 23:55:39,739 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.07 vs. limit=10.0 2023-06-22 23:55:56,114 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=12.0 2023-06-22 23:56:42,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1350312.0, ans=0.07 2023-06-22 23:56:43,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1350312.0, ans=0.2 2023-06-22 23:56:46,824 INFO [train.py:996] (0/4) Epoch 8, batch 11600, loss[loss=0.2217, simple_loss=0.288, pruned_loss=0.07776, over 20710.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3317, pruned_loss=0.08391, over 4265976.53 frames. ], batch size: 607, lr: 3.76e-03, grad_scale: 32.0 2023-06-22 23:57:21,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1350432.0, ans=0.125 2023-06-22 23:57:40,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1350492.0, ans=0.125 2023-06-22 23:57:49,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.035e+02 5.071e+02 7.210e+02 9.611e+02 2.245e+03, threshold=1.442e+03, percent-clipped=13.0 2023-06-22 23:58:16,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1350612.0, ans=0.0 2023-06-22 23:58:27,177 INFO [train.py:996] (0/4) Epoch 8, batch 11650, loss[loss=0.226, simple_loss=0.2997, pruned_loss=0.07614, over 21224.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3354, pruned_loss=0.08471, over 4266445.09 frames. ], batch size: 176, lr: 3.76e-03, grad_scale: 32.0 2023-06-22 23:59:19,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1350792.0, ans=0.04949747468305833 2023-06-22 23:59:26,172 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=12.0 2023-06-23 00:00:05,891 INFO [train.py:996] (0/4) Epoch 8, batch 11700, loss[loss=0.2359, simple_loss=0.2912, pruned_loss=0.09035, over 21481.00 frames. 
], tot_loss[loss=0.2471, simple_loss=0.3271, pruned_loss=0.08358, over 4252800.29 frames. ], batch size: 212, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:00:30,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1351032.0, ans=0.0 2023-06-23 00:00:33,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1351032.0, ans=0.2 2023-06-23 00:00:45,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-23 00:01:05,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1351152.0, ans=0.0 2023-06-23 00:01:08,223 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.318e+02 4.484e+02 5.547e+02 7.973e+02 1.731e+03, threshold=1.109e+03, percent-clipped=2.0 2023-06-23 00:01:12,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1351152.0, ans=0.0 2023-06-23 00:01:26,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1351152.0, ans=0.125 2023-06-23 00:01:45,127 INFO [train.py:996] (0/4) Epoch 8, batch 11750, loss[loss=0.2442, simple_loss=0.3167, pruned_loss=0.08579, over 21722.00 frames. ], tot_loss[loss=0.242, simple_loss=0.318, pruned_loss=0.083, over 4263563.10 frames. ], batch size: 351, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:01:59,362 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-06-23 00:02:56,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1351452.0, ans=0.125 2023-06-23 00:03:31,060 INFO [train.py:996] (0/4) Epoch 8, batch 11800, loss[loss=0.233, simple_loss=0.3232, pruned_loss=0.07141, over 21448.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3203, pruned_loss=0.08592, over 4258124.16 frames. ], batch size: 211, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:03:36,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1351572.0, ans=0.125 2023-06-23 00:03:40,482 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.31 vs. limit=10.0 2023-06-23 00:03:53,157 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.58 vs. 
limit=15.0 2023-06-23 00:04:02,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1351692.0, ans=0.125 2023-06-23 00:04:14,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1351692.0, ans=0.0 2023-06-23 00:04:32,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1351752.0, ans=0.125 2023-06-23 00:04:33,198 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.146e+02 4.952e+02 6.755e+02 1.112e+03 2.056e+03, threshold=1.351e+03, percent-clipped=25.0 2023-06-23 00:04:54,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1351812.0, ans=0.125 2023-06-23 00:05:11,136 INFO [train.py:996] (0/4) Epoch 8, batch 11850, loss[loss=0.3454, simple_loss=0.4001, pruned_loss=0.1453, over 21547.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.322, pruned_loss=0.08483, over 4262328.50 frames. ], batch size: 507, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:05:12,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-23 00:05:19,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1351872.0, ans=0.125 2023-06-23 00:05:21,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1351872.0, ans=0.2 2023-06-23 00:05:23,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-23 00:05:39,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1351932.0, ans=0.2 2023-06-23 00:05:49,823 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.28 vs. limit=22.5 2023-06-23 00:06:04,492 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-23 00:06:21,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1352052.0, ans=0.0 2023-06-23 00:06:42,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1352112.0, ans=0.2 2023-06-23 00:06:52,054 INFO [train.py:996] (0/4) Epoch 8, batch 11900, loss[loss=0.2416, simple_loss=0.3316, pruned_loss=0.07578, over 21617.00 frames. ], tot_loss[loss=0.244, simple_loss=0.322, pruned_loss=0.08297, over 4262661.90 frames. ], batch size: 441, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:06:54,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1352172.0, ans=0.125 2023-06-23 00:07:15,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1352232.0, ans=0.1 2023-06-23 00:07:22,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.16 vs. 
limit=15.0 2023-06-23 00:07:52,854 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-23 00:08:07,970 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.952e+02 4.101e+02 5.216e+02 6.925e+02 1.642e+03, threshold=1.043e+03, percent-clipped=1.0 2023-06-23 00:08:18,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2023-06-23 00:08:35,019 INFO [train.py:996] (0/4) Epoch 8, batch 11950, loss[loss=0.2627, simple_loss=0.3772, pruned_loss=0.07407, over 21194.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3226, pruned_loss=0.07983, over 4261670.64 frames. ], batch size: 548, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:09:43,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=22.5 2023-06-23 00:09:45,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1352652.0, ans=0.125 2023-06-23 00:09:57,325 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.70 vs. limit=5.0 2023-06-23 00:10:03,303 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-23 00:10:13,544 INFO [train.py:996] (0/4) Epoch 8, batch 12000, loss[loss=0.236, simple_loss=0.2967, pruned_loss=0.08771, over 21840.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3149, pruned_loss=0.07729, over 4252156.76 frames. ], batch size: 98, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:10:13,546 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 00:10:32,701 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2606, simple_loss=0.356, pruned_loss=0.08257, over 1796401.00 frames. 2023-06-23 00:10:32,702 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 00:10:42,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1352772.0, ans=0.125 2023-06-23 00:10:47,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1352832.0, ans=0.0 2023-06-23 00:11:39,453 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 4.103e+02 5.711e+02 8.012e+02 1.968e+03, threshold=1.142e+03, percent-clipped=13.0 2023-06-23 00:11:42,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1352952.0, ans=0.05 2023-06-23 00:12:11,472 INFO [train.py:996] (0/4) Epoch 8, batch 12050, loss[loss=0.2124, simple_loss=0.2764, pruned_loss=0.07418, over 21399.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3115, pruned_loss=0.07887, over 4259925.62 frames. ], batch size: 177, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:13:30,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1353312.0, ans=0.5 2023-06-23 00:13:53,234 INFO [train.py:996] (0/4) Epoch 8, batch 12100, loss[loss=0.2812, simple_loss=0.404, pruned_loss=0.07923, over 19748.00 frames. 
], tot_loss[loss=0.2433, simple_loss=0.3206, pruned_loss=0.08293, over 4260110.38 frames. ], batch size: 702, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:14:23,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1353432.0, ans=0.125 2023-06-23 00:14:29,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1353432.0, ans=0.125 2023-06-23 00:15:04,093 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.947e+02 5.144e+02 7.244e+02 1.095e+03 2.232e+03, threshold=1.449e+03, percent-clipped=22.0 2023-06-23 00:15:06,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1353552.0, ans=0.1 2023-06-23 00:15:09,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1353552.0, ans=0.125 2023-06-23 00:15:17,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1353552.0, ans=0.125 2023-06-23 00:15:17,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1353552.0, ans=0.0 2023-06-23 00:15:21,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1353612.0, ans=0.2 2023-06-23 00:15:45,920 INFO [train.py:996] (0/4) Epoch 8, batch 12150, loss[loss=0.2649, simple_loss=0.3659, pruned_loss=0.08191, over 21699.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3254, pruned_loss=0.08292, over 4262347.98 frames. ], batch size: 414, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:16:07,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1353732.0, ans=0.1 2023-06-23 00:17:10,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1353912.0, ans=0.125 2023-06-23 00:17:15,233 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2023-06-23 00:17:24,305 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 00:17:25,432 INFO [train.py:996] (0/4) Epoch 8, batch 12200, loss[loss=0.2256, simple_loss=0.2898, pruned_loss=0.08073, over 21803.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.321, pruned_loss=0.08273, over 4256019.76 frames. ], batch size: 352, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:17:54,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1354032.0, ans=0.0 2023-06-23 00:18:00,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1354092.0, ans=0.0 2023-06-23 00:18:13,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1354092.0, ans=0.2 2023-06-23 00:18:27,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.162e+02 4.597e+02 6.328e+02 9.392e+02 1.574e+03, threshold=1.266e+03, percent-clipped=2.0 2023-06-23 00:19:03,053 INFO [train.py:996] (0/4) Epoch 8, batch 12250, loss[loss=0.2389, simple_loss=0.3196, pruned_loss=0.07913, over 20742.00 frames. 
], tot_loss[loss=0.2354, simple_loss=0.3116, pruned_loss=0.07962, over 4249176.64 frames. ], batch size: 611, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:20:15,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1354452.0, ans=0.1 2023-06-23 00:20:30,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.69 vs. limit=6.0 2023-06-23 00:20:35,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1354512.0, ans=0.05 2023-06-23 00:20:41,602 INFO [train.py:996] (0/4) Epoch 8, batch 12300, loss[loss=0.2241, simple_loss=0.3108, pruned_loss=0.06865, over 19924.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3033, pruned_loss=0.0741, over 4246926.37 frames. ], batch size: 704, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:21:14,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1354632.0, ans=0.125 2023-06-23 00:21:41,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.472e+02 4.122e+02 6.339e+02 8.293e+02 1.636e+03, threshold=1.268e+03, percent-clipped=3.0 2023-06-23 00:22:22,503 INFO [train.py:996] (0/4) Epoch 8, batch 12350, loss[loss=0.2388, simple_loss=0.3688, pruned_loss=0.05435, over 19880.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3066, pruned_loss=0.07447, over 4247086.58 frames. ], batch size: 702, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:22:40,929 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 00:22:56,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1354932.0, ans=0.125 2023-06-23 00:23:48,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1355112.0, ans=0.05 2023-06-23 00:24:01,249 INFO [train.py:996] (0/4) Epoch 8, batch 12400, loss[loss=0.2717, simple_loss=0.3287, pruned_loss=0.1073, over 21883.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3098, pruned_loss=0.07804, over 4257131.44 frames. 
], batch size: 351, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:24:23,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1355232.0, ans=0.125 2023-06-23 00:24:33,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1355232.0, ans=0.2 2023-06-23 00:24:38,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1355292.0, ans=0.125 2023-06-23 00:25:07,647 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.244e+02 4.565e+02 7.015e+02 1.038e+03 2.241e+03, threshold=1.403e+03, percent-clipped=10.0 2023-06-23 00:25:08,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1355352.0, ans=0.125 2023-06-23 00:25:29,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1355412.0, ans=0.2 2023-06-23 00:25:32,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1355412.0, ans=0.0 2023-06-23 00:25:45,038 INFO [train.py:996] (0/4) Epoch 8, batch 12450, loss[loss=0.1908, simple_loss=0.231, pruned_loss=0.07525, over 20067.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3125, pruned_loss=0.08106, over 4262394.19 frames. ], batch size: 703, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:25:59,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0 2023-06-23 00:26:17,117 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.63 vs. limit=22.5 2023-06-23 00:26:48,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1355652.0, ans=0.1 2023-06-23 00:26:48,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1355652.0, ans=0.1 2023-06-23 00:27:27,324 INFO [train.py:996] (0/4) Epoch 8, batch 12500, loss[loss=0.2895, simple_loss=0.376, pruned_loss=0.1015, over 21615.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3221, pruned_loss=0.08388, over 4265577.10 frames. 
], batch size: 230, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:27:31,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1355772.0, ans=0.2 2023-06-23 00:27:39,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1355772.0, ans=0.125 2023-06-23 00:27:41,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1355772.0, ans=0.07 2023-06-23 00:27:42,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1355832.0, ans=0.125 2023-06-23 00:28:28,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1355892.0, ans=0.0 2023-06-23 00:28:44,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.178e+02 4.947e+02 7.092e+02 9.787e+02 2.648e+03, threshold=1.418e+03, percent-clipped=11.0 2023-06-23 00:28:48,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-06-23 00:29:01,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1356012.0, ans=0.125 2023-06-23 00:29:12,289 INFO [train.py:996] (0/4) Epoch 8, batch 12550, loss[loss=0.2625, simple_loss=0.3375, pruned_loss=0.09377, over 21837.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.329, pruned_loss=0.08792, over 4270794.09 frames. ], batch size: 124, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:29:25,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1356072.0, ans=0.0 2023-06-23 00:30:43,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1356312.0, ans=0.125 2023-06-23 00:30:46,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1356312.0, ans=0.0 2023-06-23 00:30:58,484 INFO [train.py:996] (0/4) Epoch 8, batch 12600, loss[loss=0.202, simple_loss=0.2929, pruned_loss=0.05557, over 21616.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3278, pruned_loss=0.08508, over 4272491.36 frames. ], batch size: 230, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:31:32,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1356432.0, ans=0.125 2023-06-23 00:31:40,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1356492.0, ans=0.125 2023-06-23 00:32:06,800 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 4.323e+02 5.935e+02 8.611e+02 2.067e+03, threshold=1.187e+03, percent-clipped=5.0 2023-06-23 00:32:18,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1356612.0, ans=0.2 2023-06-23 00:32:36,780 INFO [train.py:996] (0/4) Epoch 8, batch 12650, loss[loss=0.2318, simple_loss=0.299, pruned_loss=0.08236, over 21933.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3225, pruned_loss=0.08162, over 4273937.09 frames. 
], batch size: 316, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:33:58,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1356912.0, ans=0.1 2023-06-23 00:34:08,068 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 00:34:21,579 INFO [train.py:996] (0/4) Epoch 8, batch 12700, loss[loss=0.2578, simple_loss=0.3219, pruned_loss=0.09681, over 21844.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.322, pruned_loss=0.08436, over 4276599.70 frames. ], batch size: 247, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:34:30,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1356972.0, ans=0.125 2023-06-23 00:35:25,466 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.987e+02 4.487e+02 5.893e+02 8.124e+02 1.594e+03, threshold=1.179e+03, percent-clipped=3.0 2023-06-23 00:35:59,908 INFO [train.py:996] (0/4) Epoch 8, batch 12750, loss[loss=0.2425, simple_loss=0.3197, pruned_loss=0.08262, over 21694.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3228, pruned_loss=0.08398, over 4277023.31 frames. ], batch size: 263, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:36:00,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1357272.0, ans=0.125 2023-06-23 00:36:55,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1357392.0, ans=0.125 2023-06-23 00:36:56,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1357452.0, ans=0.0 2023-06-23 00:37:42,800 INFO [train.py:996] (0/4) Epoch 8, batch 12800, loss[loss=0.2483, simple_loss=0.3133, pruned_loss=0.09166, over 21756.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3211, pruned_loss=0.08389, over 4283561.42 frames. ], batch size: 112, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:38:42,082 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 4.784e+02 6.128e+02 7.998e+02 1.838e+03, threshold=1.226e+03, percent-clipped=10.0 2023-06-23 00:39:18,538 INFO [train.py:996] (0/4) Epoch 8, batch 12850, loss[loss=0.2227, simple_loss=0.3186, pruned_loss=0.06343, over 21887.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3223, pruned_loss=0.08535, over 4285414.66 frames. ], batch size: 316, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:39:34,716 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-06-23 00:41:03,896 INFO [train.py:996] (0/4) Epoch 8, batch 12900, loss[loss=0.2335, simple_loss=0.3125, pruned_loss=0.07725, over 21781.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3209, pruned_loss=0.08204, over 4287929.99 frames. ], batch size: 282, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:41:25,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1358232.0, ans=0.0 2023-06-23 00:41:29,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.03 vs. 
limit=15.0 2023-06-23 00:41:31,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-23 00:42:12,806 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.798e+02 4.070e+02 5.539e+02 8.932e+02 2.008e+03, threshold=1.108e+03, percent-clipped=7.0 2023-06-23 00:42:25,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-23 00:42:43,903 INFO [train.py:996] (0/4) Epoch 8, batch 12950, loss[loss=0.3021, simple_loss=0.4346, pruned_loss=0.08485, over 19736.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3229, pruned_loss=0.08043, over 4280017.84 frames. ], batch size: 702, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:43:02,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1358532.0, ans=0.0 2023-06-23 00:43:28,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1358592.0, ans=0.125 2023-06-23 00:43:51,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1358652.0, ans=0.0 2023-06-23 00:43:55,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1358652.0, ans=0.1 2023-06-23 00:44:19,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1358712.0, ans=0.125 2023-06-23 00:44:24,122 INFO [train.py:996] (0/4) Epoch 8, batch 13000, loss[loss=0.2399, simple_loss=0.3096, pruned_loss=0.08509, over 21369.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3251, pruned_loss=0.08211, over 4273239.76 frames. ], batch size: 211, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:44:31,155 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=15.0 2023-06-23 00:44:33,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1358772.0, ans=0.125 2023-06-23 00:45:15,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1358892.0, ans=0.125 2023-06-23 00:45:30,875 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.055e+02 4.812e+02 8.002e+02 1.036e+03 2.306e+03, threshold=1.600e+03, percent-clipped=23.0 2023-06-23 00:46:00,848 INFO [train.py:996] (0/4) Epoch 8, batch 13050, loss[loss=0.1922, simple_loss=0.2719, pruned_loss=0.05622, over 21284.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3175, pruned_loss=0.07886, over 4269691.57 frames. ], batch size: 159, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:47:24,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1359312.0, ans=0.2 2023-06-23 00:47:28,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1359312.0, ans=0.0 2023-06-23 00:47:39,019 INFO [train.py:996] (0/4) Epoch 8, batch 13100, loss[loss=0.2156, simple_loss=0.2881, pruned_loss=0.0715, over 21002.00 frames. 
], tot_loss[loss=0.2365, simple_loss=0.3161, pruned_loss=0.07842, over 4278034.83 frames. ], batch size: 608, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:47:43,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-23 00:48:25,320 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-23 00:48:28,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.23 vs. limit=22.5 2023-06-23 00:48:53,090 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.010e+02 4.152e+02 4.803e+02 6.425e+02 1.389e+03, threshold=9.605e+02, percent-clipped=0.0 2023-06-23 00:49:00,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1359552.0, ans=0.125 2023-06-23 00:49:08,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1359612.0, ans=0.2 2023-06-23 00:49:23,657 INFO [train.py:996] (0/4) Epoch 8, batch 13150, loss[loss=0.2765, simple_loss=0.3451, pruned_loss=0.104, over 21619.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3189, pruned_loss=0.08163, over 4271826.42 frames. ], batch size: 441, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:49:41,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1359672.0, ans=0.125 2023-06-23 00:49:53,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1359732.0, ans=0.0 2023-06-23 00:50:02,352 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-23 00:50:05,221 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-23 00:50:28,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1359852.0, ans=0.1 2023-06-23 00:50:46,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-23 00:51:08,221 INFO [train.py:996] (0/4) Epoch 8, batch 13200, loss[loss=0.1744, simple_loss=0.218, pruned_loss=0.0654, over 17468.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3172, pruned_loss=0.0817, over 4266125.02 frames. 
], batch size: 61, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:51:10,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1359972.0, ans=0.125 2023-06-23 00:51:45,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1360092.0, ans=0.125 2023-06-23 00:51:49,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1360092.0, ans=0.125 2023-06-23 00:52:13,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.899e+02 4.724e+02 6.289e+02 8.620e+02 1.453e+03, threshold=1.258e+03, percent-clipped=16.0 2023-06-23 00:52:19,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1360152.0, ans=22.5 2023-06-23 00:52:34,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1360212.0, ans=0.2 2023-06-23 00:52:41,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1360212.0, ans=0.125 2023-06-23 00:52:45,213 INFO [train.py:996] (0/4) Epoch 8, batch 13250, loss[loss=0.2373, simple_loss=0.3063, pruned_loss=0.08415, over 21881.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3165, pruned_loss=0.08331, over 4275871.38 frames. ], batch size: 316, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:53:05,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1360332.0, ans=0.125 2023-06-23 00:54:07,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1360452.0, ans=0.0 2023-06-23 00:54:31,050 INFO [train.py:996] (0/4) Epoch 8, batch 13300, loss[loss=0.2045, simple_loss=0.325, pruned_loss=0.04204, over 19765.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3199, pruned_loss=0.08271, over 4272478.28 frames. ], batch size: 702, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:54:39,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1360572.0, ans=0.2 2023-06-23 00:55:12,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1360692.0, ans=0.0 2023-06-23 00:55:41,045 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.129e+02 4.712e+02 5.675e+02 7.796e+02 1.493e+03, threshold=1.135e+03, percent-clipped=5.0 2023-06-23 00:55:55,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1360812.0, ans=0.1 2023-06-23 00:56:00,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=8.0 2023-06-23 00:56:11,989 INFO [train.py:996] (0/4) Epoch 8, batch 13350, loss[loss=0.2567, simple_loss=0.3463, pruned_loss=0.08359, over 21616.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3243, pruned_loss=0.08564, over 4276863.88 frames. 
], batch size: 389, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 00:56:51,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1360992.0, ans=0.0 2023-06-23 00:56:53,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-06-23 00:57:49,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1361112.0, ans=0.125 2023-06-23 00:57:57,204 INFO [train.py:996] (0/4) Epoch 8, batch 13400, loss[loss=0.2706, simple_loss=0.3343, pruned_loss=0.1034, over 21209.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.325, pruned_loss=0.08726, over 4277609.72 frames. ], batch size: 143, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 00:58:58,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=22.5 2023-06-23 00:59:05,830 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.334e+02 4.523e+02 5.899e+02 7.481e+02 1.405e+03, threshold=1.180e+03, percent-clipped=3.0 2023-06-23 00:59:06,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1361352.0, ans=0.125 2023-06-23 00:59:06,871 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-23 00:59:36,212 INFO [train.py:996] (0/4) Epoch 8, batch 13450, loss[loss=0.2532, simple_loss=0.3249, pruned_loss=0.09081, over 21719.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3261, pruned_loss=0.08967, over 4284903.58 frames. ], batch size: 124, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 00:59:50,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1361532.0, ans=0.1 2023-06-23 01:00:08,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1361532.0, ans=0.125 2023-06-23 01:00:43,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.99 vs. limit=8.0 2023-06-23 01:01:15,855 INFO [train.py:996] (0/4) Epoch 8, batch 13500, loss[loss=0.1899, simple_loss=0.2629, pruned_loss=0.0585, over 21748.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3159, pruned_loss=0.08605, over 4281571.39 frames. ], batch size: 282, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:01:20,143 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-23 01:01:21,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1361772.0, ans=0.125 2023-06-23 01:01:30,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.66 vs. 
limit=15.0 2023-06-23 01:01:54,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1361832.0, ans=0.125 2023-06-23 01:02:13,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1361892.0, ans=0.0 2023-06-23 01:02:35,459 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 4.391e+02 6.972e+02 1.115e+03 2.286e+03, threshold=1.394e+03, percent-clipped=24.0 2023-06-23 01:02:38,969 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:02:57,284 INFO [train.py:996] (0/4) Epoch 8, batch 13550, loss[loss=0.403, simple_loss=0.4732, pruned_loss=0.1664, over 21455.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3206, pruned_loss=0.08495, over 4275693.85 frames. ], batch size: 507, lr: 3.74e-03, grad_scale: 8.0 2023-06-23 01:03:04,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1362072.0, ans=0.125 2023-06-23 01:03:40,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1362192.0, ans=0.0 2023-06-23 01:04:21,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=15.0 2023-06-23 01:04:31,411 INFO [train.py:996] (0/4) Epoch 8, batch 13600, loss[loss=0.1978, simple_loss=0.2672, pruned_loss=0.06419, over 19985.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3214, pruned_loss=0.08532, over 4278842.46 frames. ], batch size: 703, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:05:00,790 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=22.5 2023-06-23 01:05:47,093 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.174e+02 4.391e+02 6.199e+02 8.558e+02 2.268e+03, threshold=1.240e+03, percent-clipped=7.0 2023-06-23 01:05:49,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1362552.0, ans=0.2 2023-06-23 01:06:01,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1362612.0, ans=0.1 2023-06-23 01:06:09,029 INFO [train.py:996] (0/4) Epoch 8, batch 13650, loss[loss=0.231, simple_loss=0.2892, pruned_loss=0.08635, over 21521.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3177, pruned_loss=0.08203, over 4281338.07 frames. ], batch size: 441, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:06:19,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.86 vs. 
limit=5.0 2023-06-23 01:06:25,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1362672.0, ans=0.125 2023-06-23 01:07:03,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1362792.0, ans=0.0 2023-06-23 01:07:07,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1362852.0, ans=0.0 2023-06-23 01:07:16,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1362852.0, ans=0.0 2023-06-23 01:07:43,708 INFO [train.py:996] (0/4) Epoch 8, batch 13700, loss[loss=0.16, simple_loss=0.2089, pruned_loss=0.05558, over 17377.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3136, pruned_loss=0.08237, over 4277447.80 frames. ], batch size: 66, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:08:16,581 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:08:26,364 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=10.0 2023-06-23 01:09:00,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.420e+02 5.296e+02 7.505e+02 1.141e+03 2.334e+03, threshold=1.501e+03, percent-clipped=22.0 2023-06-23 01:09:25,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363212.0, ans=0.1 2023-06-23 01:09:32,556 INFO [train.py:996] (0/4) Epoch 8, batch 13750, loss[loss=0.2616, simple_loss=0.3371, pruned_loss=0.09307, over 21657.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3137, pruned_loss=0.08235, over 4279276.44 frames. ], batch size: 414, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:10:02,577 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-23 01:10:20,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1363392.0, ans=0.125 2023-06-23 01:10:34,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1363452.0, ans=0.125 2023-06-23 01:10:51,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363452.0, ans=0.1 2023-06-23 01:11:08,548 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=12.0 2023-06-23 01:11:20,770 INFO [train.py:996] (0/4) Epoch 8, batch 13800, loss[loss=0.2278, simple_loss=0.3298, pruned_loss=0.06291, over 21664.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3175, pruned_loss=0.08164, over 4269158.32 frames. ], batch size: 247, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:11:29,641 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=22.5 2023-06-23 01:11:33,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.46 vs. 
limit=15.0 2023-06-23 01:11:51,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1363632.0, ans=0.5 2023-06-23 01:11:59,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1363692.0, ans=0.0 2023-06-23 01:12:20,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-23 01:12:37,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.245e+02 4.845e+02 7.419e+02 1.036e+03 2.562e+03, threshold=1.484e+03, percent-clipped=7.0 2023-06-23 01:13:00,693 INFO [train.py:996] (0/4) Epoch 8, batch 13850, loss[loss=0.2666, simple_loss=0.3823, pruned_loss=0.07542, over 20748.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3227, pruned_loss=0.08211, over 4263281.70 frames. ], batch size: 607, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:13:43,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1363992.0, ans=0.04949747468305833 2023-06-23 01:14:12,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1364052.0, ans=0.0 2023-06-23 01:14:26,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1364112.0, ans=0.125 2023-06-23 01:14:31,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1364112.0, ans=0.07 2023-06-23 01:14:33,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1364112.0, ans=0.0 2023-06-23 01:14:39,120 INFO [train.py:996] (0/4) Epoch 8, batch 13900, loss[loss=0.2651, simple_loss=0.3235, pruned_loss=0.1034, over 21825.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3261, pruned_loss=0.0851, over 4267521.36 frames. ], batch size: 247, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:15:14,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1364292.0, ans=0.2 2023-06-23 01:15:35,637 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.94 vs. limit=10.0 2023-06-23 01:15:36,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1364292.0, ans=0.1 2023-06-23 01:15:55,319 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.327e+02 4.266e+02 5.471e+02 7.768e+02 2.129e+03, threshold=1.094e+03, percent-clipped=1.0 2023-06-23 01:16:11,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-23 01:16:17,087 INFO [train.py:996] (0/4) Epoch 8, batch 13950, loss[loss=0.2326, simple_loss=0.308, pruned_loss=0.07863, over 21800.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3255, pruned_loss=0.08708, over 4281264.86 frames. 
], batch size: 298, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:16:41,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1364532.0, ans=0.2 2023-06-23 01:16:48,800 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.13 vs. limit=15.0 2023-06-23 01:17:18,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1364592.0, ans=0.2 2023-06-23 01:17:37,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.50 vs. limit=10.0 2023-06-23 01:17:53,963 INFO [train.py:996] (0/4) Epoch 8, batch 14000, loss[loss=0.1744, simple_loss=0.2476, pruned_loss=0.05061, over 21210.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3224, pruned_loss=0.08474, over 4281668.93 frames. ], batch size: 143, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 01:19:08,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.634e+02 4.329e+02 5.834e+02 8.040e+02 1.947e+03, threshold=1.167e+03, percent-clipped=14.0 2023-06-23 01:19:11,730 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-23 01:19:30,104 INFO [train.py:996] (0/4) Epoch 8, batch 14050, loss[loss=0.2214, simple_loss=0.2764, pruned_loss=0.08316, over 21546.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3151, pruned_loss=0.08055, over 4271363.11 frames. ], batch size: 132, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 01:19:40,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1365072.0, ans=0.2 2023-06-23 01:19:43,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1365072.0, ans=0.125 2023-06-23 01:20:53,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1365312.0, ans=0.125 2023-06-23 01:21:12,117 INFO [train.py:996] (0/4) Epoch 8, batch 14100, loss[loss=0.257, simple_loss=0.3289, pruned_loss=0.09253, over 21931.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3098, pruned_loss=0.08033, over 4261818.49 frames. ], batch size: 372, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:21:50,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1365432.0, ans=0.1 2023-06-23 01:21:53,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1365492.0, ans=0.125 2023-06-23 01:21:53,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-23 01:22:20,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1365552.0, ans=0.125 2023-06-23 01:22:22,548 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.29 vs. 
limit=12.0 2023-06-23 01:22:24,447 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 4.771e+02 6.447e+02 8.696e+02 1.773e+03, threshold=1.289e+03, percent-clipped=8.0 2023-06-23 01:22:25,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1365552.0, ans=0.07 2023-06-23 01:22:33,544 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-23 01:22:34,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1365612.0, ans=0.2 2023-06-23 01:22:43,446 INFO [train.py:996] (0/4) Epoch 8, batch 14150, loss[loss=0.2551, simple_loss=0.3878, pruned_loss=0.06114, over 20802.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3141, pruned_loss=0.08133, over 4256239.71 frames. ], batch size: 607, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:23:02,016 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-23 01:23:25,361 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.89 vs. limit=22.5 2023-06-23 01:24:19,881 INFO [train.py:996] (0/4) Epoch 8, batch 14200, loss[loss=0.2374, simple_loss=0.3007, pruned_loss=0.08707, over 21685.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.312, pruned_loss=0.07921, over 4260800.83 frames. ], batch size: 332, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:24:26,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-23 01:24:53,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1366032.0, ans=0.2 2023-06-23 01:25:16,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1366092.0, ans=0.0 2023-06-23 01:25:19,098 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=22.5 2023-06-23 01:25:27,657 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:25:32,009 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.024e+02 4.323e+02 5.337e+02 8.028e+02 2.442e+03, threshold=1.067e+03, percent-clipped=5.0 2023-06-23 01:25:49,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1366212.0, ans=0.0 2023-06-23 01:25:57,623 INFO [train.py:996] (0/4) Epoch 8, batch 14250, loss[loss=0.2269, simple_loss=0.2906, pruned_loss=0.08156, over 21766.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3084, pruned_loss=0.0797, over 4252036.97 frames. ], batch size: 112, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:25:58,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1366272.0, ans=0.125 2023-06-23 01:25:58,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. 
limit=15.0 2023-06-23 01:27:06,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=22.5 2023-06-23 01:27:35,851 INFO [train.py:996] (0/4) Epoch 8, batch 14300, loss[loss=0.2972, simple_loss=0.4057, pruned_loss=0.09441, over 21240.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3099, pruned_loss=0.07919, over 4252025.80 frames. ], batch size: 549, lr: 3.74e-03, grad_scale: 8.0 2023-06-23 01:27:49,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1366572.0, ans=0.02 2023-06-23 01:28:20,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1366692.0, ans=0.125 2023-06-23 01:28:54,523 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.931e+02 4.415e+02 6.422e+02 1.030e+03 2.040e+03, threshold=1.284e+03, percent-clipped=23.0 2023-06-23 01:29:13,131 INFO [train.py:996] (0/4) Epoch 8, batch 14350, loss[loss=0.2709, simple_loss=0.3442, pruned_loss=0.09886, over 21731.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3174, pruned_loss=0.081, over 4253915.25 frames. ], batch size: 389, lr: 3.74e-03, grad_scale: 8.0 2023-06-23 01:30:47,777 INFO [train.py:996] (0/4) Epoch 8, batch 14400, loss[loss=0.2461, simple_loss=0.2996, pruned_loss=0.09625, over 21571.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.315, pruned_loss=0.08145, over 4259006.40 frames. ], batch size: 441, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:31:56,163 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.112e+02 4.154e+02 4.970e+02 6.969e+02 1.897e+03, threshold=9.939e+02, percent-clipped=6.0 2023-06-23 01:32:19,343 INFO [train.py:996] (0/4) Epoch 8, batch 14450, loss[loss=0.2479, simple_loss=0.3032, pruned_loss=0.09626, over 21800.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3098, pruned_loss=0.08178, over 4249519.43 frames. ], batch size: 351, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:32:26,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1367472.0, ans=0.1 2023-06-23 01:32:34,888 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-23 01:33:46,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1367712.0, ans=0.125 2023-06-23 01:33:51,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1367712.0, ans=0.1 2023-06-23 01:33:54,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1367712.0, ans=0.125 2023-06-23 01:34:04,047 INFO [train.py:996] (0/4) Epoch 8, batch 14500, loss[loss=0.2999, simple_loss=0.3678, pruned_loss=0.1161, over 21413.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3056, pruned_loss=0.08165, over 4246631.73 frames. ], batch size: 471, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:34:19,456 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.47 vs. 
limit=12.0 2023-06-23 01:34:56,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-23 01:34:57,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1367892.0, ans=0.125 2023-06-23 01:35:09,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1367952.0, ans=0.125 2023-06-23 01:35:18,883 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-228000.pt 2023-06-23 01:35:21,292 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.784e+02 6.137e+02 8.722e+02 1.642e+03, threshold=1.227e+03, percent-clipped=18.0 2023-06-23 01:35:21,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1367952.0, ans=10.0 2023-06-23 01:35:45,110 INFO [train.py:996] (0/4) Epoch 8, batch 14550, loss[loss=0.2665, simple_loss=0.3394, pruned_loss=0.09678, over 21268.00 frames. ], tot_loss[loss=0.238, simple_loss=0.31, pruned_loss=0.08307, over 4249890.22 frames. ], batch size: 176, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:35:47,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1368072.0, ans=0.125 2023-06-23 01:35:58,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1368072.0, ans=0.2 2023-06-23 01:36:31,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1368192.0, ans=0.1 2023-06-23 01:37:23,765 INFO [train.py:996] (0/4) Epoch 8, batch 14600, loss[loss=0.2415, simple_loss=0.3301, pruned_loss=0.07648, over 21394.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3167, pruned_loss=0.08573, over 4262576.62 frames. ], batch size: 211, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:37:29,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-23 01:38:16,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1368492.0, ans=0.125 2023-06-23 01:38:38,634 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.403e+02 4.393e+02 5.466e+02 7.760e+02 1.223e+03, threshold=1.093e+03, percent-clipped=0.0 2023-06-23 01:38:53,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1368612.0, ans=0.125 2023-06-23 01:39:00,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-06-23 01:39:01,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1368672.0, ans=0.125 2023-06-23 01:39:02,832 INFO [train.py:996] (0/4) Epoch 8, batch 14650, loss[loss=0.2591, simple_loss=0.3411, pruned_loss=0.0885, over 21658.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3189, pruned_loss=0.08469, over 4259789.83 frames. 
], batch size: 389, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:39:04,877 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:39:49,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-23 01:40:41,949 INFO [train.py:996] (0/4) Epoch 8, batch 14700, loss[loss=0.3034, simple_loss=0.3906, pruned_loss=0.1081, over 21517.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3125, pruned_loss=0.07918, over 4250739.76 frames. ], batch size: 508, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:41:16,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1369032.0, ans=0.0 2023-06-23 01:41:16,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1369032.0, ans=0.125 2023-06-23 01:41:22,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1369092.0, ans=0.125 2023-06-23 01:41:40,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1369152.0, ans=0.0 2023-06-23 01:42:00,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 5.470e+02 7.461e+02 1.083e+03 1.858e+03, threshold=1.492e+03, percent-clipped=24.0 2023-06-23 01:42:01,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1369212.0, ans=0.125 2023-06-23 01:42:07,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1369212.0, ans=0.015 2023-06-23 01:42:11,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2023-06-23 01:42:18,597 INFO [train.py:996] (0/4) Epoch 8, batch 14750, loss[loss=0.2351, simple_loss=0.3164, pruned_loss=0.0769, over 21623.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3172, pruned_loss=0.08198, over 4251325.06 frames. ], batch size: 230, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:42:43,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1369332.0, ans=0.025 2023-06-23 01:42:48,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1369332.0, ans=0.0 2023-06-23 01:43:08,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1369392.0, ans=0.125 2023-06-23 01:43:08,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1369392.0, ans=0.1 2023-06-23 01:44:00,047 INFO [train.py:996] (0/4) Epoch 8, batch 14800, loss[loss=0.2106, simple_loss=0.2829, pruned_loss=0.0691, over 21629.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3277, pruned_loss=0.08653, over 4250290.96 frames. ], batch size: 298, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:44:06,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.36 vs. 
limit=15.0 2023-06-23 01:44:41,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1369692.0, ans=0.2 2023-06-23 01:45:18,469 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.582e+02 5.734e+02 8.027e+02 1.112e+03 2.200e+03, threshold=1.605e+03, percent-clipped=5.0 2023-06-23 01:45:25,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1369812.0, ans=0.125 2023-06-23 01:45:41,520 INFO [train.py:996] (0/4) Epoch 8, batch 14850, loss[loss=0.2071, simple_loss=0.2739, pruned_loss=0.0702, over 21422.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3233, pruned_loss=0.08708, over 4255865.56 frames. ], batch size: 211, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:46:01,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1369932.0, ans=0.125 2023-06-23 01:46:33,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1369992.0, ans=0.5 2023-06-23 01:47:23,701 INFO [train.py:996] (0/4) Epoch 8, batch 14900, loss[loss=0.3988, simple_loss=0.4513, pruned_loss=0.1732, over 21429.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3263, pruned_loss=0.08957, over 4259162.62 frames. ], batch size: 507, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:47:32,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1370172.0, ans=0.04949747468305833 2023-06-23 01:48:03,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1370292.0, ans=0.0 2023-06-23 01:48:30,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-23 01:48:41,946 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.419e+02 4.642e+02 5.851e+02 8.319e+02 1.860e+03, threshold=1.170e+03, percent-clipped=1.0 2023-06-23 01:48:57,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1370412.0, ans=0.125 2023-06-23 01:49:04,627 INFO [train.py:996] (0/4) Epoch 8, batch 14950, loss[loss=0.232, simple_loss=0.3267, pruned_loss=0.0687, over 21263.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3273, pruned_loss=0.08962, over 4262635.15 frames. ], batch size: 549, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:49:26,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1370532.0, ans=0.125 2023-06-23 01:50:27,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1370712.0, ans=0.2 2023-06-23 01:50:40,434 INFO [train.py:996] (0/4) Epoch 8, batch 15000, loss[loss=0.2458, simple_loss=0.3176, pruned_loss=0.08695, over 21822.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3283, pruned_loss=0.09017, over 4265190.06 frames. ], batch size: 351, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:50:40,435 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 01:51:00,727 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2539, simple_loss=0.3505, pruned_loss=0.07863, over 1796401.00 frames. 
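The validation record above reports the three loss components tracked throughout this run: simple_loss (from the simplified joiner used in the pruned-transducer recipe to obtain pruning bounds), pruned_loss (the full-joiner loss evaluated on the pruned symbol ranges), and loss, their combination. The logged figures here and in the per-batch records are consistent with loss = 0.5 * simple_loss + pruned_loss (e.g. 0.5 * 0.3505 + 0.07863 = 0.2539), though the exact weighting and any warm-up behaviour should be confirmed against train.py rather than read off the log. The reported averages are frame-weighted over the whole dev set (1796401 frames). A minimal sketch of such a frame-weighted validation pass is given below; compute_loss and dev_dl are placeholder names assumed for illustration, not the exact icefall API.

import torch

@torch.no_grad()
def run_validation(model, dev_dl, compute_loss):
    """Frame-weighted validation sketch (assumed helpers, not the recipe's exact code):
    accumulate each loss component over the dev loader, then divide by total frames."""
    model.eval()
    totals = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}
    total_frames = 0.0
    for batch in dev_dl:
        # compute_loss is an assumed helper returning per-batch loss sums and a frame count
        losses, num_frames = compute_loss(model, batch)
        for k in totals:
            totals[k] += float(losses[k])
        total_frames += float(num_frames)
    model.train()
    # Produces averages of the form seen in the record above, e.g.
    # "loss=0.2539, simple_loss=0.3505, pruned_loss=0.07863, over 1796401.00 frames."
    return {k: v / total_frames for k, v in totals.items()}, total_frames
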
2023-06-23 01:51:00,728 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 01:51:25,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-23 01:51:31,684 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:51:31,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1370832.0, ans=0.2 2023-06-23 01:51:46,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. limit=10.0 2023-06-23 01:52:10,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1370952.0, ans=0.125 2023-06-23 01:52:21,550 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.233e+02 4.412e+02 5.546e+02 7.158e+02 1.443e+03, threshold=1.109e+03, percent-clipped=4.0 2023-06-23 01:52:43,844 INFO [train.py:996] (0/4) Epoch 8, batch 15050, loss[loss=0.2548, simple_loss=0.3512, pruned_loss=0.07923, over 20764.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3287, pruned_loss=0.09024, over 4260084.15 frames. ], batch size: 608, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:52:44,968 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-06-23 01:52:50,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1371072.0, ans=0.1 2023-06-23 01:53:44,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1371192.0, ans=0.2 2023-06-23 01:53:58,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1371252.0, ans=0.125 2023-06-23 01:53:59,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1371252.0, ans=0.125 2023-06-23 01:54:23,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1371372.0, ans=0.125 2023-06-23 01:54:29,750 INFO [train.py:996] (0/4) Epoch 8, batch 15100, loss[loss=0.253, simple_loss=0.3252, pruned_loss=0.09043, over 21827.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3332, pruned_loss=0.09017, over 4267760.29 frames. ], batch size: 282, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:54:51,826 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.25 vs. 
limit=15.0 2023-06-23 01:54:53,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1371432.0, ans=0.1 2023-06-23 01:54:54,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1371432.0, ans=0.1 2023-06-23 01:55:15,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1371492.0, ans=0.125 2023-06-23 01:55:25,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-23 01:55:34,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1371552.0, ans=0.1 2023-06-23 01:55:35,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1371552.0, ans=0.125 2023-06-23 01:55:50,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.664e+02 5.163e+02 6.983e+02 1.038e+03 2.377e+03, threshold=1.397e+03, percent-clipped=16.0 2023-06-23 01:55:50,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1371612.0, ans=0.05 2023-06-23 01:56:13,548 INFO [train.py:996] (0/4) Epoch 8, batch 15150, loss[loss=0.2092, simple_loss=0.2659, pruned_loss=0.07618, over 21564.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3291, pruned_loss=0.09011, over 4254171.27 frames. ], batch size: 231, lr: 3.73e-03, grad_scale: 4.0 2023-06-23 01:56:24,110 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:57:42,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1371912.0, ans=0.0 2023-06-23 01:57:48,579 INFO [train.py:996] (0/4) Epoch 8, batch 15200, loss[loss=0.1921, simple_loss=0.2743, pruned_loss=0.05491, over 21593.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3204, pruned_loss=0.08563, over 4258913.98 frames. ], batch size: 263, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:59:10,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.032e+02 4.454e+02 6.348e+02 1.099e+03 2.249e+03, threshold=1.270e+03, percent-clipped=13.0 2023-06-23 01:59:23,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1372212.0, ans=0.125 2023-06-23 01:59:29,128 INFO [train.py:996] (0/4) Epoch 8, batch 15250, loss[loss=0.2775, simple_loss=0.3742, pruned_loss=0.09039, over 19646.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3149, pruned_loss=0.08424, over 4247411.19 frames. 
], batch size: 703, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:59:34,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1372272.0, ans=0.125 2023-06-23 01:59:36,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1372272.0, ans=0.05 2023-06-23 02:00:07,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1372392.0, ans=0.1 2023-06-23 02:00:09,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=1372392.0, ans=15.0 2023-06-23 02:00:29,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1372452.0, ans=0.125 2023-06-23 02:01:03,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1372512.0, ans=0.125 2023-06-23 02:01:09,089 INFO [train.py:996] (0/4) Epoch 8, batch 15300, loss[loss=0.2912, simple_loss=0.3748, pruned_loss=0.1038, over 17834.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3181, pruned_loss=0.08712, over 4255269.83 frames. ], batch size: 60, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:01:09,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1372572.0, ans=0.04949747468305833 2023-06-23 02:01:31,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1372632.0, ans=0.1 2023-06-23 02:01:54,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1372692.0, ans=0.1 2023-06-23 02:02:34,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.584e+02 4.847e+02 6.325e+02 8.051e+02 1.474e+03, threshold=1.265e+03, percent-clipped=2.0 2023-06-23 02:02:48,600 INFO [train.py:996] (0/4) Epoch 8, batch 15350, loss[loss=0.2302, simple_loss=0.3327, pruned_loss=0.0639, over 21810.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3242, pruned_loss=0.08986, over 4253451.36 frames. ], batch size: 282, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:03:59,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1373052.0, ans=0.0 2023-06-23 02:04:27,066 INFO [train.py:996] (0/4) Epoch 8, batch 15400, loss[loss=0.2482, simple_loss=0.3275, pruned_loss=0.08442, over 21932.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3265, pruned_loss=0.0882, over 4259937.23 frames. 
], batch size: 118, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:04:46,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1373232.0, ans=0.125 2023-06-23 02:05:10,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1373292.0, ans=0.125 2023-06-23 02:05:36,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1373352.0, ans=0.1 2023-06-23 02:05:42,579 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.174e+02 4.574e+02 6.918e+02 9.277e+02 1.952e+03, threshold=1.384e+03, percent-clipped=9.0 2023-06-23 02:06:06,524 INFO [train.py:996] (0/4) Epoch 8, batch 15450, loss[loss=0.2511, simple_loss=0.3319, pruned_loss=0.08511, over 20703.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3231, pruned_loss=0.08715, over 4262235.35 frames. ], batch size: 607, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:06:54,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1373592.0, ans=0.125 2023-06-23 02:07:11,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1373652.0, ans=0.015 2023-06-23 02:07:11,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1373652.0, ans=0.1 2023-06-23 02:07:33,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1373712.0, ans=0.125 2023-06-23 02:07:43,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1373712.0, ans=0.0 2023-06-23 02:07:47,619 INFO [train.py:996] (0/4) Epoch 8, batch 15500, loss[loss=0.2182, simple_loss=0.3007, pruned_loss=0.06789, over 15701.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.324, pruned_loss=0.08706, over 4259256.48 frames. ], batch size: 60, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:07:53,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1373772.0, ans=15.0 2023-06-23 02:09:14,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.796e+02 4.547e+02 5.557e+02 7.236e+02 1.680e+03, threshold=1.111e+03, percent-clipped=1.0 2023-06-23 02:09:27,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1374072.0, ans=0.125 2023-06-23 02:09:28,953 INFO [train.py:996] (0/4) Epoch 8, batch 15550, loss[loss=0.2325, simple_loss=0.3104, pruned_loss=0.07725, over 21781.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3206, pruned_loss=0.08462, over 4249389.70 frames. ], batch size: 371, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:09:40,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1374072.0, ans=0.0 2023-06-23 02:11:06,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1374372.0, ans=0.0 2023-06-23 02:11:07,705 INFO [train.py:996] (0/4) Epoch 8, batch 15600, loss[loss=0.2314, simple_loss=0.3077, pruned_loss=0.07761, over 21767.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3139, pruned_loss=0.08281, over 4244927.34 frames. 
], batch size: 371, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:11:08,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-23 02:11:23,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1374372.0, ans=0.125 2023-06-23 02:11:59,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1374492.0, ans=0.0 2023-06-23 02:12:32,168 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-23 02:12:32,608 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.200e+02 4.456e+02 5.982e+02 8.275e+02 1.817e+03, threshold=1.196e+03, percent-clipped=9.0 2023-06-23 02:12:46,689 INFO [train.py:996] (0/4) Epoch 8, batch 15650, loss[loss=0.1949, simple_loss=0.2588, pruned_loss=0.06555, over 21382.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3128, pruned_loss=0.08201, over 4249514.11 frames. ], batch size: 211, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:13:01,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1374672.0, ans=0.125 2023-06-23 02:13:59,879 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:14:31,072 INFO [train.py:996] (0/4) Epoch 8, batch 15700, loss[loss=0.2304, simple_loss=0.296, pruned_loss=0.08242, over 22048.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3097, pruned_loss=0.08173, over 4247129.51 frames. ], batch size: 103, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:14:57,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1375032.0, ans=0.0 2023-06-23 02:15:07,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1375092.0, ans=0.125 2023-06-23 02:15:08,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1375092.0, ans=0.1 2023-06-23 02:15:45,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1375152.0, ans=0.1 2023-06-23 02:15:50,699 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.000e+02 4.439e+02 5.550e+02 6.958e+02 1.356e+03, threshold=1.110e+03, percent-clipped=1.0 2023-06-23 02:16:04,710 INFO [train.py:996] (0/4) Epoch 8, batch 15750, loss[loss=0.1834, simple_loss=0.2485, pruned_loss=0.05917, over 21263.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3051, pruned_loss=0.08112, over 4254565.27 frames. ], batch size: 211, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:16:50,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1375392.0, ans=0.1 2023-06-23 02:17:10,316 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.29 vs. 
limit=12.0 2023-06-23 02:17:20,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1375452.0, ans=0.2 2023-06-23 02:17:27,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-06-23 02:17:30,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1375512.0, ans=0.0 2023-06-23 02:17:49,028 INFO [train.py:996] (0/4) Epoch 8, batch 15800, loss[loss=0.23, simple_loss=0.2884, pruned_loss=0.08584, over 21772.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3028, pruned_loss=0.08097, over 4253477.17 frames. ], batch size: 371, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:17:54,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1375572.0, ans=0.1 2023-06-23 02:18:07,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1375632.0, ans=0.1 2023-06-23 02:18:39,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1375692.0, ans=0.1 2023-06-23 02:18:47,112 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-23 02:19:04,323 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-23 02:19:04,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 4.739e+02 6.417e+02 1.005e+03 2.218e+03, threshold=1.283e+03, percent-clipped=19.0 2023-06-23 02:19:24,016 INFO [train.py:996] (0/4) Epoch 8, batch 15850, loss[loss=0.2351, simple_loss=0.3047, pruned_loss=0.08272, over 21714.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3047, pruned_loss=0.083, over 4260702.48 frames. ], batch size: 332, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:20:02,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1375992.0, ans=0.125 2023-06-23 02:20:37,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1376052.0, ans=0.125 2023-06-23 02:20:59,238 INFO [train.py:996] (0/4) Epoch 8, batch 15900, loss[loss=0.2159, simple_loss=0.2722, pruned_loss=0.07984, over 21616.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3047, pruned_loss=0.08322, over 4263504.12 frames. ], batch size: 263, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:21:07,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1376172.0, ans=0.0 2023-06-23 02:21:47,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1376292.0, ans=0.125 2023-06-23 02:22:24,067 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 4.482e+02 6.671e+02 9.137e+02 1.402e+03, threshold=1.334e+03, percent-clipped=2.0 2023-06-23 02:22:38,534 INFO [train.py:996] (0/4) Epoch 8, batch 15950, loss[loss=0.1912, simple_loss=0.2698, pruned_loss=0.05635, over 21336.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3058, pruned_loss=0.08187, over 4253774.81 frames. 
], batch size: 194, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:22:57,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1376472.0, ans=0.125 2023-06-23 02:23:21,260 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:23:32,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1376592.0, ans=0.125 2023-06-23 02:23:46,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1376652.0, ans=0.125 2023-06-23 02:23:56,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1376712.0, ans=22.5 2023-06-23 02:23:58,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1376712.0, ans=0.125 2023-06-23 02:24:13,833 INFO [train.py:996] (0/4) Epoch 8, batch 16000, loss[loss=0.2055, simple_loss=0.2841, pruned_loss=0.06344, over 21180.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3067, pruned_loss=0.0799, over 4251854.40 frames. ], batch size: 159, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:24:30,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1376772.0, ans=0.0 2023-06-23 02:24:38,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1376832.0, ans=0.07 2023-06-23 02:25:01,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1376892.0, ans=0.1 2023-06-23 02:25:03,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1376892.0, ans=10.0 2023-06-23 02:25:39,168 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.869e+02 4.296e+02 5.678e+02 9.703e+02 1.741e+03, threshold=1.136e+03, percent-clipped=11.0 2023-06-23 02:25:43,677 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-23 02:25:49,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1377012.0, ans=0.125 2023-06-23 02:25:51,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-23 02:25:53,874 INFO [train.py:996] (0/4) Epoch 8, batch 16050, loss[loss=0.2156, simple_loss=0.3131, pruned_loss=0.05904, over 20749.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3089, pruned_loss=0.07824, over 4259194.64 frames. ], batch size: 607, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:25:54,783 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.63 vs. 
limit=12.0 2023-06-23 02:25:57,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1377072.0, ans=0.2 2023-06-23 02:26:33,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1377132.0, ans=0.0 2023-06-23 02:26:48,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1377192.0, ans=0.125 2023-06-23 02:26:48,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1377192.0, ans=0.0 2023-06-23 02:27:04,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1377252.0, ans=0.07 2023-06-23 02:27:32,653 INFO [train.py:996] (0/4) Epoch 8, batch 16100, loss[loss=0.2816, simple_loss=0.3731, pruned_loss=0.09507, over 21736.00 frames. ], tot_loss[loss=0.235, simple_loss=0.312, pruned_loss=0.07902, over 4265713.22 frames. ], batch size: 298, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:28:10,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1377432.0, ans=0.125 2023-06-23 02:28:21,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1377492.0, ans=0.0 2023-06-23 02:28:58,833 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.354e+02 5.010e+02 6.146e+02 8.242e+02 2.299e+03, threshold=1.229e+03, percent-clipped=9.0 2023-06-23 02:29:12,571 INFO [train.py:996] (0/4) Epoch 8, batch 16150, loss[loss=0.262, simple_loss=0.3197, pruned_loss=0.1022, over 21754.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3099, pruned_loss=0.08074, over 4282503.26 frames. ], batch size: 389, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:29:26,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1377672.0, ans=0.1 2023-06-23 02:29:59,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1377792.0, ans=0.1 2023-06-23 02:29:59,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-23 02:30:26,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1377852.0, ans=0.125 2023-06-23 02:30:53,200 INFO [train.py:996] (0/4) Epoch 8, batch 16200, loss[loss=0.2705, simple_loss=0.3413, pruned_loss=0.09992, over 21405.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3137, pruned_loss=0.08276, over 4286523.74 frames. ], batch size: 211, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:31:23,685 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0 2023-06-23 02:31:32,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1378092.0, ans=0.1 2023-06-23 02:31:34,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. 
limit=15.0 2023-06-23 02:31:56,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1378152.0, ans=0.0 2023-06-23 02:32:05,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1378152.0, ans=0.125 2023-06-23 02:32:14,995 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.293e+02 4.981e+02 6.694e+02 1.065e+03 1.723e+03, threshold=1.339e+03, percent-clipped=15.0 2023-06-23 02:32:27,745 INFO [train.py:996] (0/4) Epoch 8, batch 16250, loss[loss=0.2374, simple_loss=0.3203, pruned_loss=0.07723, over 21742.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3165, pruned_loss=0.08436, over 4278522.31 frames. ], batch size: 298, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:34:06,541 INFO [train.py:996] (0/4) Epoch 8, batch 16300, loss[loss=0.2005, simple_loss=0.2917, pruned_loss=0.05468, over 21734.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3094, pruned_loss=0.07972, over 4262500.39 frames. ], batch size: 332, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:34:19,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1378572.0, ans=0.0 2023-06-23 02:34:26,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-23 02:34:31,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1378572.0, ans=0.0 2023-06-23 02:34:39,821 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0 2023-06-23 02:35:35,292 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.087e+02 4.179e+02 5.648e+02 7.276e+02 1.488e+03, threshold=1.130e+03, percent-clipped=3.0 2023-06-23 02:35:53,670 INFO [train.py:996] (0/4) Epoch 8, batch 16350, loss[loss=0.2501, simple_loss=0.3255, pruned_loss=0.08739, over 21996.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.31, pruned_loss=0.07961, over 4261212.34 frames. ], batch size: 317, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:36:06,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1378872.0, ans=0.07 2023-06-23 02:37:12,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1379112.0, ans=0.0 2023-06-23 02:37:31,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1379112.0, ans=6.0 2023-06-23 02:37:32,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-23 02:37:33,168 INFO [train.py:996] (0/4) Epoch 8, batch 16400, loss[loss=0.2368, simple_loss=0.3087, pruned_loss=0.0824, over 21826.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3165, pruned_loss=0.08185, over 4262096.80 frames. 
], batch size: 298, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:38:22,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1379292.0, ans=0.125 2023-06-23 02:38:31,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1379352.0, ans=0.0 2023-06-23 02:38:44,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1379412.0, ans=0.125 2023-06-23 02:38:55,318 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.176e+02 5.676e+02 8.447e+02 1.118e+03 2.154e+03, threshold=1.689e+03, percent-clipped=24.0 2023-06-23 02:39:11,000 INFO [train.py:996] (0/4) Epoch 8, batch 16450, loss[loss=0.2108, simple_loss=0.2843, pruned_loss=0.06867, over 21695.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3156, pruned_loss=0.08259, over 4272038.08 frames. ], batch size: 230, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:39:13,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1379472.0, ans=0.1 2023-06-23 02:40:50,713 INFO [train.py:996] (0/4) Epoch 8, batch 16500, loss[loss=0.1996, simple_loss=0.2592, pruned_loss=0.06999, over 21327.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3129, pruned_loss=0.08197, over 4273126.02 frames. ], batch size: 176, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:41:23,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1379832.0, ans=0.05 2023-06-23 02:42:20,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.553e+02 5.237e+02 7.941e+02 1.285e+03 2.739e+03, threshold=1.588e+03, percent-clipped=14.0 2023-06-23 02:42:24,517 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2023-06-23 02:42:36,181 INFO [train.py:996] (0/4) Epoch 8, batch 16550, loss[loss=0.2347, simple_loss=0.3143, pruned_loss=0.07755, over 21719.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3099, pruned_loss=0.07964, over 4264948.03 frames. ], batch size: 298, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:42:46,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1380072.0, ans=0.125 2023-06-23 02:42:47,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1380072.0, ans=0.2 2023-06-23 02:43:49,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1380252.0, ans=0.125 2023-06-23 02:44:21,933 INFO [train.py:996] (0/4) Epoch 8, batch 16600, loss[loss=0.2859, simple_loss=0.3849, pruned_loss=0.09342, over 21643.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3185, pruned_loss=0.0832, over 4269301.85 frames. ], batch size: 389, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:44:23,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1380372.0, ans=0.1 2023-06-23 02:45:00,599 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. 
limit=6.0 2023-06-23 02:45:35,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1380552.0, ans=0.125 2023-06-23 02:45:52,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.266e+02 4.965e+02 7.305e+02 1.134e+03 2.257e+03, threshold=1.461e+03, percent-clipped=8.0 2023-06-23 02:46:03,742 INFO [train.py:996] (0/4) Epoch 8, batch 16650, loss[loss=0.235, simple_loss=0.2918, pruned_loss=0.08913, over 20071.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3262, pruned_loss=0.08563, over 4267652.36 frames. ], batch size: 703, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:46:10,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1380672.0, ans=15.0 2023-06-23 02:46:42,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1380732.0, ans=0.125 2023-06-23 02:46:42,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1380732.0, ans=0.125 2023-06-23 02:47:31,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1380912.0, ans=0.125 2023-06-23 02:47:45,730 INFO [train.py:996] (0/4) Epoch 8, batch 16700, loss[loss=0.2149, simple_loss=0.2757, pruned_loss=0.07708, over 21511.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3258, pruned_loss=0.08573, over 4268343.68 frames. ], batch size: 211, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:48:46,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1381092.0, ans=10.0 2023-06-23 02:48:52,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1381152.0, ans=0.1 2023-06-23 02:49:08,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1381152.0, ans=0.125 2023-06-23 02:49:21,416 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 4.736e+02 6.147e+02 8.645e+02 1.656e+03, threshold=1.229e+03, percent-clipped=2.0 2023-06-23 02:49:38,554 INFO [train.py:996] (0/4) Epoch 8, batch 16750, loss[loss=0.2219, simple_loss=0.2804, pruned_loss=0.0817, over 20027.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3289, pruned_loss=0.08837, over 4270362.17 frames. ], batch size: 703, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:50:48,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-06-23 02:50:55,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1381452.0, ans=0.1 2023-06-23 02:51:14,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-23 02:51:25,395 INFO [train.py:996] (0/4) Epoch 8, batch 16800, loss[loss=0.23, simple_loss=0.2891, pruned_loss=0.08549, over 21615.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3337, pruned_loss=0.08825, over 4266932.17 frames. 
], batch size: 212, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:51:25,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1381572.0, ans=0.125 2023-06-23 02:51:36,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1381572.0, ans=0.04949747468305833 2023-06-23 02:51:56,543 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-23 02:52:07,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1381692.0, ans=0.125 2023-06-23 02:52:28,659 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.64 vs. limit=22.5 2023-06-23 02:52:46,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1381812.0, ans=0.125 2023-06-23 02:52:47,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.159e+02 4.585e+02 6.307e+02 8.952e+02 1.873e+03, threshold=1.261e+03, percent-clipped=4.0 2023-06-23 02:53:03,705 INFO [train.py:996] (0/4) Epoch 8, batch 16850, loss[loss=0.236, simple_loss=0.3026, pruned_loss=0.08474, over 21739.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3292, pruned_loss=0.08862, over 4275647.32 frames. ], batch size: 389, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:53:28,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1381932.0, ans=0.025 2023-06-23 02:53:32,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1381932.0, ans=0.0 2023-06-23 02:53:51,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1381992.0, ans=0.0 2023-06-23 02:54:19,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1382052.0, ans=0.0 2023-06-23 02:54:46,768 INFO [train.py:996] (0/4) Epoch 8, batch 16900, loss[loss=0.2262, simple_loss=0.3048, pruned_loss=0.07385, over 21675.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3253, pruned_loss=0.08714, over 4283856.14 frames. 
], batch size: 414, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:55:06,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1382232.0, ans=0.125 2023-06-23 02:55:27,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1382292.0, ans=22.5 2023-06-23 02:55:33,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1382352.0, ans=0.125 2023-06-23 02:55:33,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1382352.0, ans=0.1 2023-06-23 02:55:43,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1382352.0, ans=0.125 2023-06-23 02:56:06,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1382412.0, ans=0.0 2023-06-23 02:56:07,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.361e+02 4.825e+02 6.497e+02 9.276e+02 2.744e+03, threshold=1.299e+03, percent-clipped=9.0 2023-06-23 02:56:26,519 INFO [train.py:996] (0/4) Epoch 8, batch 16950, loss[loss=0.2624, simple_loss=0.3952, pruned_loss=0.0648, over 20722.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3196, pruned_loss=0.08576, over 4281965.84 frames. ], batch size: 607, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:56:35,658 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-23 02:56:53,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1382532.0, ans=0.1 2023-06-23 02:56:56,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1382592.0, ans=10.0 2023-06-23 02:57:05,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1382592.0, ans=0.125 2023-06-23 02:57:18,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1382652.0, ans=0.025 2023-06-23 02:58:05,791 INFO [train.py:996] (0/4) Epoch 8, batch 17000, loss[loss=0.2592, simple_loss=0.3231, pruned_loss=0.09767, over 21941.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3182, pruned_loss=0.08647, over 4290561.89 frames. ], batch size: 333, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 02:58:07,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1382772.0, ans=0.0 2023-06-23 02:59:17,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.02 vs. 
limit=15.0 2023-06-23 02:59:36,307 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.719e+02 6.116e+02 8.558e+02 1.129e+03 2.527e+03, threshold=1.712e+03, percent-clipped=16.0 2023-06-23 02:59:37,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1383012.0, ans=0.125 2023-06-23 02:59:40,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1383012.0, ans=0.0 2023-06-23 02:59:46,394 INFO [train.py:996] (0/4) Epoch 8, batch 17050, loss[loss=0.2566, simple_loss=0.3456, pruned_loss=0.08386, over 21636.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3246, pruned_loss=0.08871, over 4294072.85 frames. ], batch size: 263, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:00:53,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1383252.0, ans=0.125 2023-06-23 03:00:56,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1383252.0, ans=0.125 2023-06-23 03:01:24,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1383312.0, ans=0.05 2023-06-23 03:01:27,124 INFO [train.py:996] (0/4) Epoch 8, batch 17100, loss[loss=0.2415, simple_loss=0.3115, pruned_loss=0.08568, over 21918.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3225, pruned_loss=0.08857, over 4291577.47 frames. ], batch size: 414, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:01:36,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1383372.0, ans=0.125 2023-06-23 03:02:08,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1383492.0, ans=0.1 2023-06-23 03:02:10,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1383492.0, ans=0.1 2023-06-23 03:02:27,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1383552.0, ans=0.1 2023-06-23 03:02:37,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1383552.0, ans=0.0 2023-06-23 03:02:37,783 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=15.0 2023-06-23 03:02:52,343 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.295e+02 4.559e+02 5.897e+02 8.703e+02 1.483e+03, threshold=1.179e+03, percent-clipped=0.0 2023-06-23 03:03:01,685 INFO [train.py:996] (0/4) Epoch 8, batch 17150, loss[loss=0.2097, simple_loss=0.2756, pruned_loss=0.07192, over 21244.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3191, pruned_loss=0.08829, over 4296358.40 frames. ], batch size: 608, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:03:07,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.64 vs. 
limit=5.0 2023-06-23 03:03:55,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1383792.0, ans=0.0 2023-06-23 03:04:40,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1383912.0, ans=10.0 2023-06-23 03:04:42,753 INFO [train.py:996] (0/4) Epoch 8, batch 17200, loss[loss=0.2653, simple_loss=0.3318, pruned_loss=0.09941, over 21345.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.318, pruned_loss=0.08758, over 4294482.53 frames. ], batch size: 176, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:04:54,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1383972.0, ans=0.125 2023-06-23 03:04:54,779 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:06:13,105 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.394e+02 4.493e+02 5.793e+02 8.418e+02 1.650e+03, threshold=1.159e+03, percent-clipped=7.0 2023-06-23 03:06:23,191 INFO [train.py:996] (0/4) Epoch 8, batch 17250, loss[loss=0.2834, simple_loss=0.3571, pruned_loss=0.1048, over 21424.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3228, pruned_loss=0.08983, over 4295349.69 frames. ], batch size: 471, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:06:28,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1384272.0, ans=0.0 2023-06-23 03:06:35,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1384272.0, ans=0.125 2023-06-23 03:06:56,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1384332.0, ans=0.0 2023-06-23 03:07:41,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1384452.0, ans=0.1 2023-06-23 03:08:09,830 INFO [train.py:996] (0/4) Epoch 8, batch 17300, loss[loss=0.2736, simple_loss=0.3444, pruned_loss=0.1014, over 21811.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3308, pruned_loss=0.09411, over 4294517.61 frames. ], batch size: 282, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:08:30,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1384632.0, ans=0.1 2023-06-23 03:08:35,500 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.01 vs. 
limit=10.0 2023-06-23 03:08:44,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1384632.0, ans=0.125 2023-06-23 03:08:52,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1384632.0, ans=0.0 2023-06-23 03:08:58,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1384692.0, ans=0.1 2023-06-23 03:09:36,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1384812.0, ans=0.125 2023-06-23 03:09:37,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=1384812.0, ans=12.0 2023-06-23 03:09:41,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.372e+02 4.842e+02 6.354e+02 8.974e+02 2.324e+03, threshold=1.271e+03, percent-clipped=7.0 2023-06-23 03:09:55,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1384872.0, ans=0.125 2023-06-23 03:09:56,688 INFO [train.py:996] (0/4) Epoch 8, batch 17350, loss[loss=0.2504, simple_loss=0.3429, pruned_loss=0.07899, over 21723.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3306, pruned_loss=0.0939, over 4277458.79 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:10:01,753 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:10:14,578 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-23 03:11:19,309 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:11:33,404 INFO [train.py:996] (0/4) Epoch 8, batch 17400, loss[loss=0.2224, simple_loss=0.2954, pruned_loss=0.07473, over 21690.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3267, pruned_loss=0.09002, over 4268572.21 frames. ], batch size: 247, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:11:47,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1385172.0, ans=0.125 2023-06-23 03:12:37,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-23 03:12:37,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1385352.0, ans=0.125 2023-06-23 03:12:53,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1385412.0, ans=0.125 2023-06-23 03:12:59,093 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.10 vs. 
limit=15.0 2023-06-23 03:13:07,301 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.616e+02 6.487e+02 8.880e+02 2.609e+03, threshold=1.297e+03, percent-clipped=10.0 2023-06-23 03:13:18,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1385472.0, ans=0.0 2023-06-23 03:13:19,805 INFO [train.py:996] (0/4) Epoch 8, batch 17450, loss[loss=0.1892, simple_loss=0.2754, pruned_loss=0.05152, over 21380.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3237, pruned_loss=0.08735, over 4273630.20 frames. ], batch size: 211, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:14:12,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1385652.0, ans=0.125 2023-06-23 03:14:21,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1385652.0, ans=0.125 2023-06-23 03:14:25,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-06-23 03:15:00,673 INFO [train.py:996] (0/4) Epoch 8, batch 17500, loss[loss=0.2582, simple_loss=0.3187, pruned_loss=0.09887, over 21872.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3202, pruned_loss=0.08533, over 4277993.42 frames. ], batch size: 332, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:15:11,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1385772.0, ans=15.0 2023-06-23 03:15:43,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1385892.0, ans=0.125 2023-06-23 03:15:56,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1385952.0, ans=0.0 2023-06-23 03:16:09,760 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.52 vs. limit=15.0 2023-06-23 03:16:31,994 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.152e+02 4.278e+02 5.524e+02 8.928e+02 1.678e+03, threshold=1.105e+03, percent-clipped=3.0 2023-06-23 03:16:37,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1386012.0, ans=0.1 2023-06-23 03:16:40,142 INFO [train.py:996] (0/4) Epoch 8, batch 17550, loss[loss=0.2249, simple_loss=0.3177, pruned_loss=0.06602, over 21825.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3193, pruned_loss=0.08283, over 4280987.55 frames. ], batch size: 316, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:17:02,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1386132.0, ans=0.0 2023-06-23 03:17:27,737 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.99 vs. 
limit=10.0 2023-06-23 03:17:34,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1386252.0, ans=15.0 2023-06-23 03:18:05,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1386312.0, ans=0.2 2023-06-23 03:18:18,847 INFO [train.py:996] (0/4) Epoch 8, batch 17600, loss[loss=0.2416, simple_loss=0.3252, pruned_loss=0.07902, over 21500.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3205, pruned_loss=0.08297, over 4271596.12 frames. ], batch size: 131, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:19:48,293 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.108e+02 4.735e+02 6.263e+02 8.398e+02 1.704e+03, threshold=1.253e+03, percent-clipped=10.0 2023-06-23 03:19:53,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1386612.0, ans=0.125 2023-06-23 03:19:55,707 INFO [train.py:996] (0/4) Epoch 8, batch 17650, loss[loss=0.2598, simple_loss=0.3333, pruned_loss=0.09317, over 21252.00 frames. ], tot_loss[loss=0.243, simple_loss=0.319, pruned_loss=0.08348, over 4266956.67 frames. ], batch size: 143, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:20:24,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1386732.0, ans=0.0 2023-06-23 03:20:30,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1386732.0, ans=0.95 2023-06-23 03:21:22,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1386912.0, ans=0.125 2023-06-23 03:21:36,265 INFO [train.py:996] (0/4) Epoch 8, batch 17700, loss[loss=0.3114, simple_loss=0.3858, pruned_loss=0.1185, over 21718.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3114, pruned_loss=0.08047, over 4249232.90 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:21:43,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1386972.0, ans=0.025 2023-06-23 03:22:04,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1387032.0, ans=0.125 2023-06-23 03:23:11,347 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.033e+02 4.293e+02 5.499e+02 1.006e+03 2.228e+03, threshold=1.100e+03, percent-clipped=12.0 2023-06-23 03:23:17,905 INFO [train.py:996] (0/4) Epoch 8, batch 17750, loss[loss=0.2385, simple_loss=0.3189, pruned_loss=0.07907, over 21536.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3189, pruned_loss=0.08396, over 4257160.09 frames. ], batch size: 112, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:23:21,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1387272.0, ans=0.0 2023-06-23 03:24:46,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1387512.0, ans=0.125 2023-06-23 03:24:58,733 INFO [train.py:996] (0/4) Epoch 8, batch 17800, loss[loss=0.2163, simple_loss=0.2813, pruned_loss=0.07565, over 20091.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3175, pruned_loss=0.08239, over 4257817.02 frames. 
], batch size: 702, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:25:59,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1387692.0, ans=0.0 2023-06-23 03:26:01,600 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-23 03:26:06,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1387752.0, ans=0.2 2023-06-23 03:26:12,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1387752.0, ans=0.2 2023-06-23 03:26:15,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.27 vs. limit=15.0 2023-06-23 03:26:32,501 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 4.467e+02 6.036e+02 8.319e+02 2.220e+03, threshold=1.207e+03, percent-clipped=14.0 2023-06-23 03:26:36,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1387812.0, ans=0.2 2023-06-23 03:26:39,343 INFO [train.py:996] (0/4) Epoch 8, batch 17850, loss[loss=0.2257, simple_loss=0.2935, pruned_loss=0.07892, over 20154.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3163, pruned_loss=0.08234, over 4253190.45 frames. ], batch size: 702, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:27:14,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1387932.0, ans=0.0 2023-06-23 03:27:25,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1387992.0, ans=0.0 2023-06-23 03:27:27,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.27 vs. limit=15.0 2023-06-23 03:27:38,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1388052.0, ans=0.1 2023-06-23 03:27:41,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1388052.0, ans=0.1 2023-06-23 03:28:05,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1388112.0, ans=0.1 2023-06-23 03:28:16,050 INFO [train.py:996] (0/4) Epoch 8, batch 17900, loss[loss=0.2489, simple_loss=0.3475, pruned_loss=0.07515, over 21730.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3228, pruned_loss=0.08496, over 4253278.49 frames. ], batch size: 351, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:28:51,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1388232.0, ans=0.0 2023-06-23 03:29:08,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1388292.0, ans=0.2 2023-06-23 03:30:00,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.268e+02 4.650e+02 5.974e+02 7.368e+02 1.876e+03, threshold=1.195e+03, percent-clipped=6.0 2023-06-23 03:30:11,656 INFO [train.py:996] (0/4) Epoch 8, batch 17950, loss[loss=0.1884, simple_loss=0.2856, pruned_loss=0.04557, over 21761.00 frames. 
], tot_loss[loss=0.2442, simple_loss=0.3238, pruned_loss=0.08229, over 4255400.89 frames. ], batch size: 298, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:30:46,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1388592.0, ans=0.0 2023-06-23 03:30:54,853 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-23 03:31:04,600 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:31:49,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1388772.0, ans=0.0 2023-06-23 03:31:50,008 INFO [train.py:996] (0/4) Epoch 8, batch 18000, loss[loss=0.2245, simple_loss=0.2839, pruned_loss=0.08252, over 21320.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3177, pruned_loss=0.08003, over 4254939.25 frames. ], batch size: 160, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:31:50,009 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 03:32:06,870 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2644, simple_loss=0.3593, pruned_loss=0.08473, over 1796401.00 frames. 2023-06-23 03:32:06,871 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 03:32:31,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1388832.0, ans=0.2 2023-06-23 03:33:43,876 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.172e+02 4.301e+02 6.081e+02 8.972e+02 1.795e+03, threshold=1.216e+03, percent-clipped=14.0 2023-06-23 03:33:46,946 INFO [train.py:996] (0/4) Epoch 8, batch 18050, loss[loss=0.2767, simple_loss=0.3412, pruned_loss=0.1061, over 21682.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3136, pruned_loss=0.08019, over 4244777.93 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:33:56,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=22.5 2023-06-23 03:33:57,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1389072.0, ans=0.2 2023-06-23 03:35:28,054 INFO [train.py:996] (0/4) Epoch 8, batch 18100, loss[loss=0.2304, simple_loss=0.3301, pruned_loss=0.06532, over 21636.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.315, pruned_loss=0.08163, over 4249694.73 frames. ], batch size: 263, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:35:28,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1389372.0, ans=0.0 2023-06-23 03:36:01,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1389432.0, ans=0.125 2023-06-23 03:37:04,996 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.338e+02 4.664e+02 6.579e+02 9.782e+02 2.052e+03, threshold=1.316e+03, percent-clipped=11.0 2023-06-23 03:37:06,662 INFO [train.py:996] (0/4) Epoch 8, batch 18150, loss[loss=0.2159, simple_loss=0.3024, pruned_loss=0.06471, over 21325.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3181, pruned_loss=0.08274, over 4250977.19 frames. 
], batch size: 211, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:38:42,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1389972.0, ans=0.0 2023-06-23 03:38:43,096 INFO [train.py:996] (0/4) Epoch 8, batch 18200, loss[loss=0.2004, simple_loss=0.2737, pruned_loss=0.06353, over 21427.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3127, pruned_loss=0.08162, over 4249743.56 frames. ], batch size: 144, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:38:56,561 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-23 03:39:49,377 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.63 vs. limit=15.0 2023-06-23 03:39:58,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1390152.0, ans=0.04949747468305833 2023-06-23 03:40:17,377 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.155e+02 4.842e+02 6.713e+02 9.646e+02 2.158e+03, threshold=1.343e+03, percent-clipped=10.0 2023-06-23 03:40:19,041 INFO [train.py:996] (0/4) Epoch 8, batch 18250, loss[loss=0.2676, simple_loss=0.3294, pruned_loss=0.1029, over 21858.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3048, pruned_loss=0.0789, over 4244711.26 frames. ], batch size: 107, lr: 3.70e-03, grad_scale: 8.0 2023-06-23 03:40:19,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1390272.0, ans=0.125 2023-06-23 03:40:54,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=12.0 2023-06-23 03:41:44,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1390512.0, ans=0.0 2023-06-23 03:41:56,238 INFO [train.py:996] (0/4) Epoch 8, batch 18300, loss[loss=0.3064, simple_loss=0.3839, pruned_loss=0.1145, over 19901.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3056, pruned_loss=0.07852, over 4250122.69 frames. ], batch size: 702, lr: 3.70e-03, grad_scale: 8.0 2023-06-23 03:42:04,594 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=12.0 2023-06-23 03:42:16,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1390632.0, ans=0.125 2023-06-23 03:42:16,724 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=22.5 2023-06-23 03:42:41,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5 2023-06-23 03:43:14,305 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.83 vs. 
limit=22.5 2023-06-23 03:43:15,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1390812.0, ans=15.0 2023-06-23 03:43:32,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.212e+02 4.888e+02 7.173e+02 1.170e+03 2.600e+03, threshold=1.435e+03, percent-clipped=18.0 2023-06-23 03:43:34,051 INFO [train.py:996] (0/4) Epoch 8, batch 18350, loss[loss=0.1915, simple_loss=0.2609, pruned_loss=0.06099, over 16973.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3083, pruned_loss=0.07844, over 4250833.08 frames. ], batch size: 65, lr: 3.70e-03, grad_scale: 8.0 2023-06-23 03:43:59,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1390932.0, ans=0.2 2023-06-23 03:44:07,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1390992.0, ans=0.125 2023-06-23 03:44:09,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1390992.0, ans=0.0 2023-06-23 03:45:12,250 INFO [train.py:996] (0/4) Epoch 8, batch 18400, loss[loss=0.1988, simple_loss=0.2807, pruned_loss=0.05849, over 21629.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3043, pruned_loss=0.07702, over 4247056.38 frames. ], batch size: 391, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:45:29,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1391232.0, ans=0.1 2023-06-23 03:45:35,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1391232.0, ans=0.125 2023-06-23 03:46:22,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=22.5 2023-06-23 03:46:29,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1391352.0, ans=0.0 2023-06-23 03:46:46,183 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.306e+02 6.034e+02 8.770e+02 2.014e+03, threshold=1.207e+03, percent-clipped=5.0 2023-06-23 03:46:48,008 INFO [train.py:996] (0/4) Epoch 8, batch 18450, loss[loss=0.2494, simple_loss=0.3637, pruned_loss=0.06751, over 19918.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3027, pruned_loss=0.07386, over 4247115.83 frames. 
], batch size: 702, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:46:48,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1391472.0, ans=0.125 2023-06-23 03:47:01,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1391472.0, ans=0.125 2023-06-23 03:47:02,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1391532.0, ans=0.125 2023-06-23 03:47:13,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1391532.0, ans=0.2 2023-06-23 03:47:18,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1391592.0, ans=0.125 2023-06-23 03:48:25,060 INFO [train.py:996] (0/4) Epoch 8, batch 18500, loss[loss=0.1882, simple_loss=0.2743, pruned_loss=0.051, over 21742.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2979, pruned_loss=0.07291, over 4248214.39 frames. ], batch size: 282, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:49:10,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1391892.0, ans=0.125 2023-06-23 03:49:36,830 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-232000.pt 2023-06-23 03:49:58,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1392012.0, ans=0.125 2023-06-23 03:50:02,677 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.977e+02 4.135e+02 5.661e+02 7.712e+02 1.457e+03, threshold=1.132e+03, percent-clipped=3.0 2023-06-23 03:50:04,121 INFO [train.py:996] (0/4) Epoch 8, batch 18550, loss[loss=0.2311, simple_loss=0.2895, pruned_loss=0.08632, over 21815.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.294, pruned_loss=0.07244, over 4251287.04 frames. ], batch size: 107, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:50:54,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1392192.0, ans=0.125 2023-06-23 03:50:57,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1392192.0, ans=0.125 2023-06-23 03:51:10,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1392252.0, ans=0.125 2023-06-23 03:51:37,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1392312.0, ans=0.125 2023-06-23 03:51:43,236 INFO [train.py:996] (0/4) Epoch 8, batch 18600, loss[loss=0.2215, simple_loss=0.3032, pruned_loss=0.06984, over 21796.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2936, pruned_loss=0.07266, over 4244778.33 frames. ], batch size: 317, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:51:46,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1392372.0, ans=0.125 2023-06-23 03:51:55,051 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. 
limit=15.0 2023-06-23 03:51:55,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1392372.0, ans=0.0 2023-06-23 03:53:17,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.135e+02 5.174e+02 7.950e+02 1.061e+03 1.906e+03, threshold=1.590e+03, percent-clipped=19.0 2023-06-23 03:53:19,662 INFO [train.py:996] (0/4) Epoch 8, batch 18650, loss[loss=0.2528, simple_loss=0.304, pruned_loss=0.1008, over 15336.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2932, pruned_loss=0.07319, over 4241500.95 frames. ], batch size: 60, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:53:43,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1392732.0, ans=0.2 2023-06-23 03:53:49,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1392792.0, ans=0.0 2023-06-23 03:54:31,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1392852.0, ans=0.125 2023-06-23 03:54:33,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1392852.0, ans=0.125 2023-06-23 03:54:35,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1392912.0, ans=0.0 2023-06-23 03:54:36,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1392912.0, ans=0.1 2023-06-23 03:54:43,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1392912.0, ans=0.1 2023-06-23 03:54:49,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1392912.0, ans=0.1 2023-06-23 03:54:53,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1392912.0, ans=0.125 2023-06-23 03:54:56,258 INFO [train.py:996] (0/4) Epoch 8, batch 18700, loss[loss=0.2722, simple_loss=0.3289, pruned_loss=0.1078, over 21862.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2921, pruned_loss=0.0752, over 4257634.95 frames. ], batch size: 414, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:55:00,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-23 03:56:32,200 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.347e+02 4.104e+02 5.076e+02 6.613e+02 1.727e+03, threshold=1.015e+03, percent-clipped=1.0 2023-06-23 03:56:33,804 INFO [train.py:996] (0/4) Epoch 8, batch 18750, loss[loss=0.2742, simple_loss=0.3418, pruned_loss=0.1033, over 21398.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.294, pruned_loss=0.07759, over 4259998.72 frames. 
], batch size: 194, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:56:34,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1393272.0, ans=0.125 2023-06-23 03:56:51,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1393332.0, ans=0.125 2023-06-23 03:56:54,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1393332.0, ans=0.0 2023-06-23 03:57:25,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1393452.0, ans=0.125 2023-06-23 03:58:09,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1393512.0, ans=0.0 2023-06-23 03:58:11,923 INFO [train.py:996] (0/4) Epoch 8, batch 18800, loss[loss=0.2549, simple_loss=0.3279, pruned_loss=0.09092, over 21767.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3016, pruned_loss=0.07911, over 4252387.66 frames. ], batch size: 351, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 03:58:37,597 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.52 vs. limit=15.0 2023-06-23 03:59:21,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1393752.0, ans=0.0 2023-06-23 03:59:24,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1393752.0, ans=0.1 2023-06-23 03:59:26,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1393752.0, ans=0.04949747468305833 2023-06-23 03:59:48,204 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.636e+02 4.595e+02 6.296e+02 8.883e+02 2.093e+03, threshold=1.259e+03, percent-clipped=21.0 2023-06-23 03:59:50,069 INFO [train.py:996] (0/4) Epoch 8, batch 18850, loss[loss=0.1992, simple_loss=0.2657, pruned_loss=0.06633, over 21258.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.297, pruned_loss=0.07462, over 4250439.72 frames. ], batch size: 176, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 03:59:50,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1393872.0, ans=0.1 2023-06-23 03:59:55,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1393872.0, ans=0.0 2023-06-23 03:59:55,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1393872.0, ans=0.0 2023-06-23 04:00:12,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1393932.0, ans=0.125 2023-06-23 04:00:48,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=15.0 2023-06-23 04:01:02,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1394052.0, ans=0.125 2023-06-23 04:01:21,382 INFO [train.py:996] (0/4) Epoch 8, batch 18900, loss[loss=0.2457, simple_loss=0.3399, pruned_loss=0.07572, over 20916.00 frames. 
], tot_loss[loss=0.2205, simple_loss=0.2931, pruned_loss=0.07396, over 4241910.96 frames. ], batch size: 607, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:01:34,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1394172.0, ans=0.125 2023-06-23 04:01:45,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1394232.0, ans=0.1 2023-06-23 04:02:03,150 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 04:02:58,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.232e+02 4.494e+02 5.345e+02 6.718e+02 1.434e+03, threshold=1.069e+03, percent-clipped=2.0 2023-06-23 04:02:59,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1394472.0, ans=0.125 2023-06-23 04:03:00,406 INFO [train.py:996] (0/4) Epoch 8, batch 18950, loss[loss=0.2538, simple_loss=0.3579, pruned_loss=0.07482, over 21243.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2953, pruned_loss=0.07587, over 4248461.19 frames. ], batch size: 548, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:03:13,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1394472.0, ans=0.0 2023-06-23 04:03:23,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-23 04:03:40,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1394592.0, ans=0.0 2023-06-23 04:03:55,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1394592.0, ans=0.025 2023-06-23 04:04:11,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1394652.0, ans=0.0 2023-06-23 04:04:26,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1394712.0, ans=0.125 2023-06-23 04:04:40,000 INFO [train.py:996] (0/4) Epoch 8, batch 19000, loss[loss=0.2281, simple_loss=0.3191, pruned_loss=0.06857, over 21536.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3054, pruned_loss=0.07804, over 4254133.34 frames. ], batch size: 212, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:05:17,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1394892.0, ans=0.0 2023-06-23 04:05:46,282 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-06-23 04:06:07,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1395012.0, ans=0.025 2023-06-23 04:06:11,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.757e+02 5.109e+02 8.049e+02 1.091e+03 2.389e+03, threshold=1.610e+03, percent-clipped=25.0 2023-06-23 04:06:13,349 INFO [train.py:996] (0/4) Epoch 8, batch 19050, loss[loss=0.3019, simple_loss=0.3662, pruned_loss=0.1188, over 21418.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3111, pruned_loss=0.08219, over 4263374.93 frames. 
], batch size: 471, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:06:25,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1395072.0, ans=0.125 2023-06-23 04:07:11,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1395192.0, ans=0.0 2023-06-23 04:07:31,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1395252.0, ans=0.1 2023-06-23 04:07:53,277 INFO [train.py:996] (0/4) Epoch 8, batch 19100, loss[loss=0.2267, simple_loss=0.296, pruned_loss=0.07873, over 21263.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3122, pruned_loss=0.08475, over 4269662.76 frames. ], batch size: 548, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:08:54,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-23 04:09:02,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1395552.0, ans=0.0 2023-06-23 04:09:20,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1395612.0, ans=0.125 2023-06-23 04:09:22,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1395612.0, ans=0.07 2023-06-23 04:09:33,315 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.382e+02 4.590e+02 5.879e+02 8.375e+02 2.097e+03, threshold=1.176e+03, percent-clipped=3.0 2023-06-23 04:09:34,841 INFO [train.py:996] (0/4) Epoch 8, batch 19150, loss[loss=0.2456, simple_loss=0.3398, pruned_loss=0.07568, over 21735.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3145, pruned_loss=0.08616, over 4277299.26 frames. ], batch size: 282, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:11:19,412 INFO [train.py:996] (0/4) Epoch 8, batch 19200, loss[loss=0.2918, simple_loss=0.3914, pruned_loss=0.0961, over 21490.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3227, pruned_loss=0.08547, over 4283000.19 frames. ], batch size: 471, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:11:21,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1395972.0, ans=0.0 2023-06-23 04:11:37,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1395972.0, ans=0.125 2023-06-23 04:12:50,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. limit=12.0 2023-06-23 04:12:50,651 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.764e+02 4.861e+02 7.058e+02 9.743e+02 2.046e+03, threshold=1.412e+03, percent-clipped=16.0 2023-06-23 04:12:50,672 INFO [train.py:996] (0/4) Epoch 8, batch 19250, loss[loss=0.22, simple_loss=0.2887, pruned_loss=0.07559, over 21403.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3205, pruned_loss=0.08019, over 4279231.27 frames. ], batch size: 131, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:13:07,293 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.36 vs. 
limit=15.0 2023-06-23 04:14:29,806 INFO [train.py:996] (0/4) Epoch 8, batch 19300, loss[loss=0.2379, simple_loss=0.3039, pruned_loss=0.08594, over 21881.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3172, pruned_loss=0.07987, over 4284735.40 frames. ], batch size: 124, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:14:39,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1396572.0, ans=0.015 2023-06-23 04:14:55,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.54 vs. limit=10.0 2023-06-23 04:15:04,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1396632.0, ans=0.1 2023-06-23 04:15:12,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1396632.0, ans=0.125 2023-06-23 04:15:52,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1396812.0, ans=0.2 2023-06-23 04:16:11,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1396812.0, ans=0.2 2023-06-23 04:16:14,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.848e+02 5.029e+02 6.832e+02 8.768e+02 1.869e+03, threshold=1.366e+03, percent-clipped=8.0 2023-06-23 04:16:14,291 INFO [train.py:996] (0/4) Epoch 8, batch 19350, loss[loss=0.189, simple_loss=0.275, pruned_loss=0.05148, over 21685.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3131, pruned_loss=0.07679, over 4283485.35 frames. ], batch size: 247, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:16:31,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1396872.0, ans=0.125 2023-06-23 04:17:03,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1396992.0, ans=0.125 2023-06-23 04:17:39,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.85 vs. limit=15.0 2023-06-23 04:17:45,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1397112.0, ans=0.125 2023-06-23 04:17:45,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1397112.0, ans=0.125 2023-06-23 04:17:54,532 INFO [train.py:996] (0/4) Epoch 8, batch 19400, loss[loss=0.24, simple_loss=0.3118, pruned_loss=0.08417, over 21779.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3119, pruned_loss=0.07612, over 4283736.38 frames. 
], batch size: 298, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:17:55,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1397172.0, ans=0.2 2023-06-23 04:18:10,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1397172.0, ans=0.125 2023-06-23 04:18:11,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1397172.0, ans=10.0 2023-06-23 04:19:09,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1397352.0, ans=0.125 2023-06-23 04:19:38,377 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.184e+02 4.448e+02 5.769e+02 7.543e+02 1.139e+03, threshold=1.154e+03, percent-clipped=0.0 2023-06-23 04:19:38,407 INFO [train.py:996] (0/4) Epoch 8, batch 19450, loss[loss=0.23, simple_loss=0.2904, pruned_loss=0.08481, over 21685.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3085, pruned_loss=0.07747, over 4290879.05 frames. ], batch size: 414, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:20:05,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=15.0 2023-06-23 04:20:15,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1397532.0, ans=0.0 2023-06-23 04:20:37,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1397652.0, ans=0.1 2023-06-23 04:20:50,179 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 04:21:16,473 INFO [train.py:996] (0/4) Epoch 8, batch 19500, loss[loss=0.2558, simple_loss=0.3121, pruned_loss=0.09977, over 21856.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3048, pruned_loss=0.07869, over 4288911.79 frames. ], batch size: 98, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:21:21,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1397772.0, ans=0.125 2023-06-23 04:22:24,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1397952.0, ans=0.0 2023-06-23 04:22:36,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-23 04:22:54,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.438e+02 5.009e+02 6.847e+02 1.109e+03 2.464e+03, threshold=1.369e+03, percent-clipped=22.0 2023-06-23 04:22:54,727 INFO [train.py:996] (0/4) Epoch 8, batch 19550, loss[loss=0.2173, simple_loss=0.3129, pruned_loss=0.06082, over 21765.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3004, pruned_loss=0.07656, over 4278057.00 frames. ], batch size: 332, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:23:00,439 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.48 vs. 
limit=22.5 2023-06-23 04:23:13,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1398132.0, ans=0.0 2023-06-23 04:23:37,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1398192.0, ans=0.125 2023-06-23 04:23:40,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-23 04:23:49,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1398252.0, ans=0.2 2023-06-23 04:23:57,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.81 vs. limit=15.0 2023-06-23 04:24:10,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1398312.0, ans=0.1 2023-06-23 04:24:30,134 INFO [train.py:996] (0/4) Epoch 8, batch 19600, loss[loss=0.2393, simple_loss=0.3001, pruned_loss=0.08923, over 21951.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3023, pruned_loss=0.07767, over 4281312.74 frames. ], batch size: 316, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:25:40,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1398552.0, ans=0.125 2023-06-23 04:26:08,765 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.522e+02 4.617e+02 5.615e+02 7.940e+02 2.383e+03, threshold=1.123e+03, percent-clipped=6.0 2023-06-23 04:26:08,792 INFO [train.py:996] (0/4) Epoch 8, batch 19650, loss[loss=0.2902, simple_loss=0.3623, pruned_loss=0.1091, over 21767.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3073, pruned_loss=0.08147, over 4281604.54 frames. ], batch size: 124, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:26:09,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1398672.0, ans=0.0 2023-06-23 04:27:06,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1398852.0, ans=0.1 2023-06-23 04:27:27,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-23 04:27:55,200 INFO [train.py:996] (0/4) Epoch 8, batch 19700, loss[loss=0.2253, simple_loss=0.3043, pruned_loss=0.07316, over 21648.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3112, pruned_loss=0.08276, over 4278521.90 frames. 
], batch size: 247, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:28:38,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1399092.0, ans=0.125 2023-06-23 04:29:05,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1399152.0, ans=0.0 2023-06-23 04:29:19,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1399212.0, ans=0.2 2023-06-23 04:29:24,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1399212.0, ans=0.5 2023-06-23 04:29:34,757 INFO [train.py:996] (0/4) Epoch 8, batch 19750, loss[loss=0.2627, simple_loss=0.352, pruned_loss=0.0867, over 21715.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3182, pruned_loss=0.08325, over 4276085.46 frames. ], batch size: 247, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:29:35,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=12.0 2023-06-23 04:29:36,359 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.270e+02 5.158e+02 7.198e+02 1.115e+03 3.431e+03, threshold=1.440e+03, percent-clipped=24.0 2023-06-23 04:30:00,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1399332.0, ans=0.0 2023-06-23 04:31:05,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1399512.0, ans=0.125 2023-06-23 04:31:12,830 INFO [train.py:996] (0/4) Epoch 8, batch 19800, loss[loss=0.2322, simple_loss=0.297, pruned_loss=0.08373, over 21366.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3182, pruned_loss=0.08365, over 4282792.27 frames. ], batch size: 159, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:31:15,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1399572.0, ans=0.0 2023-06-23 04:31:16,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1399572.0, ans=0.1 2023-06-23 04:31:47,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1399632.0, ans=0.0 2023-06-23 04:32:44,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1399812.0, ans=0.2 2023-06-23 04:32:51,982 INFO [train.py:996] (0/4) Epoch 8, batch 19850, loss[loss=0.2529, simple_loss=0.3669, pruned_loss=0.06942, over 19783.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3116, pruned_loss=0.07952, over 4281439.97 frames. 
], batch size: 703, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:32:53,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.027e+02 4.926e+02 6.065e+02 9.192e+02 2.099e+03, threshold=1.213e+03, percent-clipped=4.0 2023-06-23 04:34:08,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1400112.0, ans=0.0 2023-06-23 04:34:16,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1400112.0, ans=0.025 2023-06-23 04:34:19,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1400112.0, ans=0.125 2023-06-23 04:34:28,524 INFO [train.py:996] (0/4) Epoch 8, batch 19900, loss[loss=0.23, simple_loss=0.3075, pruned_loss=0.07628, over 21602.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3112, pruned_loss=0.07645, over 4276992.04 frames. ], batch size: 414, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:35:34,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.82 vs. limit=10.0 2023-06-23 04:35:40,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1400352.0, ans=0.1 2023-06-23 04:35:42,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1400352.0, ans=0.125 2023-06-23 04:36:04,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0 2023-06-23 04:36:08,916 INFO [train.py:996] (0/4) Epoch 8, batch 19950, loss[loss=0.2046, simple_loss=0.2725, pruned_loss=0.06839, over 21145.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3067, pruned_loss=0.07662, over 4271303.50 frames. 
], batch size: 548, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:36:10,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.980e+02 4.068e+02 6.013e+02 8.874e+02 2.224e+03, threshold=1.203e+03, percent-clipped=12.0 2023-06-23 04:36:25,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1400472.0, ans=0.125 2023-06-23 04:36:41,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1400532.0, ans=0.125 2023-06-23 04:36:46,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1400532.0, ans=0.125 2023-06-23 04:36:50,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1400592.0, ans=0.1 2023-06-23 04:36:56,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1400592.0, ans=0.035 2023-06-23 04:37:19,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1400652.0, ans=0.025 2023-06-23 04:37:38,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1400712.0, ans=0.0 2023-06-23 04:37:40,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1400712.0, ans=0.125 2023-06-23 04:37:42,770 INFO [train.py:996] (0/4) Epoch 8, batch 20000, loss[loss=0.263, simple_loss=0.3372, pruned_loss=0.09438, over 21749.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3102, pruned_loss=0.07801, over 4270865.87 frames. ], batch size: 441, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:37:59,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1400772.0, ans=0.0 2023-06-23 04:38:26,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1400892.0, ans=0.2 2023-06-23 04:38:40,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-23 04:38:41,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1400952.0, ans=0.125 2023-06-23 04:39:15,702 INFO [train.py:996] (0/4) Epoch 8, batch 20050, loss[loss=0.2448, simple_loss=0.3044, pruned_loss=0.09259, over 20157.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3121, pruned_loss=0.08113, over 4278147.15 frames. ], batch size: 707, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:39:18,575 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.39 vs. 
limit=10.0 2023-06-23 04:39:18,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.252e+02 4.577e+02 6.319e+02 8.281e+02 1.487e+03, threshold=1.264e+03, percent-clipped=6.0 2023-06-23 04:39:31,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1401072.0, ans=0.125 2023-06-23 04:40:11,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1401192.0, ans=0.015 2023-06-23 04:40:22,380 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.48 vs. limit=15.0 2023-06-23 04:40:36,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0 2023-06-23 04:40:43,843 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 04:40:54,580 INFO [train.py:996] (0/4) Epoch 8, batch 20100, loss[loss=0.2955, simple_loss=0.3867, pruned_loss=0.1021, over 21697.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3156, pruned_loss=0.0837, over 4281897.95 frames. ], batch size: 441, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:41:42,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1401492.0, ans=0.125 2023-06-23 04:42:19,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1401612.0, ans=0.125 2023-06-23 04:42:47,131 INFO [train.py:996] (0/4) Epoch 8, batch 20150, loss[loss=0.2802, simple_loss=0.3635, pruned_loss=0.09842, over 21350.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3235, pruned_loss=0.0865, over 4278454.20 frames. ], batch size: 548, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:42:50,161 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.397e+02 4.538e+02 5.704e+02 8.156e+02 2.453e+03, threshold=1.141e+03, percent-clipped=8.0 2023-06-23 04:43:02,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1401732.0, ans=0.125 2023-06-23 04:43:41,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0 2023-06-23 04:44:02,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-23 04:44:17,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1401912.0, ans=0.09899494936611666 2023-06-23 04:44:24,628 INFO [train.py:996] (0/4) Epoch 8, batch 20200, loss[loss=0.2898, simple_loss=0.4002, pruned_loss=0.08972, over 20813.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3299, pruned_loss=0.08956, over 4271254.36 frames. ], batch size: 607, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:44:44,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1402032.0, ans=0.125 2023-06-23 04:45:23,294 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.92 vs. 
limit=5.0 2023-06-23 04:45:54,567 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.75 vs. limit=15.0 2023-06-23 04:45:58,119 INFO [train.py:996] (0/4) Epoch 8, batch 20250, loss[loss=0.2107, simple_loss=0.2933, pruned_loss=0.06403, over 21770.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3281, pruned_loss=0.08719, over 4270138.85 frames. ], batch size: 247, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:46:01,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 5.077e+02 7.177e+02 9.506e+02 2.179e+03, threshold=1.435e+03, percent-clipped=12.0 2023-06-23 04:46:02,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1402272.0, ans=0.0 2023-06-23 04:46:30,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1402332.0, ans=0.0 2023-06-23 04:47:31,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1402512.0, ans=0.0 2023-06-23 04:47:37,020 INFO [train.py:996] (0/4) Epoch 8, batch 20300, loss[loss=0.2143, simple_loss=0.2743, pruned_loss=0.07712, over 21906.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3245, pruned_loss=0.08382, over 4276425.54 frames. ], batch size: 107, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:47:45,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-23 04:48:19,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1402692.0, ans=0.0 2023-06-23 04:49:10,183 INFO [train.py:996] (0/4) Epoch 8, batch 20350, loss[loss=0.2664, simple_loss=0.3373, pruned_loss=0.0978, over 21430.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3232, pruned_loss=0.08304, over 4262599.59 frames. ], batch size: 131, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:49:13,378 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.458e+02 4.961e+02 7.570e+02 1.006e+03 1.715e+03, threshold=1.514e+03, percent-clipped=7.0 2023-06-23 04:49:24,192 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0 2023-06-23 04:49:33,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1402932.0, ans=0.05 2023-06-23 04:50:03,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1403052.0, ans=0.07 2023-06-23 04:50:11,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1403052.0, ans=0.07 2023-06-23 04:50:27,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1403112.0, ans=0.125 2023-06-23 04:50:32,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.87 vs. limit=22.5 2023-06-23 04:50:44,164 INFO [train.py:996] (0/4) Epoch 8, batch 20400, loss[loss=0.2013, simple_loss=0.2707, pruned_loss=0.06593, over 16336.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3248, pruned_loss=0.0856, over 4252472.53 frames. 
], batch size: 62, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:51:02,612 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. limit=6.0 2023-06-23 04:52:17,050 INFO [train.py:996] (0/4) Epoch 8, batch 20450, loss[loss=0.2314, simple_loss=0.3, pruned_loss=0.08138, over 21910.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3268, pruned_loss=0.08811, over 4258027.85 frames. ], batch size: 316, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:52:18,186 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-23 04:52:20,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.597e+02 5.065e+02 6.621e+02 9.433e+02 1.870e+03, threshold=1.324e+03, percent-clipped=2.0 2023-06-23 04:52:20,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1403472.0, ans=0.0 2023-06-23 04:52:29,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1403472.0, ans=0.125 2023-06-23 04:53:15,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1403652.0, ans=0.125 2023-06-23 04:53:29,610 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.65 vs. limit=10.0 2023-06-23 04:53:35,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1403652.0, ans=0.2 2023-06-23 04:53:51,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1403712.0, ans=0.0 2023-06-23 04:53:54,333 INFO [train.py:996] (0/4) Epoch 8, batch 20500, loss[loss=0.2326, simple_loss=0.3038, pruned_loss=0.08066, over 21760.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3227, pruned_loss=0.08876, over 4248850.68 frames. ], batch size: 124, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:54:09,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1403832.0, ans=0.125 2023-06-23 04:54:19,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1403832.0, ans=0.2 2023-06-23 04:54:39,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1403892.0, ans=0.0 2023-06-23 04:55:00,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1403952.0, ans=0.0 2023-06-23 04:55:08,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1403952.0, ans=0.125 2023-06-23 04:55:09,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1403952.0, ans=0.2 2023-06-23 04:55:28,231 INFO [train.py:996] (0/4) Epoch 8, batch 20550, loss[loss=0.2746, simple_loss=0.3428, pruned_loss=0.1032, over 21433.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3148, pruned_loss=0.08703, over 4244355.33 frames. 
], batch size: 473, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:55:31,311 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 4.304e+02 5.833e+02 8.675e+02 1.439e+03, threshold=1.167e+03, percent-clipped=3.0 2023-06-23 04:55:41,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-23 04:56:06,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.74 vs. limit=15.0 2023-06-23 04:57:05,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-23 04:57:06,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1404372.0, ans=0.125 2023-06-23 04:57:07,612 INFO [train.py:996] (0/4) Epoch 8, batch 20600, loss[loss=0.2394, simple_loss=0.3049, pruned_loss=0.08691, over 21692.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3148, pruned_loss=0.08417, over 4229099.22 frames. ], batch size: 263, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:57:34,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1404432.0, ans=0.1 2023-06-23 04:58:29,692 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-23 04:58:46,054 INFO [train.py:996] (0/4) Epoch 8, batch 20650, loss[loss=0.2549, simple_loss=0.3165, pruned_loss=0.09668, over 21558.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.313, pruned_loss=0.08558, over 4245456.46 frames. ], batch size: 548, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:58:49,140 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.528e+02 4.605e+02 7.657e+02 1.188e+03 2.326e+03, threshold=1.531e+03, percent-clipped=25.0 2023-06-23 04:59:35,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. limit=12.0 2023-06-23 05:00:23,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1404912.0, ans=0.125 2023-06-23 05:00:26,477 INFO [train.py:996] (0/4) Epoch 8, batch 20700, loss[loss=0.1887, simple_loss=0.2584, pruned_loss=0.05944, over 21237.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3062, pruned_loss=0.08233, over 4246475.19 frames. ], batch size: 176, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 05:00:36,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1404972.0, ans=0.125 2023-06-23 05:01:18,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1405092.0, ans=0.0 2023-06-23 05:01:26,969 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 05:01:44,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-23 05:01:57,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. 
limit=22.5 2023-06-23 05:02:07,899 INFO [train.py:996] (0/4) Epoch 8, batch 20750, loss[loss=0.2329, simple_loss=0.3457, pruned_loss=0.06003, over 20839.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3097, pruned_loss=0.08171, over 4248105.83 frames. ], batch size: 608, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 05:02:11,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.166e+02 4.120e+02 5.727e+02 9.009e+02 2.135e+03, threshold=1.145e+03, percent-clipped=5.0 2023-06-23 05:02:30,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-23 05:02:34,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1405332.0, ans=0.125 2023-06-23 05:02:36,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1405332.0, ans=0.04949747468305833 2023-06-23 05:03:47,514 INFO [train.py:996] (0/4) Epoch 8, batch 20800, loss[loss=0.2175, simple_loss=0.2771, pruned_loss=0.07898, over 21618.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3152, pruned_loss=0.08338, over 4253293.54 frames. ], batch size: 247, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:03:54,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1405572.0, ans=0.125 2023-06-23 05:04:51,706 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.02 vs. limit=15.0 2023-06-23 05:04:58,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1405752.0, ans=0.1 2023-06-23 05:05:20,449 INFO [train.py:996] (0/4) Epoch 8, batch 20850, loss[loss=0.2315, simple_loss=0.2957, pruned_loss=0.08358, over 21489.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3068, pruned_loss=0.08082, over 4261451.28 frames. ], batch size: 212, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:05:28,411 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.968e+02 4.772e+02 9.225e+02 1.220e+03 2.670e+03, threshold=1.845e+03, percent-clipped=33.0 2023-06-23 05:05:40,245 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-23 05:06:44,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1406112.0, ans=0.1 2023-06-23 05:06:57,116 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.98 vs. limit=15.0 2023-06-23 05:06:57,365 INFO [train.py:996] (0/4) Epoch 8, batch 20900, loss[loss=0.228, simple_loss=0.2969, pruned_loss=0.07954, over 21277.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3076, pruned_loss=0.08191, over 4271430.49 frames. ], batch size: 159, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:07:04,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1406172.0, ans=0.125 2023-06-23 05:07:34,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.41 vs. 
limit=12.0 2023-06-23 05:07:56,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1406292.0, ans=0.1 2023-06-23 05:08:03,509 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.04 vs. limit=10.0 2023-06-23 05:08:13,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1406352.0, ans=0.0 2023-06-23 05:08:21,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1406412.0, ans=0.2 2023-06-23 05:08:28,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1406412.0, ans=0.125 2023-06-23 05:08:33,364 INFO [train.py:996] (0/4) Epoch 8, batch 20950, loss[loss=0.1971, simple_loss=0.2763, pruned_loss=0.059, over 21744.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3025, pruned_loss=0.07803, over 4266187.64 frames. ], batch size: 332, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:08:36,657 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 4.522e+02 5.963e+02 9.256e+02 1.585e+03, threshold=1.193e+03, percent-clipped=0.0 2023-06-23 05:09:52,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1406652.0, ans=0.0 2023-06-23 05:09:54,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1406712.0, ans=0.1 2023-06-23 05:10:11,191 INFO [train.py:996] (0/4) Epoch 8, batch 21000, loss[loss=0.2575, simple_loss=0.324, pruned_loss=0.09555, over 21919.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3038, pruned_loss=0.07913, over 4259639.47 frames. ], batch size: 124, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:10:11,192 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 05:10:21,413 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7573, 1.6410, 1.5319, 2.0535, 1.3550, 1.9593, 1.7325, 1.6714], device='cuda:0') 2023-06-23 05:10:27,250 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2634, simple_loss=0.3611, pruned_loss=0.08288, over 1796401.00 frames. 2023-06-23 05:10:27,250 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 05:11:00,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1406832.0, ans=0.125 2023-06-23 05:11:00,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1406832.0, ans=0.1 2023-06-23 05:11:01,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1406832.0, ans=0.0 2023-06-23 05:11:02,346 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.31 vs. 
limit=22.5 2023-06-23 05:11:05,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1406832.0, ans=0.0 2023-06-23 05:11:06,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1406832.0, ans=0.0 2023-06-23 05:11:06,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1406832.0, ans=0.125 2023-06-23 05:11:30,386 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5 2023-06-23 05:11:46,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1407012.0, ans=0.2 2023-06-23 05:12:02,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1407072.0, ans=0.0 2023-06-23 05:12:04,114 INFO [train.py:996] (0/4) Epoch 8, batch 21050, loss[loss=0.2364, simple_loss=0.2934, pruned_loss=0.08964, over 21167.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3019, pruned_loss=0.07911, over 4250284.04 frames. ], batch size: 143, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:12:07,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.662e+02 4.992e+02 6.776e+02 1.028e+03 2.055e+03, threshold=1.355e+03, percent-clipped=16.0 2023-06-23 05:12:07,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1407072.0, ans=0.125 2023-06-23 05:12:58,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-23 05:13:02,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1407192.0, ans=0.125 2023-06-23 05:13:39,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1407312.0, ans=0.0 2023-06-23 05:13:42,100 INFO [train.py:996] (0/4) Epoch 8, batch 21100, loss[loss=0.2251, simple_loss=0.2871, pruned_loss=0.08155, over 21774.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2975, pruned_loss=0.07832, over 4253639.00 frames. ], batch size: 317, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:15:15,061 INFO [train.py:996] (0/4) Epoch 8, batch 21150, loss[loss=0.2029, simple_loss=0.2637, pruned_loss=0.07103, over 21662.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2935, pruned_loss=0.07786, over 4249502.54 frames. ], batch size: 282, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:15:17,995 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.999e+02 4.608e+02 5.910e+02 9.200e+02 1.578e+03, threshold=1.182e+03, percent-clipped=4.0 2023-06-23 05:16:05,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1407792.0, ans=0.125 2023-06-23 05:16:43,781 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-23 05:16:54,300 INFO [train.py:996] (0/4) Epoch 8, batch 21200, loss[loss=0.1829, simple_loss=0.2601, pruned_loss=0.05289, over 21594.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2908, pruned_loss=0.07824, over 4253371.58 frames. 
], batch size: 247, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:17:27,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1408032.0, ans=0.125 2023-06-23 05:17:59,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1408152.0, ans=0.125 2023-06-23 05:18:32,266 INFO [train.py:996] (0/4) Epoch 8, batch 21250, loss[loss=0.2108, simple_loss=0.2708, pruned_loss=0.07538, over 21660.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2885, pruned_loss=0.07807, over 4253902.64 frames. ], batch size: 282, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:18:41,960 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.312e+02 4.434e+02 5.437e+02 7.242e+02 2.137e+03, threshold=1.087e+03, percent-clipped=7.0 2023-06-23 05:18:49,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1408272.0, ans=0.125 2023-06-23 05:19:00,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1408272.0, ans=0.5 2023-06-23 05:19:03,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-23 05:19:08,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1408332.0, ans=0.125 2023-06-23 05:19:56,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1408512.0, ans=0.125 2023-06-23 05:20:11,403 INFO [train.py:996] (0/4) Epoch 8, batch 21300, loss[loss=0.2285, simple_loss=0.3096, pruned_loss=0.07371, over 21878.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2956, pruned_loss=0.0807, over 4266951.76 frames. ], batch size: 415, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:21:01,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1408692.0, ans=0.125 2023-06-23 05:21:38,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1408812.0, ans=0.125 2023-06-23 05:21:54,433 INFO [train.py:996] (0/4) Epoch 8, batch 21350, loss[loss=0.2058, simple_loss=0.2822, pruned_loss=0.06468, over 21347.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3002, pruned_loss=0.08198, over 4273880.42 frames. ], batch size: 131, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:22:10,269 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.040e+02 5.053e+02 6.684e+02 9.217e+02 2.330e+03, threshold=1.337e+03, percent-clipped=18.0 2023-06-23 05:22:51,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1408992.0, ans=0.125 2023-06-23 05:23:00,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.54 vs. limit=22.5 2023-06-23 05:23:25,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1409112.0, ans=0.125 2023-06-23 05:23:38,550 INFO [train.py:996] (0/4) Epoch 8, batch 21400, loss[loss=0.2537, simple_loss=0.3291, pruned_loss=0.08914, over 21936.00 frames. 
], tot_loss[loss=0.2317, simple_loss=0.3024, pruned_loss=0.08044, over 4281875.35 frames. ], batch size: 316, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:23:47,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1409172.0, ans=0.0 2023-06-23 05:24:02,870 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-23 05:24:37,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1409352.0, ans=0.1 2023-06-23 05:24:46,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1409352.0, ans=0.125 2023-06-23 05:24:55,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1409412.0, ans=0.0 2023-06-23 05:25:22,974 INFO [train.py:996] (0/4) Epoch 8, batch 21450, loss[loss=0.2552, simple_loss=0.3176, pruned_loss=0.09644, over 21466.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3072, pruned_loss=0.08259, over 4287807.11 frames. ], batch size: 211, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:25:26,942 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-23 05:25:28,998 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.984e+02 4.393e+02 5.335e+02 6.741e+02 1.398e+03, threshold=1.067e+03, percent-clipped=1.0 2023-06-23 05:26:04,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1409592.0, ans=0.125 2023-06-23 05:26:30,291 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5 2023-06-23 05:27:01,238 INFO [train.py:996] (0/4) Epoch 8, batch 21500, loss[loss=0.2363, simple_loss=0.2964, pruned_loss=0.08807, over 21740.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3056, pruned_loss=0.08334, over 4293072.11 frames. ], batch size: 112, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:28:06,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-23 05:28:39,694 INFO [train.py:996] (0/4) Epoch 8, batch 21550, loss[loss=0.1495, simple_loss=0.2229, pruned_loss=0.03801, over 21245.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.2986, pruned_loss=0.08132, over 4286032.65 frames. ], batch size: 176, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:28:46,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.014e+02 4.565e+02 6.143e+02 8.904e+02 1.889e+03, threshold=1.229e+03, percent-clipped=13.0 2023-06-23 05:29:07,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1410132.0, ans=0.0 2023-06-23 05:30:07,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1410312.0, ans=0.0 2023-06-23 05:30:26,692 INFO [train.py:996] (0/4) Epoch 8, batch 21600, loss[loss=0.1929, simple_loss=0.2556, pruned_loss=0.06512, over 21485.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2936, pruned_loss=0.07883, over 4283833.03 frames. 
], batch size: 230, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:30:54,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.86 vs. limit=8.0 2023-06-23 05:31:00,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1410492.0, ans=0.0 2023-06-23 05:31:21,049 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-23 05:32:05,261 INFO [train.py:996] (0/4) Epoch 8, batch 21650, loss[loss=0.2348, simple_loss=0.3194, pruned_loss=0.07512, over 21229.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2995, pruned_loss=0.0772, over 4285671.99 frames. ], batch size: 176, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:32:10,913 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.286e+02 5.401e+02 7.635e+02 1.107e+03 2.032e+03, threshold=1.527e+03, percent-clipped=20.0 2023-06-23 05:32:22,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1410732.0, ans=0.025 2023-06-23 05:32:39,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1410792.0, ans=0.0 2023-06-23 05:32:44,546 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-23 05:32:45,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1410792.0, ans=0.125 2023-06-23 05:32:59,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1410852.0, ans=0.0 2023-06-23 05:33:13,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1410912.0, ans=0.95 2023-06-23 05:33:36,443 INFO [train.py:996] (0/4) Epoch 8, batch 21700, loss[loss=0.2189, simple_loss=0.287, pruned_loss=0.07536, over 21719.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2999, pruned_loss=0.07564, over 4287107.88 frames. ], batch size: 282, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:33:38,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1410972.0, ans=0.125 2023-06-23 05:34:37,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1411152.0, ans=0.2 2023-06-23 05:34:58,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1411212.0, ans=0.125 2023-06-23 05:35:15,288 INFO [train.py:996] (0/4) Epoch 8, batch 21750, loss[loss=0.2284, simple_loss=0.2794, pruned_loss=0.08866, over 21196.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2964, pruned_loss=0.07623, over 4274176.06 frames. 
], batch size: 549, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:35:27,363 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.075e+02 4.568e+02 6.230e+02 8.144e+02 2.277e+03, threshold=1.246e+03, percent-clipped=1.0 2023-06-23 05:35:37,655 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 05:35:56,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1411392.0, ans=0.2 2023-06-23 05:36:27,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1411452.0, ans=0.5 2023-06-23 05:37:01,116 INFO [train.py:996] (0/4) Epoch 8, batch 21800, loss[loss=0.2384, simple_loss=0.298, pruned_loss=0.08941, over 21961.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2938, pruned_loss=0.07703, over 4278830.52 frames. ], batch size: 103, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:37:05,588 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.37 vs. limit=15.0 2023-06-23 05:37:34,266 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.43 vs. limit=10.0 2023-06-23 05:37:51,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1411752.0, ans=0.125 2023-06-23 05:38:17,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1411812.0, ans=0.1 2023-06-23 05:38:27,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1411812.0, ans=0.0 2023-06-23 05:38:35,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1411812.0, ans=0.0 2023-06-23 05:38:35,739 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-23 05:38:38,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-06-23 05:38:39,326 INFO [train.py:996] (0/4) Epoch 8, batch 21850, loss[loss=0.234, simple_loss=0.3402, pruned_loss=0.06396, over 21605.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2979, pruned_loss=0.07743, over 4273949.09 frames. ], batch size: 389, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:38:44,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1411872.0, ans=0.125 2023-06-23 05:38:47,509 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.413e+02 4.593e+02 6.628e+02 8.915e+02 2.617e+03, threshold=1.326e+03, percent-clipped=11.0 2023-06-23 05:40:02,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1412112.0, ans=0.125 2023-06-23 05:40:16,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1412112.0, ans=0.125 2023-06-23 05:40:20,496 INFO [train.py:996] (0/4) Epoch 8, batch 21900, loss[loss=0.243, simple_loss=0.3034, pruned_loss=0.09124, over 21797.00 frames. 
], tot_loss[loss=0.2279, simple_loss=0.2989, pruned_loss=0.07848, over 4267790.14 frames. ], batch size: 118, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:40:22,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1412172.0, ans=0.0 2023-06-23 05:40:40,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1412232.0, ans=0.0 2023-06-23 05:40:41,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1412232.0, ans=0.125 2023-06-23 05:40:51,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-06-23 05:42:00,053 INFO [train.py:996] (0/4) Epoch 8, batch 21950, loss[loss=0.2164, simple_loss=0.28, pruned_loss=0.07643, over 21584.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2942, pruned_loss=0.07795, over 4271831.26 frames. ], batch size: 263, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:42:07,950 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.901e+02 4.723e+02 6.314e+02 7.880e+02 1.650e+03, threshold=1.263e+03, percent-clipped=2.0 2023-06-23 05:42:48,995 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0 2023-06-23 05:43:40,036 INFO [train.py:996] (0/4) Epoch 8, batch 22000, loss[loss=0.2416, simple_loss=0.3042, pruned_loss=0.08947, over 21607.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2879, pruned_loss=0.07473, over 4266291.40 frames. ], batch size: 415, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:43:42,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1412772.0, ans=0.125 2023-06-23 05:43:42,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1412772.0, ans=0.07 2023-06-23 05:43:51,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1412772.0, ans=0.125 2023-06-23 05:43:58,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1412832.0, ans=0.125 2023-06-23 05:43:59,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1412832.0, ans=0.125 2023-06-23 05:44:08,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1412832.0, ans=0.2 2023-06-23 05:44:49,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.41 vs. limit=10.0 2023-06-23 05:45:20,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1413072.0, ans=0.0 2023-06-23 05:45:21,173 INFO [train.py:996] (0/4) Epoch 8, batch 22050, loss[loss=0.2575, simple_loss=0.3481, pruned_loss=0.08345, over 21783.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2958, pruned_loss=0.07741, over 4264422.67 frames. 
], batch size: 351, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:45:21,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1413072.0, ans=0.07 2023-06-23 05:45:33,014 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.843e+02 7.365e+02 1.302e+03 3.775e+03, threshold=1.473e+03, percent-clipped=26.0 2023-06-23 05:45:43,581 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-23 05:45:56,727 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-06-23 05:46:24,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=12.0 2023-06-23 05:46:28,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1413252.0, ans=0.2 2023-06-23 05:46:37,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1413252.0, ans=0.0 2023-06-23 05:46:46,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1413312.0, ans=0.2 2023-06-23 05:47:02,703 INFO [train.py:996] (0/4) Epoch 8, batch 22100, loss[loss=0.2782, simple_loss=0.3392, pruned_loss=0.1086, over 21584.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3046, pruned_loss=0.08193, over 4264539.89 frames. ], batch size: 471, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:48:30,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1413612.0, ans=0.125 2023-06-23 05:48:34,241 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 05:48:41,589 INFO [train.py:996] (0/4) Epoch 8, batch 22150, loss[loss=0.2117, simple_loss=0.292, pruned_loss=0.06572, over 21418.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3069, pruned_loss=0.08289, over 4276507.61 frames. ], batch size: 131, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:48:52,628 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.178e+02 4.829e+02 6.848e+02 1.021e+03 2.130e+03, threshold=1.370e+03, percent-clipped=6.0 2023-06-23 05:50:10,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1413912.0, ans=0.1 2023-06-23 05:50:10,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1413912.0, ans=0.0 2023-06-23 05:50:21,038 INFO [train.py:996] (0/4) Epoch 8, batch 22200, loss[loss=0.2522, simple_loss=0.3314, pruned_loss=0.08648, over 21257.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3102, pruned_loss=0.0842, over 4284251.31 frames. 
], batch size: 159, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:50:44,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1414032.0, ans=0.1 2023-06-23 05:50:58,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1414092.0, ans=0.125 2023-06-23 05:51:19,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1414092.0, ans=0.125 2023-06-23 05:51:43,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1414212.0, ans=0.125 2023-06-23 05:51:57,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.43 vs. limit=10.0 2023-06-23 05:52:00,989 INFO [train.py:996] (0/4) Epoch 8, batch 22250, loss[loss=0.3027, simple_loss=0.3728, pruned_loss=0.1163, over 21822.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.317, pruned_loss=0.08602, over 4285609.25 frames. ], batch size: 118, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:52:12,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.556e+02 5.033e+02 6.376e+02 9.699e+02 1.847e+03, threshold=1.275e+03, percent-clipped=11.0 2023-06-23 05:52:33,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1414332.0, ans=0.1 2023-06-23 05:52:40,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1414392.0, ans=0.125 2023-06-23 05:53:40,279 INFO [train.py:996] (0/4) Epoch 8, batch 22300, loss[loss=0.2436, simple_loss=0.301, pruned_loss=0.09312, over 21693.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3181, pruned_loss=0.08765, over 4291550.15 frames. ], batch size: 230, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:53:40,611 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 05:53:52,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1414572.0, ans=0.125 2023-06-23 05:53:55,341 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-23 05:54:10,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1414692.0, ans=0.2 2023-06-23 05:54:53,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1414752.0, ans=0.125 2023-06-23 05:55:13,977 INFO [train.py:996] (0/4) Epoch 8, batch 22350, loss[loss=0.241, simple_loss=0.3125, pruned_loss=0.08472, over 21737.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3158, pruned_loss=0.08791, over 4298784.58 frames. ], batch size: 112, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:55:25,673 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.473e+02 4.765e+02 6.117e+02 7.891e+02 1.509e+03, threshold=1.223e+03, percent-clipped=2.0 2023-06-23 05:55:30,112 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.21 vs. 
limit=15.0 2023-06-23 05:55:39,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1414932.0, ans=0.0 2023-06-23 05:55:52,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1414992.0, ans=0.125 2023-06-23 05:56:18,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1415052.0, ans=0.125 2023-06-23 05:56:48,481 INFO [train.py:996] (0/4) Epoch 8, batch 22400, loss[loss=0.2307, simple_loss=0.3002, pruned_loss=0.08064, over 21510.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3126, pruned_loss=0.0851, over 4291147.83 frames. ], batch size: 441, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 05:57:25,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1415232.0, ans=0.0 2023-06-23 05:57:59,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1415352.0, ans=0.125 2023-06-23 05:58:25,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1415472.0, ans=0.02 2023-06-23 05:58:25,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1415472.0, ans=0.125 2023-06-23 05:58:26,833 INFO [train.py:996] (0/4) Epoch 8, batch 22450, loss[loss=0.2841, simple_loss=0.3428, pruned_loss=0.1127, over 19979.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3068, pruned_loss=0.08396, over 4284106.34 frames. ], batch size: 702, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 05:58:37,908 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.026e+02 3.949e+02 5.140e+02 7.263e+02 1.360e+03, threshold=1.028e+03, percent-clipped=2.0 2023-06-23 05:58:39,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1415472.0, ans=0.125 2023-06-23 05:59:37,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1415652.0, ans=0.0 2023-06-23 06:00:03,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.84 vs. limit=10.0 2023-06-23 06:00:06,821 INFO [train.py:996] (0/4) Epoch 8, batch 22500, loss[loss=0.2146, simple_loss=0.2867, pruned_loss=0.07126, over 21783.00 frames. ], tot_loss[loss=0.234, simple_loss=0.302, pruned_loss=0.08299, over 4277515.40 frames. ], batch size: 124, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:00:18,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1415772.0, ans=0.1 2023-06-23 06:00:21,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1415772.0, ans=22.5 2023-06-23 06:00:59,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1415892.0, ans=0.125 2023-06-23 06:01:24,596 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-236000.pt 2023-06-23 06:01:47,310 INFO [train.py:996] (0/4) Epoch 8, batch 22550, loss[loss=0.2192, simple_loss=0.2872, pruned_loss=0.07562, over 21553.00 frames. 
], tot_loss[loss=0.2362, simple_loss=0.3058, pruned_loss=0.08332, over 4278075.23 frames. ], batch size: 212, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:01:48,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.06 vs. limit=15.0 2023-06-23 06:02:04,065 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.380e+02 5.264e+02 6.977e+02 1.047e+03 2.151e+03, threshold=1.395e+03, percent-clipped=25.0 2023-06-23 06:02:17,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1416132.0, ans=0.125 2023-06-23 06:02:34,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1416192.0, ans=0.125 2023-06-23 06:02:39,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1416192.0, ans=0.125 2023-06-23 06:03:07,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1416312.0, ans=0.0 2023-06-23 06:03:26,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1416312.0, ans=0.1 2023-06-23 06:03:29,281 INFO [train.py:996] (0/4) Epoch 8, batch 22600, loss[loss=0.2418, simple_loss=0.3192, pruned_loss=0.08213, over 21811.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3082, pruned_loss=0.08374, over 4282553.17 frames. ], batch size: 316, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:04:03,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1416432.0, ans=0.125 2023-06-23 06:04:07,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1416432.0, ans=0.04949747468305833 2023-06-23 06:04:21,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2023-06-23 06:04:46,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.22 vs. limit=15.0 2023-06-23 06:04:47,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1416612.0, ans=0.125 2023-06-23 06:05:05,359 INFO [train.py:996] (0/4) Epoch 8, batch 22650, loss[loss=0.235, simple_loss=0.2918, pruned_loss=0.0891, over 21870.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3058, pruned_loss=0.0835, over 4278717.00 frames. ], batch size: 98, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:05:21,129 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.542e+02 6.131e+02 9.012e+02 1.354e+03 2.560e+03, threshold=1.802e+03, percent-clipped=24.0 2023-06-23 06:05:33,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.37 vs. limit=10.0 2023-06-23 06:06:13,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1416852.0, ans=0.125 2023-06-23 06:06:37,807 INFO [train.py:996] (0/4) Epoch 8, batch 22700, loss[loss=0.2212, simple_loss=0.2803, pruned_loss=0.08098, over 21347.00 frames. 
], tot_loss[loss=0.2323, simple_loss=0.2991, pruned_loss=0.08277, over 4271539.27 frames. ], batch size: 160, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:07:37,373 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=15.0 2023-06-23 06:07:42,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1417152.0, ans=0.1 2023-06-23 06:08:07,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1417212.0, ans=0.025 2023-06-23 06:08:08,920 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:08:16,130 INFO [train.py:996] (0/4) Epoch 8, batch 22750, loss[loss=0.2276, simple_loss=0.3345, pruned_loss=0.06036, over 19748.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3003, pruned_loss=0.08386, over 4278374.94 frames. ], batch size: 703, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:08:31,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=22.5 2023-06-23 06:08:31,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.389e+02 4.804e+02 6.420e+02 9.928e+02 2.099e+03, threshold=1.284e+03, percent-clipped=4.0 2023-06-23 06:09:04,316 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-06-23 06:09:06,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1417392.0, ans=0.0 2023-06-23 06:09:22,892 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-23 06:09:45,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1417512.0, ans=0.125 2023-06-23 06:09:48,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1417512.0, ans=0.125 2023-06-23 06:09:54,225 INFO [train.py:996] (0/4) Epoch 8, batch 22800, loss[loss=0.2297, simple_loss=0.2909, pruned_loss=0.08426, over 21677.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3064, pruned_loss=0.08704, over 4287955.40 frames. ], batch size: 263, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:10:32,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1417632.0, ans=0.125 2023-06-23 06:10:36,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1417692.0, ans=0.125 2023-06-23 06:11:32,468 INFO [train.py:996] (0/4) Epoch 8, batch 22850, loss[loss=0.2252, simple_loss=0.282, pruned_loss=0.08427, over 21380.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3027, pruned_loss=0.0858, over 4292941.84 frames. 
], batch size: 144, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:11:49,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.448e+02 5.341e+02 7.317e+02 9.622e+02 1.873e+03, threshold=1.463e+03, percent-clipped=13.0 2023-06-23 06:12:04,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1417932.0, ans=0.125 2023-06-23 06:12:14,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1417992.0, ans=0.0 2023-06-23 06:13:07,130 INFO [train.py:996] (0/4) Epoch 8, batch 22900, loss[loss=0.2504, simple_loss=0.3696, pruned_loss=0.06562, over 21194.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3054, pruned_loss=0.08463, over 4284706.80 frames. ], batch size: 548, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:13:15,258 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-23 06:13:44,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1418232.0, ans=0.125 2023-06-23 06:14:10,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1418352.0, ans=0.125 2023-06-23 06:14:43,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=22.5 2023-06-23 06:14:56,824 INFO [train.py:996] (0/4) Epoch 8, batch 22950, loss[loss=0.263, simple_loss=0.3967, pruned_loss=0.06458, over 21339.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3178, pruned_loss=0.08301, over 4276330.05 frames. ], batch size: 548, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:15:03,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1418472.0, ans=0.1 2023-06-23 06:15:09,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1418472.0, ans=15.0 2023-06-23 06:15:10,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.029e+02 4.953e+02 7.269e+02 1.039e+03 2.026e+03, threshold=1.454e+03, percent-clipped=12.0 2023-06-23 06:15:20,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1418532.0, ans=0.2 2023-06-23 06:15:24,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1418532.0, ans=0.04949747468305833 2023-06-23 06:15:29,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1418592.0, ans=0.2 2023-06-23 06:15:34,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1418592.0, ans=0.125 2023-06-23 06:15:46,582 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.81 vs. limit=15.0 2023-06-23 06:16:11,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1418712.0, ans=0.125 2023-06-23 06:16:36,285 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. 
limit=15.0 2023-06-23 06:16:36,823 INFO [train.py:996] (0/4) Epoch 8, batch 23000, loss[loss=0.2691, simple_loss=0.3581, pruned_loss=0.09002, over 20036.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3186, pruned_loss=0.0815, over 4279847.56 frames. ], batch size: 703, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:16:40,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=22.5 2023-06-23 06:16:58,592 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-23 06:17:55,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1419012.0, ans=0.0 2023-06-23 06:17:57,437 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=22.5 2023-06-23 06:18:01,320 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:18:12,384 INFO [train.py:996] (0/4) Epoch 8, batch 23050, loss[loss=0.2443, simple_loss=0.3245, pruned_loss=0.0821, over 20798.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3189, pruned_loss=0.08312, over 4272627.87 frames. ], batch size: 607, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:18:22,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1419072.0, ans=0.2 2023-06-23 06:18:25,320 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.192e+02 4.592e+02 5.368e+02 6.927e+02 1.540e+03, threshold=1.074e+03, percent-clipped=1.0 2023-06-23 06:19:03,132 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-06-23 06:19:31,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1419312.0, ans=0.125 2023-06-23 06:19:33,477 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-23 06:19:42,993 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-23 06:19:43,270 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.57 vs. limit=5.0 2023-06-23 06:19:45,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1419372.0, ans=0.125 2023-06-23 06:19:47,039 INFO [train.py:996] (0/4) Epoch 8, batch 23100, loss[loss=0.194, simple_loss=0.2556, pruned_loss=0.06618, over 21597.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3152, pruned_loss=0.08433, over 4261665.08 frames. ], batch size: 231, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:20:00,754 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:20:23,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.69 vs. 
limit=5.0 2023-06-23 06:20:33,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1419492.0, ans=0.125 2023-06-23 06:21:21,809 INFO [train.py:996] (0/4) Epoch 8, batch 23150, loss[loss=0.2055, simple_loss=0.2807, pruned_loss=0.06519, over 21200.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3088, pruned_loss=0.08342, over 4259177.22 frames. ], batch size: 548, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:21:34,634 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.444e+02 4.721e+02 6.329e+02 9.421e+02 1.968e+03, threshold=1.266e+03, percent-clipped=20.0 2023-06-23 06:21:44,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1419732.0, ans=0.2 2023-06-23 06:21:50,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1419732.0, ans=0.125 2023-06-23 06:21:51,567 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.52 vs. limit=22.5 2023-06-23 06:21:52,793 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-23 06:21:57,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1419792.0, ans=0.2 2023-06-23 06:22:23,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=15.0 2023-06-23 06:22:42,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1419912.0, ans=0.05 2023-06-23 06:22:50,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1419912.0, ans=0.2 2023-06-23 06:22:59,193 INFO [train.py:996] (0/4) Epoch 8, batch 23200, loss[loss=0.2542, simple_loss=0.3269, pruned_loss=0.09078, over 21372.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.306, pruned_loss=0.08333, over 4262079.92 frames. ], batch size: 131, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:23:01,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1419972.0, ans=0.0 2023-06-23 06:23:13,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=22.5 2023-06-23 06:23:28,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1420032.0, ans=0.0 2023-06-23 06:23:31,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1420092.0, ans=0.125 2023-06-23 06:23:54,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.31 vs. 
limit=15.0 2023-06-23 06:24:24,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1420212.0, ans=0.125 2023-06-23 06:24:26,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1420212.0, ans=0.2 2023-06-23 06:24:36,819 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:24:37,812 INFO [train.py:996] (0/4) Epoch 8, batch 23250, loss[loss=0.2748, simple_loss=0.3323, pruned_loss=0.1086, over 21859.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3075, pruned_loss=0.08509, over 4274712.46 frames. ], batch size: 414, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:24:50,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.546e+02 4.969e+02 6.559e+02 1.052e+03 2.390e+03, threshold=1.312e+03, percent-clipped=18.0 2023-06-23 06:25:05,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1420332.0, ans=0.0 2023-06-23 06:25:27,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1420392.0, ans=0.125 2023-06-23 06:25:30,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1420392.0, ans=0.0 2023-06-23 06:25:49,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1420452.0, ans=0.0 2023-06-23 06:25:54,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1420452.0, ans=0.0 2023-06-23 06:26:18,074 INFO [train.py:996] (0/4) Epoch 8, batch 23300, loss[loss=0.3494, simple_loss=0.4447, pruned_loss=0.1271, over 21527.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3167, pruned_loss=0.08706, over 4283192.37 frames. ], batch size: 471, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:27:01,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1420692.0, ans=0.1 2023-06-23 06:27:30,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=22.5 2023-06-23 06:27:39,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1420752.0, ans=0.0 2023-06-23 06:27:58,461 INFO [train.py:996] (0/4) Epoch 8, batch 23350, loss[loss=0.1993, simple_loss=0.2857, pruned_loss=0.05647, over 21686.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3203, pruned_loss=0.08604, over 4275380.50 frames. ], batch size: 298, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:27:58,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1420872.0, ans=0.125 2023-06-23 06:28:18,036 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.315e+02 4.912e+02 6.155e+02 8.820e+02 1.771e+03, threshold=1.231e+03, percent-clipped=5.0 2023-06-23 06:28:18,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1420932.0, ans=0.0 2023-06-23 06:29:11,138 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. 
limit=15.0 2023-06-23 06:29:37,124 INFO [train.py:996] (0/4) Epoch 8, batch 23400, loss[loss=0.2244, simple_loss=0.2908, pruned_loss=0.07903, over 21478.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3135, pruned_loss=0.08203, over 4283872.80 frames. ], batch size: 211, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:31:20,236 INFO [train.py:996] (0/4) Epoch 8, batch 23450, loss[loss=0.2874, simple_loss=0.3538, pruned_loss=0.1105, over 21350.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3156, pruned_loss=0.08565, over 4284049.61 frames. ], batch size: 143, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:31:24,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1421472.0, ans=0.0 2023-06-23 06:31:38,561 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.252e+02 4.296e+02 5.237e+02 7.563e+02 1.579e+03, threshold=1.047e+03, percent-clipped=8.0 2023-06-23 06:31:58,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1421532.0, ans=0.0 2023-06-23 06:32:58,374 INFO [train.py:996] (0/4) Epoch 8, batch 23500, loss[loss=0.2472, simple_loss=0.3115, pruned_loss=0.09142, over 21789.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3178, pruned_loss=0.08767, over 4282219.94 frames. ], batch size: 389, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:34:01,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1421952.0, ans=0.2 2023-06-23 06:34:07,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=22.5 2023-06-23 06:34:26,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1422012.0, ans=0.125 2023-06-23 06:34:35,807 INFO [train.py:996] (0/4) Epoch 8, batch 23550, loss[loss=0.236, simple_loss=0.2856, pruned_loss=0.09314, over 21382.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3136, pruned_loss=0.08686, over 4265280.27 frames. ], batch size: 473, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:34:54,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.322e+02 4.998e+02 7.038e+02 9.548e+02 2.153e+03, threshold=1.408e+03, percent-clipped=14.0 2023-06-23 06:34:56,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1422132.0, ans=0.125 2023-06-23 06:35:06,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1422132.0, ans=0.125 2023-06-23 06:35:25,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1422192.0, ans=0.0 2023-06-23 06:35:30,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1422192.0, ans=0.125 2023-06-23 06:36:08,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1422312.0, ans=0.1 2023-06-23 06:36:18,160 INFO [train.py:996] (0/4) Epoch 8, batch 23600, loss[loss=0.2709, simple_loss=0.3416, pruned_loss=0.1001, over 21870.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3125, pruned_loss=0.08687, over 4258297.64 frames. 
], batch size: 371, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:36:54,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1422432.0, ans=0.125 2023-06-23 06:37:13,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-06-23 06:37:38,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1422612.0, ans=0.125 2023-06-23 06:37:58,170 INFO [train.py:996] (0/4) Epoch 8, batch 23650, loss[loss=0.2415, simple_loss=0.3171, pruned_loss=0.08301, over 20102.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3121, pruned_loss=0.08396, over 4260331.65 frames. ], batch size: 702, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:38:10,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1422672.0, ans=0.0 2023-06-23 06:38:22,832 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.610e+02 4.602e+02 5.917e+02 8.221e+02 1.589e+03, threshold=1.183e+03, percent-clipped=3.0 2023-06-23 06:38:39,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1422792.0, ans=0.125 2023-06-23 06:39:24,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1422912.0, ans=0.125 2023-06-23 06:39:48,552 INFO [train.py:996] (0/4) Epoch 8, batch 23700, loss[loss=0.2484, simple_loss=0.3228, pruned_loss=0.08701, over 21401.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3144, pruned_loss=0.08361, over 4267911.84 frames. ], batch size: 176, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:39:50,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1422972.0, ans=0.0 2023-06-23 06:40:38,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1423092.0, ans=0.0 2023-06-23 06:41:28,757 INFO [train.py:996] (0/4) Epoch 8, batch 23750, loss[loss=0.2041, simple_loss=0.3125, pruned_loss=0.04785, over 21628.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.316, pruned_loss=0.08341, over 4264139.87 frames. ], batch size: 414, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:41:31,349 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.86 vs. 
limit=12.0 2023-06-23 06:41:41,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1423272.0, ans=0.025 2023-06-23 06:41:42,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.157e+02 4.173e+02 5.450e+02 7.281e+02 1.269e+03, threshold=1.090e+03, percent-clipped=1.0 2023-06-23 06:41:44,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1423332.0, ans=0.125 2023-06-23 06:41:46,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1423332.0, ans=0.0 2023-06-23 06:42:11,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1423392.0, ans=0.0 2023-06-23 06:43:07,498 INFO [train.py:996] (0/4) Epoch 8, batch 23800, loss[loss=0.2255, simple_loss=0.2982, pruned_loss=0.07636, over 21366.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3141, pruned_loss=0.08089, over 4266379.96 frames. ], batch size: 131, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:44:26,686 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.72 vs. limit=15.0 2023-06-23 06:44:40,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1423812.0, ans=0.125 2023-06-23 06:44:47,960 INFO [train.py:996] (0/4) Epoch 8, batch 23850, loss[loss=0.2793, simple_loss=0.3468, pruned_loss=0.1059, over 21447.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3216, pruned_loss=0.08342, over 4267989.15 frames. ], batch size: 549, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:44:53,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1423872.0, ans=0.125 2023-06-23 06:44:59,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1423872.0, ans=0.125 2023-06-23 06:45:07,746 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.341e+02 5.290e+02 6.961e+02 9.016e+02 2.497e+03, threshold=1.392e+03, percent-clipped=15.0 2023-06-23 06:45:24,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1423932.0, ans=0.07 2023-06-23 06:46:16,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1424112.0, ans=0.0 2023-06-23 06:46:33,156 INFO [train.py:996] (0/4) Epoch 8, batch 23900, loss[loss=0.232, simple_loss=0.3182, pruned_loss=0.07296, over 21655.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3273, pruned_loss=0.08525, over 4270085.18 frames. ], batch size: 332, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:48:06,342 INFO [train.py:996] (0/4) Epoch 8, batch 23950, loss[loss=0.2274, simple_loss=0.2992, pruned_loss=0.07784, over 21933.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3219, pruned_loss=0.08471, over 4272360.66 frames. 
], batch size: 317, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:48:25,224 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.604e+02 5.747e+02 7.946e+02 1.092e+03 1.988e+03, threshold=1.589e+03, percent-clipped=11.0 2023-06-23 06:48:31,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1424532.0, ans=0.0 2023-06-23 06:48:59,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1424592.0, ans=6.0 2023-06-23 06:49:20,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1424652.0, ans=0.125 2023-06-23 06:49:45,037 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-23 06:49:45,378 INFO [train.py:996] (0/4) Epoch 8, batch 24000, loss[loss=0.2513, simple_loss=0.3689, pruned_loss=0.06687, over 19889.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3217, pruned_loss=0.08633, over 4267746.39 frames. ], batch size: 703, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:49:45,379 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 06:49:59,536 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.7906, 3.0778, 3.0199, 1.6325], device='cuda:0') 2023-06-23 06:50:04,122 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2639, simple_loss=0.3603, pruned_loss=0.08376, over 1796401.00 frames. 2023-06-23 06:50:04,123 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 06:50:18,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1424772.0, ans=0.0 2023-06-23 06:50:48,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1424892.0, ans=0.1 2023-06-23 06:50:55,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=22.5 2023-06-23 06:51:43,407 INFO [train.py:996] (0/4) Epoch 8, batch 24050, loss[loss=0.2413, simple_loss=0.3271, pruned_loss=0.07772, over 21843.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3241, pruned_loss=0.08758, over 4266563.65 frames. ], batch size: 371, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:51:55,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1425072.0, ans=0.125 2023-06-23 06:52:08,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.279e+02 4.739e+02 5.574e+02 8.138e+02 1.478e+03, threshold=1.115e+03, percent-clipped=0.0 2023-06-23 06:52:33,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1425192.0, ans=0.0 2023-06-23 06:53:28,941 INFO [train.py:996] (0/4) Epoch 8, batch 24100, loss[loss=0.2675, simple_loss=0.3457, pruned_loss=0.09464, over 21261.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3246, pruned_loss=0.0867, over 4270810.13 frames. 
], batch size: 548, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:53:54,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1425432.0, ans=0.0 2023-06-23 06:54:09,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1425492.0, ans=0.125 2023-06-23 06:54:13,599 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=12.0 2023-06-23 06:54:32,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1425552.0, ans=0.125 2023-06-23 06:54:49,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1425612.0, ans=0.0 2023-06-23 06:54:59,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1425612.0, ans=0.125 2023-06-23 06:55:07,259 INFO [train.py:996] (0/4) Epoch 8, batch 24150, loss[loss=0.1996, simple_loss=0.2512, pruned_loss=0.07398, over 20336.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3234, pruned_loss=0.08818, over 4270630.79 frames. ], batch size: 703, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:55:08,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=22.5 2023-06-23 06:55:11,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1425672.0, ans=0.125 2023-06-23 06:55:16,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1425672.0, ans=0.0 2023-06-23 06:55:22,328 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.625e+02 4.851e+02 6.515e+02 9.296e+02 1.728e+03, threshold=1.303e+03, percent-clipped=14.0 2023-06-23 06:55:50,192 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=22.5 2023-06-23 06:56:07,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1425852.0, ans=0.125 2023-06-23 06:56:43,547 INFO [train.py:996] (0/4) Epoch 8, batch 24200, loss[loss=0.2432, simple_loss=0.3268, pruned_loss=0.07984, over 21730.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3269, pruned_loss=0.08991, over 4279439.63 frames. ], batch size: 298, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:57:18,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-23 06:57:24,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1426092.0, ans=0.2 2023-06-23 06:58:11,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.62 vs. 
limit=15.0 2023-06-23 06:58:14,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1426212.0, ans=0.015 2023-06-23 06:58:21,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1426212.0, ans=0.0 2023-06-23 06:58:25,362 INFO [train.py:996] (0/4) Epoch 8, batch 24250, loss[loss=0.1859, simple_loss=0.2874, pruned_loss=0.04215, over 21827.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.323, pruned_loss=0.08443, over 4269084.03 frames. ], batch size: 316, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:58:44,561 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.260e+02 4.495e+02 7.277e+02 1.167e+03 2.451e+03, threshold=1.455e+03, percent-clipped=16.0 2023-06-23 06:58:51,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1426332.0, ans=0.125 2023-06-23 06:59:38,124 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:00:04,127 INFO [train.py:996] (0/4) Epoch 8, batch 24300, loss[loss=0.2031, simple_loss=0.2853, pruned_loss=0.06048, over 21809.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3163, pruned_loss=0.07868, over 4275791.31 frames. ], batch size: 316, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:00:32,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1426632.0, ans=0.125 2023-06-23 07:01:46,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-23 07:01:47,238 INFO [train.py:996] (0/4) Epoch 8, batch 24350, loss[loss=0.2981, simple_loss=0.3659, pruned_loss=0.1151, over 21788.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3128, pruned_loss=0.07861, over 4283280.83 frames. ], batch size: 124, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:01:52,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1426872.0, ans=0.0 2023-06-23 07:02:03,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.048e+02 4.784e+02 6.670e+02 9.592e+02 1.817e+03, threshold=1.334e+03, percent-clipped=7.0 2023-06-23 07:03:22,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1427112.0, ans=0.125 2023-06-23 07:03:27,454 INFO [train.py:996] (0/4) Epoch 8, batch 24400, loss[loss=0.2319, simple_loss=0.3202, pruned_loss=0.07177, over 17188.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3168, pruned_loss=0.08117, over 4277847.33 frames. ], batch size: 60, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 07:03:59,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1427232.0, ans=0.0 2023-06-23 07:05:07,077 INFO [train.py:996] (0/4) Epoch 8, batch 24450, loss[loss=0.2506, simple_loss=0.312, pruned_loss=0.09457, over 21721.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3178, pruned_loss=0.08265, over 4272370.66 frames. 
], batch size: 333, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 07:05:09,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1427472.0, ans=0.2 2023-06-23 07:05:23,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.696e+02 5.464e+02 7.459e+02 1.124e+03 2.090e+03, threshold=1.492e+03, percent-clipped=14.0 2023-06-23 07:06:26,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1427652.0, ans=0.125 2023-06-23 07:06:35,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1427712.0, ans=0.125 2023-06-23 07:06:40,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1427712.0, ans=0.025 2023-06-23 07:06:44,612 INFO [train.py:996] (0/4) Epoch 8, batch 24500, loss[loss=0.2235, simple_loss=0.3, pruned_loss=0.07352, over 21790.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.32, pruned_loss=0.08349, over 4280208.13 frames. ], batch size: 247, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 07:06:57,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1427772.0, ans=0.1 2023-06-23 07:06:59,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1427832.0, ans=0.1 2023-06-23 07:07:24,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1427892.0, ans=0.0 2023-06-23 07:07:27,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1427892.0, ans=0.07 2023-06-23 07:07:35,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1427892.0, ans=0.0 2023-06-23 07:08:13,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1428012.0, ans=0.125 2023-06-23 07:08:24,403 INFO [train.py:996] (0/4) Epoch 8, batch 24550, loss[loss=0.2362, simple_loss=0.3135, pruned_loss=0.07948, over 21734.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3236, pruned_loss=0.08613, over 4288344.88 frames. ], batch size: 332, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:08:43,972 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-23 07:08:50,844 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.336e+02 4.677e+02 6.091e+02 7.782e+02 1.609e+03, threshold=1.218e+03, percent-clipped=3.0 2023-06-23 07:09:11,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-23 07:09:17,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1428192.0, ans=0.0 2023-06-23 07:09:19,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.71 vs. 
limit=15.0 2023-06-23 07:09:42,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1428252.0, ans=0.125 2023-06-23 07:09:44,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1428312.0, ans=0.2 2023-06-23 07:09:53,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1428312.0, ans=0.125 2023-06-23 07:10:02,258 INFO [train.py:996] (0/4) Epoch 8, batch 24600, loss[loss=0.2238, simple_loss=0.2815, pruned_loss=0.08304, over 21607.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3193, pruned_loss=0.08593, over 4287107.77 frames. ], batch size: 231, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:10:17,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1428372.0, ans=0.0 2023-06-23 07:10:20,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-23 07:10:43,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1428432.0, ans=0.125 2023-06-23 07:10:45,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1428432.0, ans=0.125 2023-06-23 07:11:05,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.53 vs. limit=10.0 2023-06-23 07:11:12,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1428552.0, ans=0.2 2023-06-23 07:11:40,905 INFO [train.py:996] (0/4) Epoch 8, batch 24650, loss[loss=0.1906, simple_loss=0.2617, pruned_loss=0.0598, over 21472.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3127, pruned_loss=0.08517, over 4277291.29 frames. ], batch size: 195, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:12:03,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1428672.0, ans=0.0 2023-06-23 07:12:13,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.435e+02 5.561e+02 8.132e+02 1.139e+03 1.963e+03, threshold=1.626e+03, percent-clipped=16.0 2023-06-23 07:12:38,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1428792.0, ans=0.0 2023-06-23 07:13:06,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1428912.0, ans=0.125 2023-06-23 07:13:08,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=22.5 2023-06-23 07:13:19,838 INFO [train.py:996] (0/4) Epoch 8, batch 24700, loss[loss=0.2494, simple_loss=0.3014, pruned_loss=0.09871, over 21438.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.31, pruned_loss=0.08338, over 4272458.52 frames. 
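Each "Clipping_scale=2.0, grad-norm quartiles ..." entry prints five quantiles (min, 25%, median, 75%, max) of recently observed gradient norms together with the active clipping threshold. In every such entry in this section the threshold equals the clipping scale times the median (for instance, 2.0 × 8.132e+02 ≈ 1.626e+03 in the entry above), and percent-clipped is the share of batches whose norm exceeded that threshold. The code below is an illustrative reconstruction of that statistic from a window of recorded norms, not the optimizer implementation itself; the window size of 500 is an arbitrary assumption.

```python
# Illustrative reconstruction of the "grad-norm quartiles ... threshold ...
# percent-clipped" log line. The window size (500 norms) is an assumption.

import numpy as np


def clipping_stats(grad_norms: np.ndarray, clipping_scale: float = 2.0):
    """Return (quartiles, threshold, percent_clipped) for a window of grad norms."""
    quartiles = np.quantile(grad_norms, [0.0, 0.25, 0.5, 0.75, 1.0])
    threshold = clipping_scale * quartiles[2]            # scale times the median
    percent_clipped = 100.0 * np.mean(grad_norms > threshold)
    return quartiles, threshold, percent_clipped


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    norms = rng.lognormal(mean=6.5, sigma=0.5, size=500)  # fake gradient norms
    q, thr, pct = clipping_stats(norms)
    print("grad-norm quartiles", " ".join(f"{v:.3e}" for v in q),
          f"threshold={thr:.3e}, percent-clipped={pct:.1f}")
```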
], batch size: 509, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:13:46,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1429032.0, ans=0.125 2023-06-23 07:14:47,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1429212.0, ans=0.1 2023-06-23 07:14:52,866 INFO [train.py:996] (0/4) Epoch 8, batch 24750, loss[loss=0.23, simple_loss=0.2852, pruned_loss=0.0874, over 14979.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3035, pruned_loss=0.08021, over 4269536.84 frames. ], batch size: 62, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:15:19,819 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.105e+02 4.833e+02 6.692e+02 9.106e+02 2.171e+03, threshold=1.338e+03, percent-clipped=2.0 2023-06-23 07:15:45,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1429392.0, ans=0.0 2023-06-23 07:16:31,276 INFO [train.py:996] (0/4) Epoch 8, batch 24800, loss[loss=0.2399, simple_loss=0.3017, pruned_loss=0.08904, over 21614.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2973, pruned_loss=0.07957, over 4265000.14 frames. ], batch size: 441, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:16:35,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1429572.0, ans=0.2 2023-06-23 07:16:38,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-23 07:17:36,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1429752.0, ans=0.0 2023-06-23 07:17:39,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1429752.0, ans=0.125 2023-06-23 07:17:58,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1429812.0, ans=0.125 2023-06-23 07:18:00,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-23 07:18:03,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1429872.0, ans=0.0 2023-06-23 07:18:04,070 INFO [train.py:996] (0/4) Epoch 8, batch 24850, loss[loss=0.2504, simple_loss=0.3163, pruned_loss=0.09224, over 21157.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2969, pruned_loss=0.08061, over 4267914.38 frames. ], batch size: 608, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:18:19,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1429872.0, ans=0.0 2023-06-23 07:18:33,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.302e+02 4.718e+02 6.141e+02 8.581e+02 1.389e+03, threshold=1.228e+03, percent-clipped=1.0 2023-06-23 07:19:06,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1429992.0, ans=0.2 2023-06-23 07:19:24,298 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.71 vs. 
limit=15.0 2023-06-23 07:19:49,223 INFO [train.py:996] (0/4) Epoch 8, batch 24900, loss[loss=0.2053, simple_loss=0.2737, pruned_loss=0.06847, over 21718.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2995, pruned_loss=0.08093, over 4270494.44 frames. ], batch size: 247, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:19:51,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1430172.0, ans=0.0 2023-06-23 07:20:34,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1430292.0, ans=0.125 2023-06-23 07:20:44,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1430292.0, ans=0.0 2023-06-23 07:20:47,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1430292.0, ans=0.1 2023-06-23 07:21:34,320 INFO [train.py:996] (0/4) Epoch 8, batch 24950, loss[loss=0.2122, simple_loss=0.257, pruned_loss=0.08373, over 20322.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3074, pruned_loss=0.08529, over 4274523.08 frames. ], batch size: 703, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:22:03,165 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.572e+02 4.703e+02 5.868e+02 8.505e+02 2.192e+03, threshold=1.174e+03, percent-clipped=6.0 2023-06-23 07:22:25,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-23 07:22:31,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.44 vs. limit=15.0 2023-06-23 07:23:20,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1430772.0, ans=0.0 2023-06-23 07:23:21,902 INFO [train.py:996] (0/4) Epoch 8, batch 25000, loss[loss=0.2647, simple_loss=0.3601, pruned_loss=0.08471, over 16681.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3141, pruned_loss=0.08746, over 4269368.48 frames. ], batch size: 60, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:23:28,854 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:24:46,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1431012.0, ans=0.125 2023-06-23 07:24:53,451 INFO [train.py:996] (0/4) Epoch 8, batch 25050, loss[loss=0.2108, simple_loss=0.2667, pruned_loss=0.07746, over 21223.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3069, pruned_loss=0.08554, over 4277305.46 frames. ], batch size: 549, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:25:08,982 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.62 vs. 
limit=22.5 2023-06-23 07:25:17,637 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 4.487e+02 5.838e+02 7.912e+02 1.332e+03, threshold=1.168e+03, percent-clipped=3.0 2023-06-23 07:25:18,043 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:25:25,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1431132.0, ans=0.125 2023-06-23 07:25:30,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=12.0 2023-06-23 07:25:33,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1431192.0, ans=0.0 2023-06-23 07:26:33,600 INFO [train.py:996] (0/4) Epoch 8, batch 25100, loss[loss=0.2017, simple_loss=0.2713, pruned_loss=0.06608, over 21554.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3013, pruned_loss=0.08431, over 4275826.60 frames. ], batch size: 247, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:27:04,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.22 vs. limit=6.0 2023-06-23 07:27:29,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1431552.0, ans=0.125 2023-06-23 07:28:11,766 INFO [train.py:996] (0/4) Epoch 8, batch 25150, loss[loss=0.2282, simple_loss=0.3146, pruned_loss=0.07089, over 21854.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3056, pruned_loss=0.08228, over 4263575.07 frames. ], batch size: 118, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:28:34,961 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.013e+02 4.444e+02 6.518e+02 1.039e+03 2.142e+03, threshold=1.304e+03, percent-clipped=17.0 2023-06-23 07:28:35,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1431732.0, ans=0.125 2023-06-23 07:28:35,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1431732.0, ans=0.125 2023-06-23 07:28:41,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1431732.0, ans=0.125 2023-06-23 07:28:51,625 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=22.5 2023-06-23 07:29:03,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1431852.0, ans=0.035 2023-06-23 07:29:11,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1431852.0, ans=0.2 2023-06-23 07:29:48,564 INFO [train.py:996] (0/4) Epoch 8, batch 25200, loss[loss=0.1993, simple_loss=0.2722, pruned_loss=0.06321, over 21915.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3053, pruned_loss=0.07979, over 4271551.26 frames. ], batch size: 107, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:30:27,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1432092.0, ans=0.125 2023-06-23 07:30:39,602 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. 
limit=12.0 2023-06-23 07:30:46,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1432152.0, ans=0.1 2023-06-23 07:31:10,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1432212.0, ans=0.0 2023-06-23 07:31:20,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1432272.0, ans=0.125 2023-06-23 07:31:26,047 INFO [train.py:996] (0/4) Epoch 8, batch 25250, loss[loss=0.251, simple_loss=0.3039, pruned_loss=0.09908, over 21868.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3034, pruned_loss=0.07897, over 4267872.13 frames. ], batch size: 98, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:31:49,882 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.012e+02 4.456e+02 5.347e+02 9.796e+02 2.256e+03, threshold=1.069e+03, percent-clipped=12.0 2023-06-23 07:32:12,800 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-23 07:32:39,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1432512.0, ans=0.125 2023-06-23 07:32:39,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1432512.0, ans=15.0 2023-06-23 07:32:58,682 INFO [train.py:996] (0/4) Epoch 8, batch 25300, loss[loss=0.2872, simple_loss=0.3535, pruned_loss=0.1104, over 21248.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3009, pruned_loss=0.07751, over 4271713.40 frames. ], batch size: 143, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:33:39,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-23 07:33:40,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1432692.0, ans=0.0 2023-06-23 07:34:37,984 INFO [train.py:996] (0/4) Epoch 8, batch 25350, loss[loss=0.2146, simple_loss=0.2966, pruned_loss=0.0663, over 21611.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3038, pruned_loss=0.07719, over 4269866.87 frames. ], batch size: 414, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:34:50,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1432872.0, ans=0.1 2023-06-23 07:35:02,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.301e+02 4.631e+02 6.587e+02 1.003e+03 1.652e+03, threshold=1.317e+03, percent-clipped=14.0 2023-06-23 07:35:07,051 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. limit=6.0 2023-06-23 07:35:25,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1432992.0, ans=0.125 2023-06-23 07:36:14,613 INFO [train.py:996] (0/4) Epoch 8, batch 25400, loss[loss=0.2496, simple_loss=0.2994, pruned_loss=0.0999, over 21498.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2992, pruned_loss=0.07584, over 4267109.82 frames. 
], batch size: 441, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:37:14,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1433352.0, ans=0.0 2023-06-23 07:37:28,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=12.0 2023-06-23 07:37:51,436 INFO [train.py:996] (0/4) Epoch 8, batch 25450, loss[loss=0.2225, simple_loss=0.298, pruned_loss=0.07352, over 21834.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2989, pruned_loss=0.07718, over 4264797.65 frames. ], batch size: 118, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:38:17,199 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.052e+02 4.125e+02 5.251e+02 6.939e+02 1.396e+03, threshold=1.050e+03, percent-clipped=1.0 2023-06-23 07:38:58,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1433652.0, ans=0.125 2023-06-23 07:39:32,055 INFO [train.py:996] (0/4) Epoch 8, batch 25500, loss[loss=0.2183, simple_loss=0.2907, pruned_loss=0.07296, over 15789.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2989, pruned_loss=0.07394, over 4252250.96 frames. ], batch size: 62, lr: 3.65e-03, grad_scale: 8.0 2023-06-23 07:39:34,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1433772.0, ans=0.2 2023-06-23 07:39:51,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1433832.0, ans=0.125 2023-06-23 07:40:14,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1433892.0, ans=0.2 2023-06-23 07:40:28,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1433952.0, ans=0.1 2023-06-23 07:41:11,164 INFO [train.py:996] (0/4) Epoch 8, batch 25550, loss[loss=0.2804, simple_loss=0.3756, pruned_loss=0.09265, over 21573.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3071, pruned_loss=0.07537, over 4251249.58 frames. ], batch size: 471, lr: 3.65e-03, grad_scale: 8.0 2023-06-23 07:41:35,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1434132.0, ans=0.1 2023-06-23 07:41:38,529 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.786e+02 4.210e+02 5.304e+02 7.417e+02 2.336e+03, threshold=1.061e+03, percent-clipped=9.0 2023-06-23 07:42:03,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1434192.0, ans=0.125 2023-06-23 07:42:25,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1434252.0, ans=0.125 2023-06-23 07:42:26,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1434252.0, ans=0.125 2023-06-23 07:42:55,566 INFO [train.py:996] (0/4) Epoch 8, batch 25600, loss[loss=0.3295, simple_loss=0.3884, pruned_loss=0.1353, over 21477.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3125, pruned_loss=0.07704, over 4259916.36 frames. 
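The ScheduledFloat lines report hyper-parameters (dropout probabilities, skip rates, balancer limits, bypass scales) whose current value `ans` is looked up from the module's `batch_count`. A minimal way to get that behaviour is piecewise-linear interpolation over (batch_count, value) breakpoints, as sketched below; the breakpoints in the example are invented for illustration and do not correspond to any particular parameter in this run.

```python
# Minimal piecewise-linear schedule keyed on batch count, in the spirit of the
# ScheduledFloat values printed above. The breakpoints here are invented.

import bisect


class PiecewiseLinearSchedule:
    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)


if __name__ == "__main__":
    # e.g. a dropout that decays from 0.3 to 0.1 over the first 20k batches
    dropout_p = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1))
    for bc in (0.0, 10000.0, 1_425_432.0):
        print(f"batch_count={bc:,.0f} ans={dropout_p(bc):.4f}")
```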
], batch size: 471, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:43:22,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1434432.0, ans=0.125 2023-06-23 07:44:33,964 INFO [train.py:996] (0/4) Epoch 8, batch 25650, loss[loss=0.2571, simple_loss=0.311, pruned_loss=0.1016, over 21593.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3127, pruned_loss=0.08017, over 4261292.71 frames. ], batch size: 415, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:44:48,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1434732.0, ans=0.0 2023-06-23 07:44:48,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1434732.0, ans=0.2 2023-06-23 07:44:55,384 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0 2023-06-23 07:44:55,727 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.690e+02 5.926e+02 8.067e+02 1.090e+03 2.033e+03, threshold=1.613e+03, percent-clipped=28.0 2023-06-23 07:44:59,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1434732.0, ans=0.0 2023-06-23 07:46:11,855 INFO [train.py:996] (0/4) Epoch 8, batch 25700, loss[loss=0.2949, simple_loss=0.4272, pruned_loss=0.08129, over 19755.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3092, pruned_loss=0.08094, over 4255735.80 frames. ], batch size: 702, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:46:13,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1434972.0, ans=0.1 2023-06-23 07:47:38,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1435212.0, ans=0.125 2023-06-23 07:47:52,548 INFO [train.py:996] (0/4) Epoch 8, batch 25750, loss[loss=0.2958, simple_loss=0.3814, pruned_loss=0.1051, over 21721.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3151, pruned_loss=0.08356, over 4263043.49 frames. ], batch size: 332, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:48:25,290 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.333e+02 5.092e+02 6.488e+02 8.589e+02 2.442e+03, threshold=1.298e+03, percent-clipped=2.0 2023-06-23 07:48:48,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1435392.0, ans=0.5 2023-06-23 07:49:20,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1435512.0, ans=0.05 2023-06-23 07:49:23,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1435512.0, ans=0.0 2023-06-23 07:49:38,491 INFO [train.py:996] (0/4) Epoch 8, batch 25800, loss[loss=0.3015, simple_loss=0.3715, pruned_loss=0.1157, over 21389.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.328, pruned_loss=0.08839, over 4260832.88 frames. 
], batch size: 159, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:49:59,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1435572.0, ans=0.125 2023-06-23 07:50:02,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1435632.0, ans=0.125 2023-06-23 07:50:09,181 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:50:17,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1435632.0, ans=0.0 2023-06-23 07:50:28,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1435692.0, ans=0.07 2023-06-23 07:50:30,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1435692.0, ans=0.0 2023-06-23 07:50:30,863 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=15.0 2023-06-23 07:51:22,178 INFO [train.py:996] (0/4) Epoch 8, batch 25850, loss[loss=0.2332, simple_loss=0.2983, pruned_loss=0.08405, over 21550.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3291, pruned_loss=0.08791, over 4268109.44 frames. ], batch size: 211, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:51:45,490 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.179e+02 4.988e+02 6.409e+02 1.000e+03 3.081e+03, threshold=1.282e+03, percent-clipped=14.0 2023-06-23 07:52:00,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1435992.0, ans=0.0 2023-06-23 07:52:27,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1436052.0, ans=0.125 2023-06-23 07:52:46,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1436112.0, ans=0.125 2023-06-23 07:52:57,081 INFO [train.py:996] (0/4) Epoch 8, batch 25900, loss[loss=0.3152, simple_loss=0.401, pruned_loss=0.1147, over 21812.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3312, pruned_loss=0.08964, over 4274219.59 frames. ], batch size: 351, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:53:20,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0 2023-06-23 07:54:25,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1436412.0, ans=0.0 2023-06-23 07:54:33,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1436412.0, ans=0.0 2023-06-23 07:54:36,050 INFO [train.py:996] (0/4) Epoch 8, batch 25950, loss[loss=0.2812, simple_loss=0.351, pruned_loss=0.1057, over 21812.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3389, pruned_loss=0.09351, over 4272572.45 frames. 
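The Whitening lines fire when a feature-whitening diagnostic exceeds its scheduled limit (e.g. "metric=11.06 vs. limit=15.0" in the entries above). One plausible definition of such a metric is the ratio of the mean squared eigenvalue of the per-group channel covariance to the square of its mean eigenvalue: it equals 1 for a perfectly white (isotropic) covariance and grows as channels become correlated or unevenly scaled. The code below computes that ratio for a batch of activations; it is a hedged stand-in for the logged metric, not a copy of the `scaling.py` implementation, and may differ from it in detail.

```python
# A plausible whitening diagnostic: E[lambda^2] / (E[lambda])^2 over the
# eigenvalues of the channel covariance. Equals 1.0 for an isotropic ("white")
# covariance and grows otherwise. This mirrors the spirit of the logged metric;
# the real scaling.py code may differ.

import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels) activations."""
    n, c = x.shape
    assert c % num_groups == 0
    x = x.reshape(n, num_groups, c // num_groups)
    metrics = []
    for g in range(num_groups):
        feats = x[:, g, :]
        feats = feats - feats.mean(dim=0, keepdim=True)
        cov = feats.t() @ feats / n                       # (d, d) channel covariance
        d = cov.shape[0]
        mean_sq_eig = torch.trace(cov @ cov) / d          # E[lambda^2]
        mean_eig = torch.trace(cov) / d                   # E[lambda]
        metrics.append(mean_sq_eig / (mean_eig ** 2 + 1e-20))
    return float(torch.stack(metrics).mean())


if __name__ == "__main__":
    torch.manual_seed(0)
    white = torch.randn(1000, 256)                  # roughly isotropic -> metric near 1
    skewed = white * torch.linspace(0.1, 3.0, 256)  # unevenly scaled channels -> larger metric
    print(whitening_metric(white), whitening_metric(skewed))
```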
], batch size: 282, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 07:55:03,607 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.640e+02 4.822e+02 6.504e+02 9.167e+02 2.432e+03, threshold=1.301e+03, percent-clipped=14.0 2023-06-23 07:55:17,726 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.98 vs. limit=10.0 2023-06-23 07:56:14,702 INFO [train.py:996] (0/4) Epoch 8, batch 26000, loss[loss=0.2762, simple_loss=0.3552, pruned_loss=0.09861, over 21687.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3373, pruned_loss=0.09093, over 4265989.05 frames. ], batch size: 351, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 07:56:26,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1436772.0, ans=0.1 2023-06-23 07:57:20,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1436952.0, ans=0.2 2023-06-23 07:57:25,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1436952.0, ans=0.125 2023-06-23 07:57:25,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1436952.0, ans=0.0 2023-06-23 07:57:44,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.97 vs. limit=15.0 2023-06-23 07:57:57,828 INFO [train.py:996] (0/4) Epoch 8, batch 26050, loss[loss=0.288, simple_loss=0.3439, pruned_loss=0.1161, over 21810.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3376, pruned_loss=0.09319, over 4272062.67 frames. ], batch size: 441, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 07:57:59,240 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-23 07:58:19,770 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 4.589e+02 6.004e+02 7.871e+02 1.709e+03, threshold=1.201e+03, percent-clipped=5.0 2023-06-23 07:59:11,005 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:59:36,460 INFO [train.py:996] (0/4) Epoch 8, batch 26100, loss[loss=0.2395, simple_loss=0.2944, pruned_loss=0.09231, over 21350.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3315, pruned_loss=0.09316, over 4286521.31 frames. ], batch size: 176, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 07:59:41,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1437372.0, ans=0.05 2023-06-23 08:00:04,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1437432.0, ans=0.2 2023-06-23 08:00:07,607 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:00:38,027 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.22 vs. 
limit=15.0 2023-06-23 08:00:46,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1437552.0, ans=0.1 2023-06-23 08:00:46,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1437552.0, ans=0.0 2023-06-23 08:00:50,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1437552.0, ans=0.125 2023-06-23 08:00:57,121 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:01:16,868 INFO [train.py:996] (0/4) Epoch 8, batch 26150, loss[loss=0.2586, simple_loss=0.3299, pruned_loss=0.09363, over 21237.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3294, pruned_loss=0.09356, over 4292900.12 frames. ], batch size: 143, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:01:45,501 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.490e+02 4.992e+02 6.219e+02 9.688e+02 1.983e+03, threshold=1.244e+03, percent-clipped=15.0 2023-06-23 08:02:19,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1437852.0, ans=0.0 2023-06-23 08:02:51,834 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=12.0 2023-06-23 08:02:55,526 INFO [train.py:996] (0/4) Epoch 8, batch 26200, loss[loss=0.2464, simple_loss=0.3466, pruned_loss=0.0731, over 21658.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3295, pruned_loss=0.09041, over 4294862.55 frames. ], batch size: 414, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:02:59,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1437972.0, ans=0.125 2023-06-23 08:03:08,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1437972.0, ans=0.125 2023-06-23 08:04:04,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1438152.0, ans=0.0 2023-06-23 08:04:04,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1438152.0, ans=0.035 2023-06-23 08:04:34,819 INFO [train.py:996] (0/4) Epoch 8, batch 26250, loss[loss=0.2544, simple_loss=0.3247, pruned_loss=0.09206, over 21913.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3319, pruned_loss=0.08893, over 4297994.24 frames. ], batch size: 351, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:04:49,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1438272.0, ans=0.0 2023-06-23 08:04:56,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-06-23 08:05:07,708 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.778e+02 4.875e+02 6.519e+02 1.074e+03 2.423e+03, threshold=1.304e+03, percent-clipped=19.0 2023-06-23 08:06:00,508 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.35 vs. 
limit=15.0 2023-06-23 08:06:01,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1438512.0, ans=0.125 2023-06-23 08:06:12,247 INFO [train.py:996] (0/4) Epoch 8, batch 26300, loss[loss=0.2305, simple_loss=0.2985, pruned_loss=0.0813, over 21582.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3284, pruned_loss=0.08975, over 4303928.38 frames. ], batch size: 195, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:06:43,888 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-23 08:08:01,854 INFO [train.py:996] (0/4) Epoch 8, batch 26350, loss[loss=0.2829, simple_loss=0.3541, pruned_loss=0.1058, over 21400.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3268, pruned_loss=0.08991, over 4304716.87 frames. ], batch size: 548, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:08:30,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.621e+02 4.985e+02 6.232e+02 7.669e+02 1.189e+03, threshold=1.246e+03, percent-clipped=0.0 2023-06-23 08:08:59,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1439052.0, ans=0.0 2023-06-23 08:09:00,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1439052.0, ans=0.1 2023-06-23 08:09:40,199 INFO [train.py:996] (0/4) Epoch 8, batch 26400, loss[loss=0.2223, simple_loss=0.2811, pruned_loss=0.08178, over 21251.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3211, pruned_loss=0.09001, over 4284237.33 frames. ], batch size: 176, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:09:40,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1439172.0, ans=0.07 2023-06-23 08:10:21,294 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=22.5 2023-06-23 08:10:22,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1439292.0, ans=0.125 2023-06-23 08:11:20,574 INFO [train.py:996] (0/4) Epoch 8, batch 26450, loss[loss=0.2256, simple_loss=0.3263, pruned_loss=0.06244, over 19853.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3197, pruned_loss=0.08851, over 4274971.11 frames. ], batch size: 707, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:11:30,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1439472.0, ans=0.0 2023-06-23 08:11:50,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.705e+02 6.327e+02 8.779e+02 1.313e+03 2.472e+03, threshold=1.756e+03, percent-clipped=25.0 2023-06-23 08:12:04,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1439592.0, ans=0.07 2023-06-23 08:12:24,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1439652.0, ans=0.125 2023-06-23 08:13:01,284 INFO [train.py:996] (0/4) Epoch 8, batch 26500, loss[loss=0.2021, simple_loss=0.2558, pruned_loss=0.07424, over 21274.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3184, pruned_loss=0.08605, over 4272772.06 frames. 
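The `grad_scale` value printed with every summary (32.0 and 16.0 in the batch 26400 and 26450 entries above, 8.0 elsewhere in this section) is the dynamic loss-scaling factor used for fp16 training: it is cut back when a scaled step produces inf/NaN gradients and can grow again after a run of clean steps. Below is a minimal sketch of that loop using the standard `torch.cuda.amp.GradScaler`; the tiny model, optimizer and data are placeholders, not the Zipformer recipe, which may use its own scaler variant.

```python
# Sketch of the dynamic loss scaling behind the logged grad_scale values.
# The tiny model and random batch are placeholders, not the Zipformer recipe.

import torch


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    use_amp = device == "cuda"
    model = torch.nn.Linear(80, 500).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # The scale backs off on overflow and doubles every `growth_interval` clean steps.
    scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000,
                                       enabled=use_amp)
    for step in range(5):
        x = torch.randn(16, 80, device=device)
        with torch.cuda.amp.autocast(enabled=use_amp):
            loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        scaler.scale(loss).backward()   # backward on the scaled loss
        scaler.step(optimizer)          # unscales, skips the step on inf/NaN
        scaler.update()                 # adjusts grad_scale for the next step
        print(f"step {step}: grad_scale={scaler.get_scale():.1f}")


if __name__ == "__main__":
    main()
```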
], batch size: 143, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:13:03,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1439772.0, ans=0.1 2023-06-23 08:13:31,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1439832.0, ans=0.0 2023-06-23 08:13:59,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1439952.0, ans=0.125 2023-06-23 08:14:14,730 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.87 vs. limit=15.0 2023-06-23 08:14:16,871 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-240000.pt 2023-06-23 08:14:39,344 INFO [train.py:996] (0/4) Epoch 8, batch 26550, loss[loss=0.1919, simple_loss=0.2825, pruned_loss=0.05067, over 21720.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3158, pruned_loss=0.08316, over 4261307.99 frames. ], batch size: 298, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:14:40,355 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-23 08:14:51,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1440072.0, ans=0.125 2023-06-23 08:15:20,126 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.436e+02 5.263e+02 8.028e+02 1.102e+03 2.204e+03, threshold=1.606e+03, percent-clipped=5.0 2023-06-23 08:15:20,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1440132.0, ans=0.125 2023-06-23 08:15:49,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-23 08:16:23,296 INFO [train.py:996] (0/4) Epoch 8, batch 26600, loss[loss=0.2125, simple_loss=0.282, pruned_loss=0.07144, over 21255.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3163, pruned_loss=0.08085, over 4262707.01 frames. ], batch size: 131, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:16:27,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1440372.0, ans=0.125 2023-06-23 08:16:31,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1440372.0, ans=0.0 2023-06-23 08:17:15,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1440492.0, ans=0.2 2023-06-23 08:18:02,027 INFO [train.py:996] (0/4) Epoch 8, batch 26650, loss[loss=0.1885, simple_loss=0.2569, pruned_loss=0.06008, over 21403.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3104, pruned_loss=0.07974, over 4263405.26 frames. ], batch size: 131, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:18:22,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.13 vs. 
limit=15.0 2023-06-23 08:18:36,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.026e+02 4.294e+02 5.616e+02 7.721e+02 1.631e+03, threshold=1.123e+03, percent-clipped=1.0 2023-06-23 08:19:32,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1440912.0, ans=0.125 2023-06-23 08:19:39,924 INFO [train.py:996] (0/4) Epoch 8, batch 26700, loss[loss=0.267, simple_loss=0.336, pruned_loss=0.09901, over 21875.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3026, pruned_loss=0.07675, over 4266016.06 frames. ], batch size: 107, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:19:48,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1440972.0, ans=0.125 2023-06-23 08:19:58,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1440972.0, ans=0.1 2023-06-23 08:21:25,380 INFO [train.py:996] (0/4) Epoch 8, batch 26750, loss[loss=0.3018, simple_loss=0.3693, pruned_loss=0.1172, over 21406.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3033, pruned_loss=0.07611, over 4275765.19 frames. ], batch size: 131, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:21:32,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1441272.0, ans=0.0 2023-06-23 08:21:56,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.690e+02 4.314e+02 5.876e+02 8.992e+02 1.662e+03, threshold=1.175e+03, percent-clipped=13.0 2023-06-23 08:22:04,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-23 08:22:24,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1441452.0, ans=0.125 2023-06-23 08:22:29,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1441452.0, ans=0.0 2023-06-23 08:23:11,024 INFO [train.py:996] (0/4) Epoch 8, batch 26800, loss[loss=0.2326, simple_loss=0.3092, pruned_loss=0.07795, over 21857.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3113, pruned_loss=0.0804, over 4275352.86 frames. 
], batch size: 247, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:23:26,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1441632.0, ans=0.0 2023-06-23 08:23:30,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1441632.0, ans=0.125 2023-06-23 08:23:34,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1441632.0, ans=0.2 2023-06-23 08:23:37,971 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:24:01,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1441692.0, ans=0.125 2023-06-23 08:24:16,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1441752.0, ans=0.125 2023-06-23 08:24:49,806 INFO [train.py:996] (0/4) Epoch 8, batch 26850, loss[loss=0.2378, simple_loss=0.2948, pruned_loss=0.09046, over 20682.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3127, pruned_loss=0.08281, over 4276137.34 frames. ], batch size: 607, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:24:59,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1441872.0, ans=0.0 2023-06-23 08:25:13,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1441932.0, ans=0.0 2023-06-23 08:25:14,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.645e+02 5.050e+02 6.196e+02 9.210e+02 1.737e+03, threshold=1.239e+03, percent-clipped=8.0 2023-06-23 08:25:17,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1441932.0, ans=0.2 2023-06-23 08:26:22,998 INFO [train.py:996] (0/4) Epoch 8, batch 26900, loss[loss=0.2437, simple_loss=0.3466, pruned_loss=0.07044, over 19878.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3063, pruned_loss=0.08225, over 4276542.97 frames. ], batch size: 702, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:26:29,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1442172.0, ans=0.1 2023-06-23 08:26:31,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1442172.0, ans=0.125 2023-06-23 08:27:18,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1442292.0, ans=0.1 2023-06-23 08:28:02,567 INFO [train.py:996] (0/4) Epoch 8, batch 26950, loss[loss=0.226, simple_loss=0.3096, pruned_loss=0.07119, over 21567.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3045, pruned_loss=0.08205, over 4271368.13 frames. 
], batch size: 230, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:28:26,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1442532.0, ans=0.1 2023-06-23 08:28:33,463 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.377e+02 4.817e+02 6.890e+02 1.132e+03 2.322e+03, threshold=1.378e+03, percent-clipped=18.0 2023-06-23 08:29:31,702 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.99 vs. limit=10.0 2023-06-23 08:29:35,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1442712.0, ans=0.035 2023-06-23 08:29:37,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1442712.0, ans=0.0 2023-06-23 08:29:46,691 INFO [train.py:996] (0/4) Epoch 8, batch 27000, loss[loss=0.2638, simple_loss=0.3496, pruned_loss=0.08895, over 21611.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3054, pruned_loss=0.0801, over 4266121.47 frames. ], batch size: 442, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:29:46,692 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 08:30:02,864 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2419, simple_loss=0.3397, pruned_loss=0.07206, over 1796401.00 frames. 2023-06-23 08:30:02,865 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 08:31:39,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1443012.0, ans=0.1 2023-06-23 08:31:42,043 INFO [train.py:996] (0/4) Epoch 8, batch 27050, loss[loss=0.2892, simple_loss=0.3554, pruned_loss=0.1115, over 21758.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3083, pruned_loss=0.07685, over 4269958.33 frames. ], batch size: 441, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:32:18,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.89 vs. limit=15.0 2023-06-23 08:32:18,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.931e+02 4.261e+02 5.762e+02 7.370e+02 1.710e+03, threshold=1.152e+03, percent-clipped=3.0 2023-06-23 08:32:22,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1443192.0, ans=0.2 2023-06-23 08:32:44,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1443252.0, ans=0.125 2023-06-23 08:33:14,996 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:33:20,870 INFO [train.py:996] (0/4) Epoch 8, batch 27100, loss[loss=0.2922, simple_loss=0.3619, pruned_loss=0.1113, over 21568.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3111, pruned_loss=0.07788, over 4280276.87 frames. ], batch size: 471, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:34:14,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.32 vs. 
limit=15.0 2023-06-23 08:34:49,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1443612.0, ans=0.0 2023-06-23 08:34:54,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.90 vs. limit=10.0 2023-06-23 08:35:01,739 INFO [train.py:996] (0/4) Epoch 8, batch 27150, loss[loss=0.2926, simple_loss=0.3861, pruned_loss=0.09956, over 21672.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3212, pruned_loss=0.08126, over 4283456.64 frames. ], batch size: 414, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:35:17,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1443672.0, ans=0.125 2023-06-23 08:35:28,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. limit=10.0 2023-06-23 08:35:43,350 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 5.713e+02 7.787e+02 1.225e+03 2.393e+03, threshold=1.557e+03, percent-clipped=28.0 2023-06-23 08:35:45,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1443732.0, ans=0.125 2023-06-23 08:35:47,055 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:36:12,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1443852.0, ans=0.125 2023-06-23 08:36:21,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1443852.0, ans=0.95 2023-06-23 08:36:46,654 INFO [train.py:996] (0/4) Epoch 8, batch 27200, loss[loss=0.287, simple_loss=0.3504, pruned_loss=0.1118, over 21390.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3305, pruned_loss=0.0845, over 4288602.21 frames. ], batch size: 211, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:37:00,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1443972.0, ans=0.0 2023-06-23 08:37:46,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1444092.0, ans=0.125 2023-06-23 08:38:01,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1444152.0, ans=0.0 2023-06-23 08:38:36,526 INFO [train.py:996] (0/4) Epoch 8, batch 27250, loss[loss=0.3, simple_loss=0.3603, pruned_loss=0.1199, over 21586.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3358, pruned_loss=0.0898, over 4286771.51 frames. ], batch size: 415, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:39:06,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1444332.0, ans=0.0 2023-06-23 08:39:09,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.379e+02 5.510e+02 6.974e+02 9.879e+02 1.721e+03, threshold=1.395e+03, percent-clipped=1.0 2023-06-23 08:39:26,809 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. 
limit=6.0 2023-06-23 08:40:17,646 INFO [train.py:996] (0/4) Epoch 8, batch 27300, loss[loss=0.236, simple_loss=0.3223, pruned_loss=0.07486, over 21680.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3373, pruned_loss=0.09054, over 4283940.02 frames. ], batch size: 263, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:40:24,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1444572.0, ans=0.0 2023-06-23 08:40:26,796 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=12.0 2023-06-23 08:40:37,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1444632.0, ans=0.125 2023-06-23 08:41:09,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1444692.0, ans=0.2 2023-06-23 08:41:17,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=15.0 2023-06-23 08:41:34,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1444752.0, ans=0.07 2023-06-23 08:41:48,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1444812.0, ans=0.035 2023-06-23 08:41:51,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1444812.0, ans=0.125 2023-06-23 08:41:57,274 INFO [train.py:996] (0/4) Epoch 8, batch 27350, loss[loss=0.2528, simple_loss=0.3342, pruned_loss=0.08572, over 21246.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3398, pruned_loss=0.0909, over 4271995.13 frames. ], batch size: 176, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:42:06,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1444872.0, ans=0.1 2023-06-23 08:42:28,134 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.496e+02 4.684e+02 5.886e+02 7.664e+02 1.698e+03, threshold=1.177e+03, percent-clipped=3.0 2023-06-23 08:42:33,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1444992.0, ans=0.1 2023-06-23 08:42:34,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1444992.0, ans=0.125 2023-06-23 08:43:03,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1445052.0, ans=0.125 2023-06-23 08:43:28,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1445112.0, ans=0.125 2023-06-23 08:43:28,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1445112.0, ans=0.125 2023-06-23 08:43:34,003 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-23 08:43:40,294 INFO [train.py:996] (0/4) Epoch 8, batch 27400, loss[loss=0.2141, simple_loss=0.2746, pruned_loss=0.07675, over 21572.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3336, pruned_loss=0.0892, over 4267757.55 frames. 
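At regular intervals the training loop switches to the dev set, logs a frame-weighted validation loss such as the "validation: loss=0.2419, simple_loss=0.3397, pruned_loss=0.07206, over 1796401.00 frames" entry at batch 27000 above, and reports the peak GPU memory via `torch.cuda.max_memory_allocated()`. The loop below is a generic sketch of that pattern; the model, dataloader and loss function are placeholders, not the actual recipe.

```python
# Generic sketch of the periodic validation pass and peak-memory report.
# model / dataloader / loss_fn below are placeholders, not the Zipformer recipe.

import torch


@torch.no_grad()
def compute_validation_loss(model, dataloader, loss_fn, device="cpu"):
    model.eval()
    total_loss, total_frames = 0.0, 0.0
    for feats, targets, num_frames in dataloader:
        loss = loss_fn(model(feats.to(device)), targets.to(device))
        total_loss += float(loss) * num_frames   # frame-weighted accumulation
        total_frames += num_frames
    model.train()
    return total_loss / max(total_frames, 1.0), total_frames


if __name__ == "__main__":
    model = torch.nn.Linear(80, 10)
    loss_fn = torch.nn.MSELoss()
    fake_batches = [(torch.randn(8, 80), torch.randn(8, 10), 8 * 100) for _ in range(3)]
    avg_loss, frames = compute_validation_loss(model, fake_batches, loss_fn)
    print(f"validation: loss={avg_loss:.4f}, over {frames:.2f} frames.")
    if torch.cuda.is_available():
        mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
        print(f"Maximum memory allocated so far is {mb}MB")
```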
], batch size: 263, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:43:40,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1445172.0, ans=0.2 2023-06-23 08:44:58,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1445412.0, ans=0.125 2023-06-23 08:45:19,324 INFO [train.py:996] (0/4) Epoch 8, batch 27450, loss[loss=0.237, simple_loss=0.3239, pruned_loss=0.07504, over 21894.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3257, pruned_loss=0.08665, over 4269218.55 frames. ], batch size: 372, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:45:33,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1445532.0, ans=0.125 2023-06-23 08:45:50,316 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.374e+02 5.145e+02 6.858e+02 8.934e+02 1.227e+03, threshold=1.372e+03, percent-clipped=2.0 2023-06-23 08:46:55,707 INFO [train.py:996] (0/4) Epoch 8, batch 27500, loss[loss=0.2376, simple_loss=0.3057, pruned_loss=0.08476, over 21250.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3236, pruned_loss=0.08714, over 4276078.24 frames. ], batch size: 143, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:46:56,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1445772.0, ans=0.0 2023-06-23 08:47:55,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1445892.0, ans=0.2 2023-06-23 08:48:02,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1445952.0, ans=0.1 2023-06-23 08:48:02,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1445952.0, ans=0.0 2023-06-23 08:48:20,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1446012.0, ans=0.1 2023-06-23 08:48:34,523 INFO [train.py:996] (0/4) Epoch 8, batch 27550, loss[loss=0.2155, simple_loss=0.279, pruned_loss=0.07595, over 21350.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3175, pruned_loss=0.08369, over 4275822.59 frames. ], batch size: 211, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:48:39,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1446072.0, ans=0.125 2023-06-23 08:48:49,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1446132.0, ans=0.1 2023-06-23 08:49:06,592 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.272e+02 4.176e+02 5.018e+02 7.145e+02 2.103e+03, threshold=1.004e+03, percent-clipped=5.0 2023-06-23 08:49:13,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1446192.0, ans=0.1 2023-06-23 08:49:37,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1446252.0, ans=0.125 2023-06-23 08:50:07,977 INFO [train.py:996] (0/4) Epoch 8, batch 27600, loss[loss=0.2853, simple_loss=0.3511, pruned_loss=0.1098, over 20072.00 frames. 
], tot_loss[loss=0.2388, simple_loss=0.3112, pruned_loss=0.08319, over 4276454.98 frames. ], batch size: 702, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:51:05,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1446492.0, ans=0.035 2023-06-23 08:51:45,627 INFO [train.py:996] (0/4) Epoch 8, batch 27650, loss[loss=0.258, simple_loss=0.3144, pruned_loss=0.1008, over 21290.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3051, pruned_loss=0.0826, over 4274845.36 frames. ], batch size: 176, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:52:19,290 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.399e+02 4.844e+02 6.403e+02 8.598e+02 1.573e+03, threshold=1.281e+03, percent-clipped=18.0 2023-06-23 08:52:54,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1446852.0, ans=0.0 2023-06-23 08:53:21,706 INFO [train.py:996] (0/4) Epoch 8, batch 27700, loss[loss=0.2253, simple_loss=0.3108, pruned_loss=0.06989, over 21784.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3083, pruned_loss=0.08192, over 4281036.16 frames. ], batch size: 332, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 08:54:14,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1447092.0, ans=0.125 2023-06-23 08:54:15,538 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-23 08:54:18,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1447092.0, ans=0.0 2023-06-23 08:54:37,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1447152.0, ans=0.5 2023-06-23 08:55:00,523 INFO [train.py:996] (0/4) Epoch 8, batch 27750, loss[loss=0.2258, simple_loss=0.3092, pruned_loss=0.0712, over 21493.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3098, pruned_loss=0.08066, over 4282993.96 frames. ], batch size: 211, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 08:55:16,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-23 08:55:32,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.317e+02 5.055e+02 6.711e+02 8.629e+02 1.749e+03, threshold=1.342e+03, percent-clipped=9.0 2023-06-23 08:55:36,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1447392.0, ans=0.125 2023-06-23 08:56:14,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1447512.0, ans=0.125 2023-06-23 08:56:18,529 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-23 08:56:35,742 INFO [train.py:996] (0/4) Epoch 8, batch 27800, loss[loss=0.239, simple_loss=0.3174, pruned_loss=0.08028, over 21801.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3071, pruned_loss=0.0801, over 4286349.51 frames. ], batch size: 112, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 08:56:40,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.45 vs. 
limit=10.0 2023-06-23 08:56:46,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1447572.0, ans=0.1 2023-06-23 08:57:37,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-23 08:57:41,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1447752.0, ans=0.125 2023-06-23 08:57:46,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1447752.0, ans=0.125 2023-06-23 08:58:15,621 INFO [train.py:996] (0/4) Epoch 8, batch 27850, loss[loss=0.247, simple_loss=0.3358, pruned_loss=0.07905, over 21819.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3072, pruned_loss=0.08163, over 4287513.24 frames. ], batch size: 332, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 08:58:50,991 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 4.347e+02 5.210e+02 6.936e+02 1.592e+03, threshold=1.042e+03, percent-clipped=2.0 2023-06-23 08:59:57,466 INFO [train.py:996] (0/4) Epoch 8, batch 27900, loss[loss=0.3335, simple_loss=0.4075, pruned_loss=0.1298, over 21514.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.318, pruned_loss=0.08424, over 4287003.28 frames. ], batch size: 471, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:00:01,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1448172.0, ans=0.2 2023-06-23 09:00:13,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1448232.0, ans=0.125 2023-06-23 09:00:29,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1448232.0, ans=0.5 2023-06-23 09:00:41,597 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-23 09:01:34,345 INFO [train.py:996] (0/4) Epoch 8, batch 27950, loss[loss=0.2069, simple_loss=0.3011, pruned_loss=0.05632, over 21722.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.316, pruned_loss=0.07932, over 4276179.37 frames. ], batch size: 247, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:01:56,137 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-23 09:02:08,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.302e+02 4.594e+02 6.671e+02 9.534e+02 1.876e+03, threshold=1.334e+03, percent-clipped=19.0 2023-06-23 09:02:32,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-23 09:03:08,011 INFO [train.py:996] (0/4) Epoch 8, batch 28000, loss[loss=0.2973, simple_loss=0.3553, pruned_loss=0.1197, over 21658.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3123, pruned_loss=0.07729, over 4271776.23 frames. 
], batch size: 471, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:03:15,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1448772.0, ans=0.125 2023-06-23 09:04:01,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1448892.0, ans=0.125 2023-06-23 09:04:31,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1449012.0, ans=0.125 2023-06-23 09:04:44,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1449012.0, ans=0.0 2023-06-23 09:04:52,719 INFO [train.py:996] (0/4) Epoch 8, batch 28050, loss[loss=0.2072, simple_loss=0.2837, pruned_loss=0.06533, over 21807.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3099, pruned_loss=0.07875, over 4273017.58 frames. ], batch size: 282, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:05:14,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1449132.0, ans=0.04949747468305833 2023-06-23 09:05:26,544 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.011e+02 4.954e+02 6.052e+02 8.048e+02 2.120e+03, threshold=1.210e+03, percent-clipped=2.0 2023-06-23 09:06:07,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1449312.0, ans=0.025 2023-06-23 09:06:08,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1449312.0, ans=0.0 2023-06-23 09:06:27,356 INFO [train.py:996] (0/4) Epoch 8, batch 28100, loss[loss=0.2181, simple_loss=0.2829, pruned_loss=0.07667, over 21778.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3096, pruned_loss=0.07936, over 4273350.87 frames. ], batch size: 118, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:07:31,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1449552.0, ans=0.125 2023-06-23 09:08:01,921 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:08:03,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1449612.0, ans=0.125 2023-06-23 09:08:05,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1449672.0, ans=0.0 2023-06-23 09:08:06,377 INFO [train.py:996] (0/4) Epoch 8, batch 28150, loss[loss=0.1817, simple_loss=0.2428, pruned_loss=0.06034, over 21459.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3037, pruned_loss=0.07986, over 4263232.71 frames. 
], batch size: 212, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:08:26,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1449732.0, ans=0.125 2023-06-23 09:08:39,169 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.413e+02 4.985e+02 7.502e+02 1.116e+03 2.390e+03, threshold=1.500e+03, percent-clipped=18.0 2023-06-23 09:08:47,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1449792.0, ans=0.0 2023-06-23 09:09:30,790 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.72 vs. limit=6.0 2023-06-23 09:09:43,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1449972.0, ans=0.125 2023-06-23 09:09:44,405 INFO [train.py:996] (0/4) Epoch 8, batch 28200, loss[loss=0.2554, simple_loss=0.3186, pruned_loss=0.09605, over 21439.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3023, pruned_loss=0.08048, over 4265052.19 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:10:21,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1450092.0, ans=0.07 2023-06-23 09:10:54,632 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=15.0 2023-06-23 09:11:06,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1450212.0, ans=0.1 2023-06-23 09:11:27,091 INFO [train.py:996] (0/4) Epoch 8, batch 28250, loss[loss=0.2387, simple_loss=0.2953, pruned_loss=0.09102, over 21507.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3063, pruned_loss=0.08351, over 4262812.55 frames. ], batch size: 441, lr: 3.63e-03, grad_scale: 8.0 2023-06-23 09:11:32,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1450272.0, ans=0.2 2023-06-23 09:11:39,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1450272.0, ans=0.04949747468305833 2023-06-23 09:11:54,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1450332.0, ans=0.0 2023-06-23 09:12:04,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.687e+02 5.319e+02 7.100e+02 8.712e+02 1.908e+03, threshold=1.420e+03, percent-clipped=3.0 2023-06-23 09:12:42,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1450452.0, ans=15.0 2023-06-23 09:13:06,478 INFO [train.py:996] (0/4) Epoch 8, batch 28300, loss[loss=0.1951, simple_loss=0.2808, pruned_loss=0.05469, over 21586.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3031, pruned_loss=0.08136, over 4260154.28 frames. ], batch size: 230, lr: 3.63e-03, grad_scale: 8.0 2023-06-23 09:13:31,779 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-23 09:13:38,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. 
limit=22.5 2023-06-23 09:14:45,015 INFO [train.py:996] (0/4) Epoch 8, batch 28350, loss[loss=0.2136, simple_loss=0.3404, pruned_loss=0.04338, over 20813.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2987, pruned_loss=0.075, over 4263290.17 frames. ], batch size: 607, lr: 3.63e-03, grad_scale: 8.0 2023-06-23 09:15:14,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1450932.0, ans=0.0 2023-06-23 09:15:18,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1450932.0, ans=0.2 2023-06-23 09:15:21,056 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.828e+02 5.599e+02 8.860e+02 1.294e+03 2.489e+03, threshold=1.772e+03, percent-clipped=23.0 2023-06-23 09:15:32,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1450992.0, ans=0.0 2023-06-23 09:16:23,430 INFO [train.py:996] (0/4) Epoch 8, batch 28400, loss[loss=0.2691, simple_loss=0.3298, pruned_loss=0.1042, over 21762.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2951, pruned_loss=0.07563, over 4264738.63 frames. ], batch size: 118, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:16:31,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1451172.0, ans=0.1 2023-06-23 09:16:36,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.28 vs. limit=10.0 2023-06-23 09:16:37,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1451172.0, ans=0.125 2023-06-23 09:17:24,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1451292.0, ans=0.1 2023-06-23 09:17:46,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1451412.0, ans=0.0 2023-06-23 09:17:59,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1451412.0, ans=0.025 2023-06-23 09:18:03,294 INFO [train.py:996] (0/4) Epoch 8, batch 28450, loss[loss=0.2758, simple_loss=0.3289, pruned_loss=0.1114, over 21329.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3023, pruned_loss=0.08043, over 4265215.81 frames. ], batch size: 159, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:18:09,888 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2023-06-23 09:18:41,083 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.722e+02 5.622e+02 7.842e+02 1.295e+03 2.358e+03, threshold=1.568e+03, percent-clipped=7.0 2023-06-23 09:18:54,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=12.0 2023-06-23 09:19:10,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. 
limit=12.0 2023-06-23 09:19:28,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1451712.0, ans=0.125 2023-06-23 09:19:39,561 INFO [train.py:996] (0/4) Epoch 8, batch 28500, loss[loss=0.2243, simple_loss=0.2902, pruned_loss=0.07914, over 21937.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3054, pruned_loss=0.08308, over 4275376.86 frames. ], batch size: 351, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:19:44,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1451772.0, ans=0.125 2023-06-23 09:20:04,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1451832.0, ans=0.1 2023-06-23 09:20:07,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.59 vs. limit=5.0 2023-06-23 09:21:10,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.19 vs. limit=8.0 2023-06-23 09:21:16,162 INFO [train.py:996] (0/4) Epoch 8, batch 28550, loss[loss=0.2822, simple_loss=0.3803, pruned_loss=0.09204, over 21695.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3127, pruned_loss=0.08532, over 4276089.15 frames. ], batch size: 298, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:21:37,331 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:22:02,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.502e+02 4.663e+02 6.262e+02 9.623e+02 1.798e+03, threshold=1.252e+03, percent-clipped=1.0 2023-06-23 09:22:34,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.42 vs. limit=15.0 2023-06-23 09:23:02,388 INFO [train.py:996] (0/4) Epoch 8, batch 28600, loss[loss=0.3291, simple_loss=0.3805, pruned_loss=0.1389, over 21402.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3217, pruned_loss=0.08829, over 4276401.46 frames. ], batch size: 471, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:23:25,046 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-23 09:23:33,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1452432.0, ans=0.1 2023-06-23 09:23:40,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1452432.0, ans=0.125 2023-06-23 09:23:42,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-23 09:24:41,384 INFO [train.py:996] (0/4) Epoch 8, batch 28650, loss[loss=0.1934, simple_loss=0.26, pruned_loss=0.06343, over 21111.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3152, pruned_loss=0.08664, over 4268810.30 frames. 
], batch size: 176, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:25:07,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1452732.0, ans=0.1 2023-06-23 09:25:15,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1452732.0, ans=0.125 2023-06-23 09:25:23,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.145e+02 4.493e+02 5.758e+02 7.930e+02 1.580e+03, threshold=1.152e+03, percent-clipped=4.0 2023-06-23 09:25:26,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1452792.0, ans=0.2 2023-06-23 09:25:43,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-23 09:25:49,937 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:26:13,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1452912.0, ans=0.125 2023-06-23 09:26:16,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1452912.0, ans=0.1 2023-06-23 09:26:19,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-23 09:26:26,175 INFO [train.py:996] (0/4) Epoch 8, batch 28700, loss[loss=0.2088, simple_loss=0.2724, pruned_loss=0.07257, over 21278.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3125, pruned_loss=0.0871, over 4272316.84 frames. ], batch size: 549, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:26:31,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1452972.0, ans=0.0 2023-06-23 09:26:34,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1452972.0, ans=0.125 2023-06-23 09:27:09,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1453092.0, ans=0.0 2023-06-23 09:27:38,661 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2023-06-23 09:28:02,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1453212.0, ans=0.125 2023-06-23 09:28:05,324 INFO [train.py:996] (0/4) Epoch 8, batch 28750, loss[loss=0.2412, simple_loss=0.3151, pruned_loss=0.08371, over 21419.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3136, pruned_loss=0.08805, over 4274341.84 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:28:26,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-23 09:28:28,385 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. 
limit=6.0 2023-06-23 09:28:39,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1453332.0, ans=0.125 2023-06-23 09:28:41,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.805e+02 4.984e+02 6.274e+02 9.092e+02 1.737e+03, threshold=1.255e+03, percent-clipped=10.0 2023-06-23 09:29:36,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-23 09:29:49,452 INFO [train.py:996] (0/4) Epoch 8, batch 28800, loss[loss=0.2466, simple_loss=0.3232, pruned_loss=0.08502, over 21326.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.318, pruned_loss=0.08864, over 4271502.79 frames. ], batch size: 548, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:29:58,401 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:30:41,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1453752.0, ans=0.04949747468305833 2023-06-23 09:30:56,654 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=22.5 2023-06-23 09:31:29,198 INFO [train.py:996] (0/4) Epoch 8, batch 28850, loss[loss=0.21, simple_loss=0.275, pruned_loss=0.07249, over 20992.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3203, pruned_loss=0.09019, over 4279832.53 frames. ], batch size: 607, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:32:03,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.517e+02 4.921e+02 6.393e+02 7.769e+02 1.909e+03, threshold=1.279e+03, percent-clipped=3.0 2023-06-23 09:32:52,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1454112.0, ans=0.1 2023-06-23 09:32:57,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1454112.0, ans=0.1 2023-06-23 09:33:11,072 INFO [train.py:996] (0/4) Epoch 8, batch 28900, loss[loss=0.2484, simple_loss=0.3197, pruned_loss=0.08849, over 21372.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3227, pruned_loss=0.09201, over 4284188.17 frames. ], batch size: 548, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:33:53,080 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-23 09:34:26,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1454352.0, ans=0.125 2023-06-23 09:34:49,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1454412.0, ans=0.125 2023-06-23 09:34:52,396 INFO [train.py:996] (0/4) Epoch 8, batch 28950, loss[loss=0.2396, simple_loss=0.3359, pruned_loss=0.07162, over 21832.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3225, pruned_loss=0.09036, over 4281679.84 frames. 
], batch size: 371, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:35:41,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.677e+02 4.837e+02 6.969e+02 9.888e+02 2.996e+03, threshold=1.394e+03, percent-clipped=10.0 2023-06-23 09:36:13,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.34 vs. limit=10.0 2023-06-23 09:36:19,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1454712.0, ans=0.0 2023-06-23 09:36:32,718 INFO [train.py:996] (0/4) Epoch 8, batch 29000, loss[loss=0.2412, simple_loss=0.3178, pruned_loss=0.08224, over 21820.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3249, pruned_loss=0.08832, over 4274806.45 frames. ], batch size: 247, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:37:35,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1454892.0, ans=0.0 2023-06-23 09:38:03,402 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.11 vs. limit=10.0 2023-06-23 09:38:10,316 INFO [train.py:996] (0/4) Epoch 8, batch 29050, loss[loss=0.2721, simple_loss=0.3334, pruned_loss=0.1054, over 21835.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3232, pruned_loss=0.0895, over 4284806.82 frames. ], batch size: 441, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:38:48,671 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-23 09:39:01,861 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.533e+02 4.892e+02 6.390e+02 8.567e+02 1.270e+03, threshold=1.278e+03, percent-clipped=0.0 2023-06-23 09:39:17,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1455252.0, ans=0.0 2023-06-23 09:39:48,133 INFO [train.py:996] (0/4) Epoch 8, batch 29100, loss[loss=0.2159, simple_loss=0.2786, pruned_loss=0.07656, over 21763.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3144, pruned_loss=0.08706, over 4286836.03 frames. ], batch size: 112, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:40:01,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1455372.0, ans=0.2 2023-06-23 09:40:57,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1455552.0, ans=0.0 2023-06-23 09:40:59,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1455552.0, ans=0.125 2023-06-23 09:41:00,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1455552.0, ans=0.1 2023-06-23 09:41:05,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.44 vs. 
limit=15.0 2023-06-23 09:41:17,804 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:41:17,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1455612.0, ans=0.09899494936611666 2023-06-23 09:41:26,876 INFO [train.py:996] (0/4) Epoch 8, batch 29150, loss[loss=0.2203, simple_loss=0.3154, pruned_loss=0.06258, over 21398.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3134, pruned_loss=0.08505, over 4271571.31 frames. ], batch size: 194, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:42:10,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-23 09:42:14,906 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.362e+02 4.552e+02 5.999e+02 9.339e+02 2.396e+03, threshold=1.200e+03, percent-clipped=6.0 2023-06-23 09:42:17,751 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=15.0 2023-06-23 09:42:26,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1455792.0, ans=0.125 2023-06-23 09:42:34,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1455852.0, ans=0.125 2023-06-23 09:42:43,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=12.0 2023-06-23 09:43:01,041 INFO [train.py:996] (0/4) Epoch 8, batch 29200, loss[loss=0.1916, simple_loss=0.2558, pruned_loss=0.06372, over 21746.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3086, pruned_loss=0.08424, over 4266446.08 frames. ], batch size: 283, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:43:51,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1456092.0, ans=0.0 2023-06-23 09:44:09,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1456152.0, ans=0.2 2023-06-23 09:44:11,709 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=15.0 2023-06-23 09:44:25,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1456212.0, ans=0.0 2023-06-23 09:44:50,182 INFO [train.py:996] (0/4) Epoch 8, batch 29250, loss[loss=0.1833, simple_loss=0.25, pruned_loss=0.05829, over 17377.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3077, pruned_loss=0.08189, over 4262609.25 frames. ], batch size: 67, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:45:33,755 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.540e+02 4.714e+02 6.019e+02 9.609e+02 2.170e+03, threshold=1.204e+03, percent-clipped=18.0 2023-06-23 09:45:55,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1456452.0, ans=0.0 2023-06-23 09:46:33,663 INFO [train.py:996] (0/4) Epoch 8, batch 29300, loss[loss=0.2758, simple_loss=0.3284, pruned_loss=0.1116, over 21290.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3097, pruned_loss=0.08132, over 4270754.75 frames. 
], batch size: 471, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:46:35,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1456572.0, ans=0.1 2023-06-23 09:46:41,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1456572.0, ans=0.0 2023-06-23 09:46:42,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-23 09:47:01,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1456632.0, ans=0.125 2023-06-23 09:48:17,867 INFO [train.py:996] (0/4) Epoch 8, batch 29350, loss[loss=0.2429, simple_loss=0.3232, pruned_loss=0.08132, over 21743.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3077, pruned_loss=0.08114, over 4267239.37 frames. ], batch size: 333, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:48:52,800 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.303e+02 4.822e+02 6.215e+02 9.294e+02 1.604e+03, threshold=1.243e+03, percent-clipped=12.0 2023-06-23 09:49:06,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1457052.0, ans=0.1 2023-06-23 09:49:56,835 INFO [train.py:996] (0/4) Epoch 8, batch 29400, loss[loss=0.2369, simple_loss=0.3389, pruned_loss=0.06747, over 20767.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3071, pruned_loss=0.07858, over 4259569.99 frames. ], batch size: 608, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:50:04,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1457172.0, ans=10.0 2023-06-23 09:50:05,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1457172.0, ans=0.05 2023-06-23 09:50:10,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1457172.0, ans=15.0 2023-06-23 09:50:18,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1457232.0, ans=0.04949747468305833 2023-06-23 09:50:26,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1457232.0, ans=0.125 2023-06-23 09:50:33,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1457292.0, ans=0.125 2023-06-23 09:50:48,544 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-23 09:51:34,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1457472.0, ans=0.0 2023-06-23 09:51:35,728 INFO [train.py:996] (0/4) Epoch 8, batch 29450, loss[loss=0.2207, simple_loss=0.3143, pruned_loss=0.06353, over 20768.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.305, pruned_loss=0.07728, over 4265775.46 frames. 
], batch size: 607, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:52:02,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1457532.0, ans=0.125 2023-06-23 09:52:08,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-23 09:52:11,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.254e+02 6.189e+02 1.171e+03 1.650e+03 2.483e+03, threshold=2.343e+03, percent-clipped=48.0 2023-06-23 09:52:15,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1457592.0, ans=0.2 2023-06-23 09:52:24,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1457652.0, ans=0.05 2023-06-23 09:53:12,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.95 vs. limit=10.0 2023-06-23 09:53:14,235 INFO [train.py:996] (0/4) Epoch 8, batch 29500, loss[loss=0.2665, simple_loss=0.3335, pruned_loss=0.09977, over 21388.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3108, pruned_loss=0.08168, over 4274151.25 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:53:46,014 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2023-06-23 09:53:48,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1457892.0, ans=0.125 2023-06-23 09:54:42,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1458012.0, ans=0.0 2023-06-23 09:54:43,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1458012.0, ans=0.1 2023-06-23 09:54:52,639 INFO [train.py:996] (0/4) Epoch 8, batch 29550, loss[loss=0.2702, simple_loss=0.3394, pruned_loss=0.1005, over 21811.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.31, pruned_loss=0.08308, over 4283198.79 frames. ], batch size: 112, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:55:28,675 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.866e+02 5.921e+02 7.933e+02 1.124e+03 2.184e+03, threshold=1.587e+03, percent-clipped=0.0 2023-06-23 09:55:38,314 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-23 09:56:27,430 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:56:28,483 INFO [train.py:996] (0/4) Epoch 8, batch 29600, loss[loss=0.2975, simple_loss=0.3947, pruned_loss=0.1002, over 21275.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3161, pruned_loss=0.08555, over 4287375.19 frames. 
], batch size: 548, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:56:30,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1458372.0, ans=0.0 2023-06-23 09:56:46,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1458432.0, ans=0.0 2023-06-23 09:57:43,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1458552.0, ans=0.125 2023-06-23 09:58:02,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1458612.0, ans=0.0 2023-06-23 09:58:06,728 INFO [train.py:996] (0/4) Epoch 8, batch 29650, loss[loss=0.2403, simple_loss=0.3163, pruned_loss=0.08216, over 21645.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3137, pruned_loss=0.08219, over 4278921.95 frames. ], batch size: 441, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:58:08,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=12.0 2023-06-23 09:58:21,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1458732.0, ans=0.0 2023-06-23 09:58:46,222 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.053e+02 5.356e+02 7.008e+02 1.123e+03 3.687e+03, threshold=1.402e+03, percent-clipped=10.0 2023-06-23 09:58:46,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1458792.0, ans=0.0 2023-06-23 09:58:46,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1458792.0, ans=0.125 2023-06-23 09:59:19,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1458852.0, ans=0.125 2023-06-23 09:59:45,985 INFO [train.py:996] (0/4) Epoch 8, batch 29700, loss[loss=0.2713, simple_loss=0.3696, pruned_loss=0.08647, over 21642.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3147, pruned_loss=0.0819, over 4282851.28 frames. ], batch size: 263, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 10:00:06,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1459032.0, ans=10.0 2023-06-23 10:00:55,375 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-23 10:01:15,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1459212.0, ans=0.1 2023-06-23 10:01:17,331 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:01:19,994 INFO [train.py:996] (0/4) Epoch 8, batch 29750, loss[loss=0.1904, simple_loss=0.2744, pruned_loss=0.05318, over 21446.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3193, pruned_loss=0.08177, over 4277363.79 frames. 
], batch size: 131, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 10:01:23,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1459272.0, ans=0.125 2023-06-23 10:01:47,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1459332.0, ans=0.125 2023-06-23 10:02:05,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.467e+02 4.705e+02 6.606e+02 1.008e+03 2.185e+03, threshold=1.321e+03, percent-clipped=11.0 2023-06-23 10:02:19,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1459392.0, ans=0.125 2023-06-23 10:02:58,166 INFO [train.py:996] (0/4) Epoch 8, batch 29800, loss[loss=0.2185, simple_loss=0.2944, pruned_loss=0.0713, over 21528.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3213, pruned_loss=0.08296, over 4286188.77 frames. ], batch size: 194, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 10:03:34,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1459692.0, ans=0.1 2023-06-23 10:04:18,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1459812.0, ans=0.125 2023-06-23 10:04:30,522 INFO [train.py:996] (0/4) Epoch 8, batch 29850, loss[loss=0.201, simple_loss=0.281, pruned_loss=0.06054, over 21871.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3154, pruned_loss=0.08067, over 4285475.71 frames. ], batch size: 316, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 10:05:03,044 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-23 10:05:16,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.132e+02 5.162e+02 6.765e+02 9.039e+02 1.623e+03, threshold=1.353e+03, percent-clipped=3.0 2023-06-23 10:05:30,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-23 10:05:42,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=22.5 2023-06-23 10:05:59,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.30 vs. limit=15.0 2023-06-23 10:06:08,507 INFO [train.py:996] (0/4) Epoch 8, batch 29900, loss[loss=0.2739, simple_loss=0.3356, pruned_loss=0.1061, over 21329.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3138, pruned_loss=0.08181, over 4290253.94 frames. ], batch size: 176, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 10:06:20,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1460172.0, ans=0.0 2023-06-23 10:06:49,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1460292.0, ans=0.1 2023-06-23 10:07:48,913 INFO [train.py:996] (0/4) Epoch 8, batch 29950, loss[loss=0.2099, simple_loss=0.2778, pruned_loss=0.07103, over 20263.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3178, pruned_loss=0.08568, over 4277366.36 frames. 
], batch size: 707, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:07:49,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1460472.0, ans=0.2 2023-06-23 10:08:03,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1460472.0, ans=0.1 2023-06-23 10:08:36,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1460592.0, ans=0.125 2023-06-23 10:08:45,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.472e+02 5.581e+02 7.305e+02 1.013e+03 2.177e+03, threshold=1.461e+03, percent-clipped=7.0 2023-06-23 10:08:58,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1460652.0, ans=0.125 2023-06-23 10:09:34,056 INFO [train.py:996] (0/4) Epoch 8, batch 30000, loss[loss=0.2362, simple_loss=0.3322, pruned_loss=0.07009, over 21646.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3201, pruned_loss=0.0859, over 4285214.46 frames. ], batch size: 389, lr: 3.61e-03, grad_scale: 32.0 2023-06-23 10:09:34,058 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 10:09:49,544 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.8598, 1.7672, 2.8788, 3.0783], device='cuda:0') 2023-06-23 10:09:54,201 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.244, simple_loss=0.3443, pruned_loss=0.07188, over 1796401.00 frames. 2023-06-23 10:09:54,202 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 10:10:18,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1460832.0, ans=0.125 2023-06-23 10:10:23,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1460832.0, ans=0.125 2023-06-23 10:10:23,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1460832.0, ans=0.125 2023-06-23 10:11:23,319 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-23 10:11:46,629 INFO [train.py:996] (0/4) Epoch 8, batch 30050, loss[loss=0.2828, simple_loss=0.3935, pruned_loss=0.08601, over 21636.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3238, pruned_loss=0.08292, over 4275853.06 frames. ], batch size: 414, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:12:26,000 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.441e+02 4.815e+02 7.305e+02 9.705e+02 3.214e+03, threshold=1.461e+03, percent-clipped=9.0 2023-06-23 10:12:38,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-06-23 10:12:48,386 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-23 10:13:26,311 INFO [train.py:996] (0/4) Epoch 8, batch 30100, loss[loss=0.2599, simple_loss=0.3069, pruned_loss=0.1065, over 21367.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3247, pruned_loss=0.08346, over 4269002.46 frames. 
], batch size: 144, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:13:41,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1461432.0, ans=0.2 2023-06-23 10:14:13,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1461492.0, ans=0.1 2023-06-23 10:14:33,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1461552.0, ans=0.0 2023-06-23 10:15:05,463 INFO [train.py:996] (0/4) Epoch 8, batch 30150, loss[loss=0.2685, simple_loss=0.3272, pruned_loss=0.1049, over 21659.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3194, pruned_loss=0.08458, over 4268527.94 frames. ], batch size: 351, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:15:21,168 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.91 vs. limit=15.0 2023-06-23 10:15:45,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-23 10:15:59,657 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.389e+02 4.498e+02 5.519e+02 7.633e+02 1.440e+03, threshold=1.104e+03, percent-clipped=0.0 2023-06-23 10:16:24,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1461852.0, ans=0.125 2023-06-23 10:16:32,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1461912.0, ans=0.125 2023-06-23 10:16:34,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1461912.0, ans=10.0 2023-06-23 10:16:47,032 INFO [train.py:996] (0/4) Epoch 8, batch 30200, loss[loss=0.233, simple_loss=0.3223, pruned_loss=0.07187, over 21645.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3222, pruned_loss=0.08369, over 4270498.74 frames. ], batch size: 263, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:17:28,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1462032.0, ans=0.025 2023-06-23 10:18:16,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2023-06-23 10:18:17,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.91 vs. limit=5.0 2023-06-23 10:18:27,143 INFO [train.py:996] (0/4) Epoch 8, batch 30250, loss[loss=0.3269, simple_loss=0.4022, pruned_loss=0.1257, over 21508.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3298, pruned_loss=0.08669, over 4274205.99 frames. ], batch size: 471, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:18:39,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.55 vs. 
limit=15.0 2023-06-23 10:19:01,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1462332.0, ans=0.1 2023-06-23 10:19:25,070 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.543e+02 5.739e+02 8.338e+02 1.276e+03 3.132e+03, threshold=1.668e+03, percent-clipped=33.0 2023-06-23 10:19:52,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-23 10:19:57,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1462512.0, ans=0.125 2023-06-23 10:20:10,531 INFO [train.py:996] (0/4) Epoch 8, batch 30300, loss[loss=0.2112, simple_loss=0.2705, pruned_loss=0.07597, over 21519.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.327, pruned_loss=0.08623, over 4266972.77 frames. ], batch size: 414, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:20:14,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1462572.0, ans=0.2 2023-06-23 10:20:29,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1462572.0, ans=0.95 2023-06-23 10:20:47,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1462632.0, ans=0.02 2023-06-23 10:20:56,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.60 vs. limit=15.0 2023-06-23 10:21:07,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1462692.0, ans=0.0 2023-06-23 10:21:12,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1462692.0, ans=0.05 2023-06-23 10:21:27,293 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:21:28,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1462752.0, ans=0.125 2023-06-23 10:22:08,468 INFO [train.py:996] (0/4) Epoch 8, batch 30350, loss[loss=0.3576, simple_loss=0.4376, pruned_loss=0.1388, over 21469.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.326, pruned_loss=0.08734, over 4262519.04 frames. ], batch size: 471, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:22:21,306 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=12.0 2023-06-23 10:22:43,068 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.418e+02 5.113e+02 8.159e+02 1.296e+03 2.782e+03, threshold=1.632e+03, percent-clipped=10.0 2023-06-23 10:23:05,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1463112.0, ans=0.0 2023-06-23 10:23:21,547 INFO [train.py:996] (0/4) Epoch 8, batch 30400, loss[loss=0.2464, simple_loss=0.2962, pruned_loss=0.09832, over 20263.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3209, pruned_loss=0.08558, over 4249428.05 frames. 
], batch size: 703, lr: 3.61e-03, grad_scale: 32.0 2023-06-23 10:24:22,436 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:24:32,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1463412.0, ans=0.0 2023-06-23 10:24:45,444 INFO [train.py:996] (0/4) Epoch 8, batch 30450, loss[loss=0.2863, simple_loss=0.402, pruned_loss=0.08527, over 19869.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3216, pruned_loss=0.08443, over 4192924.22 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:25:11,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.69 vs. limit=15.0 2023-06-23 10:25:18,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1463592.0, ans=10.0 2023-06-23 10:25:24,746 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.983e+02 7.886e+02 1.299e+03 2.180e+03 7.301e+03, threshold=2.598e+03, percent-clipped=35.0 2023-06-23 10:25:26,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1463592.0, ans=0.04949747468305833 2023-06-23 10:25:52,759 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/epoch-8.pt 2023-06-23 10:27:25,958 INFO [train.py:996] (0/4) Epoch 9, batch 0, loss[loss=0.2205, simple_loss=0.2908, pruned_loss=0.07509, over 21545.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2908, pruned_loss=0.07509, over 21545.00 frames. ], batch size: 391, lr: 3.39e-03, grad_scale: 32.0 2023-06-23 10:27:25,959 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 10:27:41,476 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2407, simple_loss=0.3498, pruned_loss=0.06579, over 1796401.00 frames. 2023-06-23 10:27:41,477 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 10:27:48,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1463742.0, ans=0.2 2023-06-23 10:28:03,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. 
limit=6.0 2023-06-23 10:28:05,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1463802.0, ans=0.0 2023-06-23 10:28:22,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1463862.0, ans=0.0 2023-06-23 10:28:42,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1463922.0, ans=0.125 2023-06-23 10:28:57,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1463922.0, ans=0.125 2023-06-23 10:28:57,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1463922.0, ans=0.125 2023-06-23 10:29:07,378 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-244000.pt 2023-06-23 10:29:20,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1464042.0, ans=0.0 2023-06-23 10:29:21,253 INFO [train.py:996] (0/4) Epoch 9, batch 50, loss[loss=0.2061, simple_loss=0.2778, pruned_loss=0.06717, over 21215.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3147, pruned_loss=0.08268, over 954721.06 frames. ], batch size: 159, lr: 3.39e-03, grad_scale: 32.0 2023-06-23 10:30:12,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1464162.0, ans=0.125 2023-06-23 10:30:22,215 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.385e+02 5.823e+02 9.334e+02 1.610e+03 5.016e+03, threshold=1.867e+03, percent-clipped=15.0 2023-06-23 10:30:22,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1464222.0, ans=0.125 2023-06-23 10:30:58,864 INFO [train.py:996] (0/4) Epoch 9, batch 100, loss[loss=0.2513, simple_loss=0.3483, pruned_loss=0.07718, over 19888.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.335, pruned_loss=0.0857, over 1682070.16 frames. ], batch size: 702, lr: 3.39e-03, grad_scale: 32.0 2023-06-23 10:31:49,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1464462.0, ans=0.125 2023-06-23 10:32:17,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1464522.0, ans=0.125 2023-06-23 10:32:23,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1464582.0, ans=0.125 2023-06-23 10:32:30,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1464582.0, ans=0.125 2023-06-23 10:32:35,524 INFO [train.py:996] (0/4) Epoch 9, batch 150, loss[loss=0.2131, simple_loss=0.2969, pruned_loss=0.06459, over 21368.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3377, pruned_loss=0.08588, over 2263264.68 frames. ], batch size: 194, lr: 3.39e-03, grad_scale: 16.0 2023-06-23 10:32:37,844 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. 
limit=15.0 2023-06-23 10:33:38,728 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.525e+02 5.241e+02 6.632e+02 9.762e+02 2.000e+03, threshold=1.326e+03, percent-clipped=1.0 2023-06-23 10:34:12,631 INFO [train.py:996] (0/4) Epoch 9, batch 200, loss[loss=0.2101, simple_loss=0.2981, pruned_loss=0.06103, over 21760.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3365, pruned_loss=0.0858, over 2712224.63 frames. ], batch size: 282, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:34:34,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1465002.0, ans=0.95 2023-06-23 10:34:55,461 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-23 10:35:10,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1465122.0, ans=0.125 2023-06-23 10:35:47,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1465182.0, ans=0.125 2023-06-23 10:35:50,369 INFO [train.py:996] (0/4) Epoch 9, batch 250, loss[loss=0.2177, simple_loss=0.3154, pruned_loss=0.06, over 21676.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3321, pruned_loss=0.08608, over 3059759.87 frames. ], batch size: 247, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:36:41,477 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.88 vs. limit=22.5 2023-06-23 10:36:47,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1465422.0, ans=0.0 2023-06-23 10:36:55,172 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 4.793e+02 6.542e+02 9.601e+02 1.948e+03, threshold=1.308e+03, percent-clipped=7.0 2023-06-23 10:37:01,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1465422.0, ans=0.125 2023-06-23 10:37:29,676 INFO [train.py:996] (0/4) Epoch 9, batch 300, loss[loss=0.2672, simple_loss=0.362, pruned_loss=0.08619, over 21757.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3238, pruned_loss=0.08522, over 3326626.36 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:37:30,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1465542.0, ans=0.125 2023-06-23 10:38:25,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1465662.0, ans=0.2 2023-06-23 10:38:58,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1465782.0, ans=0.0 2023-06-23 10:39:11,161 INFO [train.py:996] (0/4) Epoch 9, batch 350, loss[loss=0.1983, simple_loss=0.2639, pruned_loss=0.06631, over 21577.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3189, pruned_loss=0.08338, over 3533457.58 frames. 
], batch size: 298, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:39:28,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1465902.0, ans=0.2 2023-06-23 10:39:42,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1465902.0, ans=0.0 2023-06-23 10:40:16,457 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.272e+02 5.700e+02 8.166e+02 1.374e+03 3.481e+03, threshold=1.633e+03, percent-clipped=26.0 2023-06-23 10:40:30,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1466022.0, ans=0.125 2023-06-23 10:40:52,652 INFO [train.py:996] (0/4) Epoch 9, batch 400, loss[loss=0.2493, simple_loss=0.308, pruned_loss=0.09525, over 21370.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3127, pruned_loss=0.08177, over 3694619.39 frames. ], batch size: 473, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:41:00,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1466142.0, ans=0.125 2023-06-23 10:41:26,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1466202.0, ans=0.125 2023-06-23 10:41:26,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466202.0, ans=0.1 2023-06-23 10:42:05,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1466322.0, ans=0.125 2023-06-23 10:42:10,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1466322.0, ans=0.0 2023-06-23 10:42:13,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1466322.0, ans=0.0 2023-06-23 10:42:14,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1466322.0, ans=0.125 2023-06-23 10:42:35,013 INFO [train.py:996] (0/4) Epoch 9, batch 450, loss[loss=0.1983, simple_loss=0.2561, pruned_loss=0.0703, over 20225.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3128, pruned_loss=0.08066, over 3826090.48 frames. 
], batch size: 703, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:42:43,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1466442.0, ans=15.0 2023-06-23 10:42:45,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1466442.0, ans=0.0 2023-06-23 10:43:00,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466502.0, ans=0.1 2023-06-23 10:43:25,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1466562.0, ans=0.5 2023-06-23 10:43:40,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.611e+02 7.104e+02 9.718e+02 1.338e+03 3.704e+03, threshold=1.944e+03, percent-clipped=17.0 2023-06-23 10:43:44,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1466622.0, ans=0.2 2023-06-23 10:44:09,241 INFO [train.py:996] (0/4) Epoch 9, batch 500, loss[loss=0.189, simple_loss=0.274, pruned_loss=0.05197, over 21279.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3108, pruned_loss=0.07889, over 3934302.38 frames. ], batch size: 176, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:45:24,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-23 10:45:33,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1466982.0, ans=0.0 2023-06-23 10:45:47,796 INFO [train.py:996] (0/4) Epoch 9, batch 550, loss[loss=0.2329, simple_loss=0.3091, pruned_loss=0.07835, over 21533.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3142, pruned_loss=0.07804, over 4009287.92 frames. ], batch size: 230, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:45:53,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1467042.0, ans=10.0 2023-06-23 10:46:24,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1467102.0, ans=0.125 2023-06-23 10:46:53,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.317e+02 4.598e+02 6.514e+02 1.038e+03 2.454e+03, threshold=1.303e+03, percent-clipped=6.0 2023-06-23 10:47:00,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1467222.0, ans=0.0 2023-06-23 10:47:13,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1467282.0, ans=0.125 2023-06-23 10:47:22,496 INFO [train.py:996] (0/4) Epoch 9, batch 600, loss[loss=0.2587, simple_loss=0.3492, pruned_loss=0.08406, over 21429.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3191, pruned_loss=0.07931, over 4073619.83 frames. ], batch size: 211, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:47:33,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.44 vs. 
limit=22.5 2023-06-23 10:47:41,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1467402.0, ans=0.5 2023-06-23 10:47:42,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=16.65 vs. limit=15.0 2023-06-23 10:47:44,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1467402.0, ans=0.125 2023-06-23 10:47:53,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1467402.0, ans=0.125 2023-06-23 10:48:06,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1467402.0, ans=0.2 2023-06-23 10:48:28,136 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-23 10:48:42,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.65 vs. limit=10.0 2023-06-23 10:48:51,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1467582.0, ans=0.125 2023-06-23 10:49:00,664 INFO [train.py:996] (0/4) Epoch 9, batch 650, loss[loss=0.2017, simple_loss=0.2906, pruned_loss=0.05634, over 21730.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3187, pruned_loss=0.0797, over 4115711.11 frames. ], batch size: 282, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:49:03,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-06-23 10:49:58,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1467762.0, ans=0.1 2023-06-23 10:50:05,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1467822.0, ans=0.2 2023-06-23 10:50:07,003 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.298e+02 4.773e+02 6.710e+02 1.032e+03 2.196e+03, threshold=1.342e+03, percent-clipped=13.0 2023-06-23 10:50:36,069 INFO [train.py:996] (0/4) Epoch 9, batch 700, loss[loss=0.1953, simple_loss=0.2626, pruned_loss=0.06403, over 21213.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3174, pruned_loss=0.0809, over 4158621.47 frames. ], batch size: 143, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:50:37,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1467942.0, ans=0.125 2023-06-23 10:51:14,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1468002.0, ans=0.125 2023-06-23 10:51:41,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1468122.0, ans=0.125 2023-06-23 10:51:42,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.23 vs. 
limit=22.5 2023-06-23 10:51:43,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1468122.0, ans=0.2 2023-06-23 10:52:07,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1468182.0, ans=0.0 2023-06-23 10:52:09,596 INFO [train.py:996] (0/4) Epoch 9, batch 750, loss[loss=0.2503, simple_loss=0.3807, pruned_loss=0.05989, over 19803.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3166, pruned_loss=0.08066, over 4166889.00 frames. ], batch size: 703, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:53:10,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1468422.0, ans=0.0 2023-06-23 10:53:15,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.543e+02 5.132e+02 8.530e+02 1.237e+03 2.839e+03, threshold=1.706e+03, percent-clipped=17.0 2023-06-23 10:53:21,205 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.97 vs. limit=22.5 2023-06-23 10:53:33,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1468482.0, ans=0.125 2023-06-23 10:53:40,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=12.0 2023-06-23 10:53:44,253 INFO [train.py:996] (0/4) Epoch 9, batch 800, loss[loss=0.2429, simple_loss=0.3205, pruned_loss=0.08265, over 21544.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3145, pruned_loss=0.08184, over 4195931.33 frames. ], batch size: 441, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:53:48,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1468542.0, ans=0.125 2023-06-23 10:54:24,150 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.17 vs. limit=15.0 2023-06-23 10:54:59,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=1468722.0, ans=15.0 2023-06-23 10:55:19,776 INFO [train.py:996] (0/4) Epoch 9, batch 850, loss[loss=0.2284, simple_loss=0.2971, pruned_loss=0.0799, over 21187.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.314, pruned_loss=0.08186, over 4219949.50 frames. ], batch size: 176, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:55:30,540 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.78 vs. 
limit=5.0 2023-06-23 10:55:44,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1468902.0, ans=0.0 2023-06-23 10:55:49,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1468902.0, ans=0.2 2023-06-23 10:56:18,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1468962.0, ans=0.125 2023-06-23 10:56:23,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1468962.0, ans=0.0 2023-06-23 10:56:31,574 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.626e+02 5.922e+02 9.431e+02 1.406e+03 2.564e+03, threshold=1.886e+03, percent-clipped=15.0 2023-06-23 10:57:05,019 INFO [train.py:996] (0/4) Epoch 9, batch 900, loss[loss=0.2442, simple_loss=0.3014, pruned_loss=0.09355, over 21303.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3112, pruned_loss=0.08069, over 4233768.56 frames. ], batch size: 176, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:57:50,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1469262.0, ans=0.1 2023-06-23 10:57:55,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1469262.0, ans=0.125 2023-06-23 10:58:45,643 INFO [train.py:996] (0/4) Epoch 9, batch 950, loss[loss=0.2142, simple_loss=0.2749, pruned_loss=0.07671, over 21760.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3102, pruned_loss=0.08052, over 4248838.09 frames. ], batch size: 300, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:59:23,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1469502.0, ans=0.0 2023-06-23 10:59:53,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.620e+02 5.465e+02 8.299e+02 1.252e+03 2.692e+03, threshold=1.660e+03, percent-clipped=4.0 2023-06-23 10:59:58,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1469622.0, ans=0.125 2023-06-23 11:00:05,457 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-23 11:00:25,500 INFO [train.py:996] (0/4) Epoch 9, batch 1000, loss[loss=0.2265, simple_loss=0.3137, pruned_loss=0.06966, over 21609.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3074, pruned_loss=0.07944, over 4260450.61 frames. ], batch size: 441, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:01:13,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1469862.0, ans=0.125 2023-06-23 11:02:11,552 INFO [train.py:996] (0/4) Epoch 9, batch 1050, loss[loss=0.246, simple_loss=0.3117, pruned_loss=0.09014, over 21817.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3079, pruned_loss=0.07961, over 4270062.69 frames. 
], batch size: 441, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:02:12,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1470042.0, ans=0.2 2023-06-23 11:02:15,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1470042.0, ans=0.0 2023-06-23 11:02:58,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1470162.0, ans=0.1 2023-06-23 11:03:15,047 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.334e+02 4.887e+02 6.781e+02 8.518e+02 2.404e+03, threshold=1.356e+03, percent-clipped=1.0 2023-06-23 11:03:58,868 INFO [train.py:996] (0/4) Epoch 9, batch 1100, loss[loss=0.2096, simple_loss=0.2636, pruned_loss=0.07781, over 20147.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3073, pruned_loss=0.07872, over 4263032.14 frames. ], batch size: 702, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:04:02,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1470342.0, ans=0.0 2023-06-23 11:04:15,642 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0 2023-06-23 11:05:11,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1470582.0, ans=0.125 2023-06-23 11:05:43,365 INFO [train.py:996] (0/4) Epoch 9, batch 1150, loss[loss=0.234, simple_loss=0.3035, pruned_loss=0.08223, over 21880.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.307, pruned_loss=0.07839, over 4267950.59 frames. ], batch size: 124, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:06:15,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1470762.0, ans=0.0 2023-06-23 11:06:33,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-23 11:06:43,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.848e+02 5.352e+02 7.597e+02 1.030e+03 2.056e+03, threshold=1.519e+03, percent-clipped=12.0 2023-06-23 11:07:03,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1470882.0, ans=0.1 2023-06-23 11:07:24,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1470942.0, ans=0.07 2023-06-23 11:07:25,444 INFO [train.py:996] (0/4) Epoch 9, batch 1200, loss[loss=0.1922, simple_loss=0.2763, pruned_loss=0.05404, over 21503.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.309, pruned_loss=0.07939, over 4277614.36 frames. ], batch size: 212, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 11:07:36,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1470942.0, ans=0.0 2023-06-23 11:08:22,679 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.61 vs. 
limit=12.0 2023-06-23 11:08:47,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1471182.0, ans=0.0 2023-06-23 11:08:48,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.54 vs. limit=15.0 2023-06-23 11:08:56,225 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=22.5 2023-06-23 11:08:56,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.97 vs. limit=22.5 2023-06-23 11:09:01,905 INFO [train.py:996] (0/4) Epoch 9, batch 1250, loss[loss=0.2866, simple_loss=0.3723, pruned_loss=0.1005, over 21720.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3121, pruned_loss=0.08044, over 4286411.05 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 11:09:15,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1471242.0, ans=0.1 2023-06-23 11:09:21,551 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.29 vs. limit=15.0 2023-06-23 11:10:01,583 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.789e+02 4.897e+02 6.693e+02 9.449e+02 1.847e+03, threshold=1.339e+03, percent-clipped=0.0 2023-06-23 11:10:41,452 INFO [train.py:996] (0/4) Epoch 9, batch 1300, loss[loss=0.2397, simple_loss=0.3411, pruned_loss=0.06916, over 19870.00 frames. ], tot_loss[loss=0.236, simple_loss=0.311, pruned_loss=0.08051, over 4278134.96 frames. ], batch size: 703, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:10:53,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1471542.0, ans=0.125 2023-06-23 11:11:05,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1471602.0, ans=0.0 2023-06-23 11:11:29,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1471662.0, ans=0.0 2023-06-23 11:11:33,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1471722.0, ans=0.125 2023-06-23 11:12:12,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1471782.0, ans=0.125 2023-06-23 11:12:16,956 INFO [train.py:996] (0/4) Epoch 9, batch 1350, loss[loss=0.3133, simple_loss=0.3693, pruned_loss=0.1287, over 21425.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3137, pruned_loss=0.08182, over 4283164.79 frames. ], batch size: 509, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:12:19,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.82 vs. 
limit=22.5 2023-06-23 11:12:24,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1471842.0, ans=0.125 2023-06-23 11:12:33,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1471902.0, ans=0.0 2023-06-23 11:12:56,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1471962.0, ans=0.2 2023-06-23 11:13:16,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.586e+02 4.783e+02 6.688e+02 9.049e+02 1.938e+03, threshold=1.338e+03, percent-clipped=9.0 2023-06-23 11:13:56,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1472142.0, ans=0.125 2023-06-23 11:13:57,416 INFO [train.py:996] (0/4) Epoch 9, batch 1400, loss[loss=0.2286, simple_loss=0.2996, pruned_loss=0.07876, over 21916.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.313, pruned_loss=0.08198, over 4281403.61 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:14:34,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1472262.0, ans=0.1 2023-06-23 11:15:09,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1472322.0, ans=0.125 2023-06-23 11:15:21,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1472382.0, ans=0.0 2023-06-23 11:15:33,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1472382.0, ans=0.0 2023-06-23 11:15:39,600 INFO [train.py:996] (0/4) Epoch 9, batch 1450, loss[loss=0.2059, simple_loss=0.2695, pruned_loss=0.07116, over 21688.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3137, pruned_loss=0.08311, over 4278441.30 frames. ], batch size: 333, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:15:46,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1472442.0, ans=0.0 2023-06-23 11:16:23,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1472562.0, ans=0.0 2023-06-23 11:16:33,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1472622.0, ans=0.1 2023-06-23 11:16:44,342 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.412e+02 5.469e+02 7.594e+02 1.040e+03 1.854e+03, threshold=1.519e+03, percent-clipped=12.0 2023-06-23 11:16:56,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-23 11:17:20,417 INFO [train.py:996] (0/4) Epoch 9, batch 1500, loss[loss=0.2165, simple_loss=0.2759, pruned_loss=0.0786, over 21111.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3132, pruned_loss=0.0834, over 4283064.50 frames. ], batch size: 143, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:17:28,110 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-23 11:19:03,003 INFO [train.py:996] (0/4) Epoch 9, batch 1550, loss[loss=0.1647, simple_loss=0.2337, pruned_loss=0.0479, over 16578.00 frames. 
], tot_loss[loss=0.2382, simple_loss=0.3115, pruned_loss=0.0824, over 4282224.82 frames. ], batch size: 61, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:20:13,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1473222.0, ans=0.125 2023-06-23 11:20:14,286 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.435e+02 5.299e+02 6.765e+02 1.096e+03 1.841e+03, threshold=1.353e+03, percent-clipped=3.0 2023-06-23 11:20:16,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1473222.0, ans=0.125 2023-06-23 11:20:40,613 INFO [train.py:996] (0/4) Epoch 9, batch 1600, loss[loss=0.1557, simple_loss=0.2023, pruned_loss=0.05449, over 16393.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3124, pruned_loss=0.08232, over 4278784.57 frames. ], batch size: 60, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 11:21:10,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1473402.0, ans=0.0 2023-06-23 11:21:17,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1473462.0, ans=0.1 2023-06-23 11:22:15,876 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.15 vs. limit=15.0 2023-06-23 11:22:16,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1473582.0, ans=0.1 2023-06-23 11:22:23,029 INFO [train.py:996] (0/4) Epoch 9, batch 1650, loss[loss=0.2193, simple_loss=0.2895, pruned_loss=0.07458, over 21198.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3111, pruned_loss=0.08158, over 4278867.96 frames. ], batch size: 143, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:23:33,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1473822.0, ans=0.0 2023-06-23 11:23:41,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.489e+02 5.690e+02 7.589e+02 1.047e+03 2.202e+03, threshold=1.518e+03, percent-clipped=10.0 2023-06-23 11:24:06,482 INFO [train.py:996] (0/4) Epoch 9, batch 1700, loss[loss=0.2853, simple_loss=0.3391, pruned_loss=0.1158, over 21358.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3154, pruned_loss=0.08347, over 4279872.02 frames. ], batch size: 507, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:24:07,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1473942.0, ans=0.125 2023-06-23 11:24:14,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1473942.0, ans=0.0 2023-06-23 11:24:21,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1473942.0, ans=0.0 2023-06-23 11:25:54,639 INFO [train.py:996] (0/4) Epoch 9, batch 1750, loss[loss=0.258, simple_loss=0.3484, pruned_loss=0.08378, over 19866.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3172, pruned_loss=0.08227, over 4272663.12 frames. 
], batch size: 702, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:25:55,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1474242.0, ans=0.125 2023-06-23 11:26:02,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1474242.0, ans=0.125 2023-06-23 11:26:43,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1474362.0, ans=0.05 2023-06-23 11:27:02,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1474422.0, ans=0.0 2023-06-23 11:27:13,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.110e+02 6.517e+02 8.827e+02 1.421e+03 2.550e+03, threshold=1.765e+03, percent-clipped=23.0 2023-06-23 11:27:26,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-23 11:27:43,606 INFO [train.py:996] (0/4) Epoch 9, batch 1800, loss[loss=0.2283, simple_loss=0.3349, pruned_loss=0.06085, over 21746.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3141, pruned_loss=0.0788, over 4273221.16 frames. ], batch size: 332, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:28:38,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1474662.0, ans=0.125 2023-06-23 11:29:00,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1474722.0, ans=0.125 2023-06-23 11:29:15,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1474782.0, ans=0.125 2023-06-23 11:29:15,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1474782.0, ans=0.125 2023-06-23 11:29:25,361 INFO [train.py:996] (0/4) Epoch 9, batch 1850, loss[loss=0.216, simple_loss=0.3012, pruned_loss=0.0654, over 21846.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3141, pruned_loss=0.07656, over 4269656.34 frames. ], batch size: 282, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:29:58,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1474902.0, ans=0.125 2023-06-23 11:30:30,048 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=22.5 2023-06-23 11:30:37,026 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.456e+02 5.587e+02 8.108e+02 1.184e+03 2.810e+03, threshold=1.622e+03, percent-clipped=5.0 2023-06-23 11:30:53,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1475082.0, ans=0.05 2023-06-23 11:31:11,690 INFO [train.py:996] (0/4) Epoch 9, batch 1900, loss[loss=0.2085, simple_loss=0.278, pruned_loss=0.06946, over 21134.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3137, pruned_loss=0.07684, over 4270516.87 frames. 
], batch size: 143, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:31:40,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1475202.0, ans=0.0 2023-06-23 11:32:15,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1475322.0, ans=0.1 2023-06-23 11:32:17,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-23 11:32:49,647 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-23 11:32:58,415 INFO [train.py:996] (0/4) Epoch 9, batch 1950, loss[loss=0.2245, simple_loss=0.2892, pruned_loss=0.07989, over 21847.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3097, pruned_loss=0.07696, over 4275711.84 frames. ], batch size: 107, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:33:51,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1475622.0, ans=0.125 2023-06-23 11:34:00,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.571e+02 6.151e+02 9.472e+02 1.342e+03 2.834e+03, threshold=1.894e+03, percent-clipped=13.0 2023-06-23 11:34:37,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1475682.0, ans=0.125 2023-06-23 11:34:40,464 INFO [train.py:996] (0/4) Epoch 9, batch 2000, loss[loss=0.2905, simple_loss=0.3771, pruned_loss=0.1019, over 21823.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3055, pruned_loss=0.07505, over 4276437.14 frames. ], batch size: 372, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:36:15,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-23 11:36:16,276 INFO [train.py:996] (0/4) Epoch 9, batch 2050, loss[loss=0.2155, simple_loss=0.2983, pruned_loss=0.06629, over 21362.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3059, pruned_loss=0.07616, over 4278408.09 frames. ], batch size: 131, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:36:34,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1476102.0, ans=0.125 2023-06-23 11:37:17,617 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 5.522e+02 6.873e+02 9.848e+02 2.030e+03, threshold=1.375e+03, percent-clipped=1.0 2023-06-23 11:37:23,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1476222.0, ans=0.2 2023-06-23 11:37:56,798 INFO [train.py:996] (0/4) Epoch 9, batch 2100, loss[loss=0.3213, simple_loss=0.3743, pruned_loss=0.1341, over 21396.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3101, pruned_loss=0.07914, over 4279776.64 frames. 
], batch size: 471, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:37:57,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1476342.0, ans=0.125 2023-06-23 11:38:46,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1476462.0, ans=0.125 2023-06-23 11:39:18,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1476582.0, ans=0.2 2023-06-23 11:39:37,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1476582.0, ans=0.1 2023-06-23 11:39:39,895 INFO [train.py:996] (0/4) Epoch 9, batch 2150, loss[loss=0.2219, simple_loss=0.2923, pruned_loss=0.07574, over 21461.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3107, pruned_loss=0.08088, over 4279925.45 frames. ], batch size: 389, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:39:48,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1476642.0, ans=0.1 2023-06-23 11:40:16,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1476762.0, ans=0.125 2023-06-23 11:40:21,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.32 vs. limit=10.0 2023-06-23 11:40:30,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1476762.0, ans=0.1 2023-06-23 11:40:42,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.083e+02 6.104e+02 8.851e+02 1.376e+03 2.645e+03, threshold=1.770e+03, percent-clipped=25.0 2023-06-23 11:41:21,812 INFO [train.py:996] (0/4) Epoch 9, batch 2200, loss[loss=0.2011, simple_loss=0.2727, pruned_loss=0.06475, over 21806.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3123, pruned_loss=0.08128, over 4276691.19 frames. ], batch size: 118, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:41:45,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1477002.0, ans=0.125 2023-06-23 11:41:49,877 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:43:00,178 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.48 vs. limit=10.0 2023-06-23 11:43:02,295 INFO [train.py:996] (0/4) Epoch 9, batch 2250, loss[loss=0.2234, simple_loss=0.29, pruned_loss=0.07843, over 21722.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3103, pruned_loss=0.08013, over 4286499.56 frames. ], batch size: 371, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:43:43,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1477362.0, ans=0.125 2023-06-23 11:43:44,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1477362.0, ans=0.1 2023-06-23 11:43:47,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.08 vs. 
limit=15.0 2023-06-23 11:44:09,161 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.634e+02 5.493e+02 8.283e+02 1.333e+03 2.509e+03, threshold=1.657e+03, percent-clipped=6.0 2023-06-23 11:44:37,527 INFO [train.py:996] (0/4) Epoch 9, batch 2300, loss[loss=0.2827, simple_loss=0.3146, pruned_loss=0.1254, over 21469.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3054, pruned_loss=0.08014, over 4287816.22 frames. ], batch size: 511, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:44:43,270 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.87 vs. limit=15.0 2023-06-23 11:45:20,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1477662.0, ans=0.125 2023-06-23 11:46:18,230 INFO [train.py:996] (0/4) Epoch 9, batch 2350, loss[loss=0.2436, simple_loss=0.3009, pruned_loss=0.09319, over 21234.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3016, pruned_loss=0.07986, over 4277540.63 frames. ], batch size: 159, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:46:28,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1477842.0, ans=0.125 2023-06-23 11:47:33,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1478022.0, ans=0.125 2023-06-23 11:47:36,708 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.626e+02 5.310e+02 7.234e+02 1.027e+03 2.720e+03, threshold=1.447e+03, percent-clipped=6.0 2023-06-23 11:47:45,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1478082.0, ans=0.125 2023-06-23 11:47:45,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1478082.0, ans=0.125 2023-06-23 11:48:06,162 INFO [train.py:996] (0/4) Epoch 9, batch 2400, loss[loss=0.2636, simple_loss=0.3301, pruned_loss=0.09853, over 21500.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3056, pruned_loss=0.0825, over 4281195.41 frames. ], batch size: 112, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:48:26,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1478202.0, ans=0.0 2023-06-23 11:48:33,205 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:48:38,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1478262.0, ans=0.1 2023-06-23 11:48:48,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-23 11:49:18,141 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.52 vs. 
limit=15.0 2023-06-23 11:49:39,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1478382.0, ans=0.125 2023-06-23 11:49:41,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1478382.0, ans=15.0 2023-06-23 11:49:43,580 INFO [train.py:996] (0/4) Epoch 9, batch 2450, loss[loss=0.2874, simple_loss=0.3551, pruned_loss=0.1099, over 21809.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3081, pruned_loss=0.08394, over 4281551.75 frames. ], batch size: 441, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:49:51,304 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.84 vs. limit=10.0 2023-06-23 11:49:57,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1478442.0, ans=0.125 2023-06-23 11:50:13,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1478502.0, ans=0.125 2023-06-23 11:50:49,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1478622.0, ans=0.0 2023-06-23 11:50:55,452 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 5.454e+02 8.636e+02 1.143e+03 3.101e+03, threshold=1.727e+03, percent-clipped=10.0 2023-06-23 11:51:10,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1478682.0, ans=0.1 2023-06-23 11:51:24,333 INFO [train.py:996] (0/4) Epoch 9, batch 2500, loss[loss=0.2188, simple_loss=0.313, pruned_loss=0.06231, over 21693.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3078, pruned_loss=0.08297, over 4282835.14 frames. ], batch size: 332, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:51:30,684 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.88 vs. limit=15.0 2023-06-23 11:51:49,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1478802.0, ans=0.2 2023-06-23 11:51:49,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1478802.0, ans=0.125 2023-06-23 11:52:40,063 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:52:46,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1478982.0, ans=0.125 2023-06-23 11:53:05,827 INFO [train.py:996] (0/4) Epoch 9, batch 2550, loss[loss=0.2802, simple_loss=0.337, pruned_loss=0.1117, over 21779.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3075, pruned_loss=0.08186, over 4268856.78 frames. 
], batch size: 124, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:53:14,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1479042.0, ans=0.07 2023-06-23 11:53:42,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1479162.0, ans=0.0 2023-06-23 11:54:19,355 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.829e+02 7.204e+02 9.571e+02 1.455e+03 2.660e+03, threshold=1.914e+03, percent-clipped=10.0 2023-06-23 11:54:46,760 INFO [train.py:996] (0/4) Epoch 9, batch 2600, loss[loss=0.2428, simple_loss=0.3203, pruned_loss=0.08263, over 21448.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3102, pruned_loss=0.08381, over 4270713.72 frames. ], batch size: 131, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:54:58,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1479342.0, ans=0.125 2023-06-23 11:55:16,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1479402.0, ans=0.2 2023-06-23 11:55:39,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1479522.0, ans=0.125 2023-06-23 11:56:03,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1479522.0, ans=0.0 2023-06-23 11:56:28,274 INFO [train.py:996] (0/4) Epoch 9, batch 2650, loss[loss=0.2735, simple_loss=0.3424, pruned_loss=0.1023, over 21834.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.311, pruned_loss=0.08432, over 4275064.80 frames. ], batch size: 118, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:56:46,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1479702.0, ans=0.0 2023-06-23 11:56:59,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1479762.0, ans=0.1 2023-06-23 11:57:03,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1479762.0, ans=0.125 2023-06-23 11:57:37,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.783e+02 6.164e+02 7.850e+02 1.193e+03 2.220e+03, threshold=1.570e+03, percent-clipped=3.0 2023-06-23 11:57:49,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1479882.0, ans=0.1 2023-06-23 11:58:04,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1479942.0, ans=0.2 2023-06-23 11:58:05,294 INFO [train.py:996] (0/4) Epoch 9, batch 2700, loss[loss=0.2825, simple_loss=0.3513, pruned_loss=0.1068, over 21612.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.311, pruned_loss=0.08376, over 4282156.91 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:58:19,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-23 11:58:33,834 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. 
limit=15.0 2023-06-23 11:58:40,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-23 11:58:46,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1480062.0, ans=0.2 2023-06-23 11:59:19,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1480122.0, ans=0.1 2023-06-23 11:59:40,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1480182.0, ans=0.125 2023-06-23 11:59:43,052 INFO [train.py:996] (0/4) Epoch 9, batch 2750, loss[loss=0.2241, simple_loss=0.345, pruned_loss=0.05155, over 19726.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3108, pruned_loss=0.08314, over 4283653.96 frames. ], batch size: 702, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:59:53,212 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:00:00,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1480302.0, ans=0.125 2023-06-23 12:00:07,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1480302.0, ans=0.125 2023-06-23 12:00:54,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=15.0 2023-06-23 12:00:58,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.035e+02 5.584e+02 7.738e+02 1.130e+03 2.409e+03, threshold=1.548e+03, percent-clipped=8.0 2023-06-23 12:01:19,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1480482.0, ans=0.125 2023-06-23 12:01:27,107 INFO [train.py:996] (0/4) Epoch 9, batch 2800, loss[loss=0.1886, simple_loss=0.2636, pruned_loss=0.05683, over 21435.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3175, pruned_loss=0.08489, over 4288285.45 frames. ], batch size: 212, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 12:02:28,399 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=22.5 2023-06-23 12:03:09,792 INFO [train.py:996] (0/4) Epoch 9, batch 2850, loss[loss=0.2579, simple_loss=0.3584, pruned_loss=0.07867, over 20737.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3177, pruned_loss=0.08543, over 4290128.26 frames. ], batch size: 607, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:03:48,218 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.13 vs. 
limit=22.5 2023-06-23 12:04:22,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1481022.0, ans=0.0 2023-06-23 12:04:25,146 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.732e+02 5.955e+02 8.768e+02 1.383e+03 2.997e+03, threshold=1.754e+03, percent-clipped=21.0 2023-06-23 12:04:44,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1481082.0, ans=0.125 2023-06-23 12:04:45,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.05 vs. limit=10.0 2023-06-23 12:04:50,688 INFO [train.py:996] (0/4) Epoch 9, batch 2900, loss[loss=0.23, simple_loss=0.3074, pruned_loss=0.0763, over 21804.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3135, pruned_loss=0.08359, over 4282945.75 frames. ], batch size: 112, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:04:51,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1481142.0, ans=0.125 2023-06-23 12:05:18,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1481202.0, ans=0.1 2023-06-23 12:05:18,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1481202.0, ans=0.0 2023-06-23 12:06:06,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1481322.0, ans=0.125 2023-06-23 12:06:19,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1481382.0, ans=0.04949747468305833 2023-06-23 12:06:31,730 INFO [train.py:996] (0/4) Epoch 9, batch 2950, loss[loss=0.2092, simple_loss=0.2926, pruned_loss=0.0629, over 21367.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3147, pruned_loss=0.08421, over 4288747.97 frames. ], batch size: 131, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:06:32,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1481442.0, ans=0.0 2023-06-23 12:06:35,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1481442.0, ans=0.05 2023-06-23 12:06:52,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1481502.0, ans=0.0 2023-06-23 12:07:19,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1481562.0, ans=0.125 2023-06-23 12:07:45,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.81 vs. 
limit=15.0 2023-06-23 12:07:48,168 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.698e+02 5.526e+02 7.206e+02 1.005e+03 1.804e+03, threshold=1.441e+03, percent-clipped=1.0 2023-06-23 12:08:05,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1481682.0, ans=0.125 2023-06-23 12:08:07,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1481742.0, ans=0.2 2023-06-23 12:08:08,703 INFO [train.py:996] (0/4) Epoch 9, batch 3000, loss[loss=0.258, simple_loss=0.3286, pruned_loss=0.09368, over 21752.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3177, pruned_loss=0.08444, over 4284175.68 frames. ], batch size: 332, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:08:08,704 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 12:08:24,856 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2522, simple_loss=0.3459, pruned_loss=0.07924, over 1796401.00 frames. 2023-06-23 12:08:24,857 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 12:08:37,194 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:09:11,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1481802.0, ans=0.125 2023-06-23 12:09:16,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1481862.0, ans=0.2 2023-06-23 12:09:44,362 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-06-23 12:10:09,601 INFO [train.py:996] (0/4) Epoch 9, batch 3050, loss[loss=0.2101, simple_loss=0.3014, pruned_loss=0.05942, over 21740.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3184, pruned_loss=0.08299, over 4282160.47 frames. ], batch size: 351, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:11:24,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.590e+02 6.164e+02 8.184e+02 1.174e+03 2.237e+03, threshold=1.637e+03, percent-clipped=13.0 2023-06-23 12:11:44,959 INFO [train.py:996] (0/4) Epoch 9, batch 3100, loss[loss=0.2047, simple_loss=0.2886, pruned_loss=0.06042, over 21566.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.317, pruned_loss=0.08137, over 4291608.95 frames. ], batch size: 230, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:11:51,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1482342.0, ans=0.1 2023-06-23 12:12:49,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1482462.0, ans=0.125 2023-06-23 12:13:36,668 INFO [train.py:996] (0/4) Epoch 9, batch 3150, loss[loss=0.2692, simple_loss=0.3475, pruned_loss=0.09547, over 21442.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3174, pruned_loss=0.08133, over 4289295.13 frames. 
], batch size: 471, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:14:00,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1482702.0, ans=0.125 2023-06-23 12:14:18,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1482762.0, ans=0.125 2023-06-23 12:14:31,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1482762.0, ans=0.0 2023-06-23 12:14:33,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1482762.0, ans=0.0 2023-06-23 12:14:45,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.77 vs. limit=8.0 2023-06-23 12:14:47,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.159e+02 5.985e+02 8.535e+02 1.297e+03 2.485e+03, threshold=1.707e+03, percent-clipped=14.0 2023-06-23 12:15:24,485 INFO [train.py:996] (0/4) Epoch 9, batch 3200, loss[loss=0.2207, simple_loss=0.3012, pruned_loss=0.07009, over 21785.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.316, pruned_loss=0.08112, over 4287113.34 frames. ], batch size: 247, lr: 3.36e-03, grad_scale: 32.0 2023-06-23 12:15:25,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1482942.0, ans=0.125 2023-06-23 12:15:26,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1482942.0, ans=0.1 2023-06-23 12:15:34,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1482942.0, ans=0.125 2023-06-23 12:15:49,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2023-06-23 12:15:55,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1483002.0, ans=0.2 2023-06-23 12:16:00,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1483062.0, ans=0.0 2023-06-23 12:16:16,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1483122.0, ans=0.125 2023-06-23 12:16:22,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=22.5 2023-06-23 12:16:36,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1483122.0, ans=0.2 2023-06-23 12:16:44,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1483182.0, ans=0.035 2023-06-23 12:16:53,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1483182.0, ans=0.0 2023-06-23 12:16:55,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1483182.0, ans=0.125 2023-06-23 12:16:59,998 INFO [train.py:996] (0/4) Epoch 9, batch 3250, loss[loss=0.2221, simple_loss=0.2772, pruned_loss=0.0835, over 21394.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3179, pruned_loss=0.08343, over 4282554.35 frames. 
], batch size: 194, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:17:20,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1483302.0, ans=0.0 2023-06-23 12:17:57,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1483422.0, ans=0.2 2023-06-23 12:18:07,946 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=12.0 2023-06-23 12:18:19,935 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.506e+02 4.912e+02 6.771e+02 1.025e+03 2.208e+03, threshold=1.354e+03, percent-clipped=1.0 2023-06-23 12:18:23,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1483482.0, ans=0.2 2023-06-23 12:18:30,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1483482.0, ans=0.125 2023-06-23 12:18:30,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1483482.0, ans=0.125 2023-06-23 12:18:43,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=22.5 2023-06-23 12:18:44,469 INFO [train.py:996] (0/4) Epoch 9, batch 3300, loss[loss=0.2237, simple_loss=0.3092, pruned_loss=0.06911, over 21346.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3141, pruned_loss=0.0827, over 4275340.49 frames. ], batch size: 211, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:18:47,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1483542.0, ans=0.125 2023-06-23 12:19:06,497 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-23 12:20:09,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1483782.0, ans=0.125 2023-06-23 12:20:25,116 INFO [train.py:996] (0/4) Epoch 9, batch 3350, loss[loss=0.2624, simple_loss=0.3438, pruned_loss=0.09051, over 21597.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3169, pruned_loss=0.08402, over 4280556.62 frames. ], batch size: 389, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:20:35,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1483842.0, ans=0.0 2023-06-23 12:21:35,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-23 12:21:40,408 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.733e+02 6.031e+02 9.696e+02 1.341e+03 2.502e+03, threshold=1.939e+03, percent-clipped=21.0 2023-06-23 12:22:04,014 INFO [train.py:996] (0/4) Epoch 9, batch 3400, loss[loss=0.2681, simple_loss=0.3373, pruned_loss=0.09941, over 21568.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3185, pruned_loss=0.08526, over 4288563.12 frames. 
], batch size: 441, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:23:29,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1484382.0, ans=0.0 2023-06-23 12:23:44,219 INFO [train.py:996] (0/4) Epoch 9, batch 3450, loss[loss=0.2291, simple_loss=0.2908, pruned_loss=0.08367, over 21871.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3123, pruned_loss=0.08362, over 4283110.28 frames. ], batch size: 98, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:23:44,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1484442.0, ans=0.0 2023-06-23 12:23:44,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1484442.0, ans=0.125 2023-06-23 12:24:06,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1484502.0, ans=0.125 2023-06-23 12:24:21,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.07 vs. limit=15.0 2023-06-23 12:24:42,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1484622.0, ans=0.125 2023-06-23 12:25:01,680 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.990e+02 5.635e+02 8.079e+02 1.246e+03 2.546e+03, threshold=1.616e+03, percent-clipped=4.0 2023-06-23 12:25:21,113 INFO [train.py:996] (0/4) Epoch 9, batch 3500, loss[loss=0.2744, simple_loss=0.354, pruned_loss=0.0974, over 21737.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3187, pruned_loss=0.08647, over 4273738.46 frames. ], batch size: 351, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:25:43,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1484802.0, ans=0.125 2023-06-23 12:25:48,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1484802.0, ans=0.2 2023-06-23 12:25:48,882 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0 2023-06-23 12:26:22,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.43 vs. limit=15.0 2023-06-23 12:26:34,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1484922.0, ans=0.2 2023-06-23 12:26:45,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.75 vs. limit=15.0 2023-06-23 12:26:55,746 INFO [train.py:996] (0/4) Epoch 9, batch 3550, loss[loss=0.2186, simple_loss=0.3119, pruned_loss=0.06268, over 20959.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3226, pruned_loss=0.08783, over 4276197.65 frames. 
], batch size: 607, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:27:17,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1485102.0, ans=0.0 2023-06-23 12:27:47,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1485162.0, ans=0.05 2023-06-23 12:28:10,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.810e+02 5.500e+02 7.334e+02 1.032e+03 1.807e+03, threshold=1.467e+03, percent-clipped=3.0 2023-06-23 12:28:29,464 INFO [train.py:996] (0/4) Epoch 9, batch 3600, loss[loss=0.273, simple_loss=0.3376, pruned_loss=0.1042, over 21581.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3166, pruned_loss=0.08707, over 4268653.14 frames. ], batch size: 415, lr: 3.36e-03, grad_scale: 32.0 2023-06-23 12:28:48,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1485342.0, ans=0.0 2023-06-23 12:29:50,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1485522.0, ans=0.125 2023-06-23 12:30:07,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=22.5 2023-06-23 12:30:11,175 INFO [train.py:996] (0/4) Epoch 9, batch 3650, loss[loss=0.2578, simple_loss=0.3504, pruned_loss=0.08266, over 21700.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3194, pruned_loss=0.08849, over 4270965.11 frames. ], batch size: 441, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:30:33,794 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-23 12:30:34,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1485642.0, ans=0.125 2023-06-23 12:30:52,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1485702.0, ans=0.0 2023-06-23 12:30:58,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1485762.0, ans=0.1 2023-06-23 12:31:10,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1485762.0, ans=0.05 2023-06-23 12:31:34,331 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.982e+02 5.571e+02 7.883e+02 1.166e+03 2.519e+03, threshold=1.577e+03, percent-clipped=13.0 2023-06-23 12:31:52,206 INFO [train.py:996] (0/4) Epoch 9, batch 3700, loss[loss=0.258, simple_loss=0.3304, pruned_loss=0.09286, over 21330.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.321, pruned_loss=0.0882, over 4271950.07 frames. 
], batch size: 548, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:32:36,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1486002.0, ans=0.0 2023-06-23 12:32:44,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1486062.0, ans=0.2 2023-06-23 12:32:47,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1486062.0, ans=0.025 2023-06-23 12:32:55,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1486062.0, ans=0.07 2023-06-23 12:32:59,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1486122.0, ans=0.0 2023-06-23 12:33:16,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1486182.0, ans=0.2 2023-06-23 12:33:25,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2023-06-23 12:33:26,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1486182.0, ans=0.125 2023-06-23 12:33:41,736 INFO [train.py:996] (0/4) Epoch 9, batch 3750, loss[loss=0.2462, simple_loss=0.3243, pruned_loss=0.0841, over 21604.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3181, pruned_loss=0.08755, over 4277806.14 frames. ], batch size: 508, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:34:04,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-23 12:34:24,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1486362.0, ans=0.0 2023-06-23 12:34:27,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1486362.0, ans=0.1 2023-06-23 12:34:32,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.97 vs. limit=10.0 2023-06-23 12:34:51,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1486422.0, ans=0.1 2023-06-23 12:34:54,210 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.489e+02 5.333e+02 7.689e+02 1.174e+03 2.476e+03, threshold=1.538e+03, percent-clipped=10.0 2023-06-23 12:35:22,201 INFO [train.py:996] (0/4) Epoch 9, batch 3800, loss[loss=0.2365, simple_loss=0.3179, pruned_loss=0.07752, over 21710.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3168, pruned_loss=0.08559, over 4270507.28 frames. 
], batch size: 332, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:35:41,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1486602.0, ans=0.125 2023-06-23 12:35:41,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1486602.0, ans=0.125 2023-06-23 12:35:52,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1486602.0, ans=0.0 2023-06-23 12:36:20,218 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.37 vs. limit=15.0 2023-06-23 12:36:20,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1486722.0, ans=0.1 2023-06-23 12:36:55,993 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-23 12:36:56,391 INFO [train.py:996] (0/4) Epoch 9, batch 3850, loss[loss=0.1945, simple_loss=0.2568, pruned_loss=0.06606, over 21991.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3126, pruned_loss=0.08465, over 4270052.09 frames. ], batch size: 103, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:37:31,862 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-23 12:37:33,733 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-23 12:37:34,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1486962.0, ans=0.125 2023-06-23 12:37:36,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1486962.0, ans=10.0 2023-06-23 12:38:03,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.513e+02 4.770e+02 6.158e+02 8.423e+02 1.897e+03, threshold=1.232e+03, percent-clipped=2.0 2023-06-23 12:38:04,385 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5 2023-06-23 12:38:25,963 INFO [train.py:996] (0/4) Epoch 9, batch 3900, loss[loss=0.2472, simple_loss=0.3301, pruned_loss=0.08219, over 21335.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3081, pruned_loss=0.08371, over 4269527.27 frames. ], batch size: 548, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:38:55,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1487202.0, ans=0.0 2023-06-23 12:39:15,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1487262.0, ans=0.125 2023-06-23 12:39:21,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1487262.0, ans=0.07 2023-06-23 12:39:34,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1487322.0, ans=0.0 2023-06-23 12:40:09,698 INFO [train.py:996] (0/4) Epoch 9, batch 3950, loss[loss=0.2458, simple_loss=0.319, pruned_loss=0.0863, over 19915.00 frames. 
], tot_loss[loss=0.2372, simple_loss=0.3086, pruned_loss=0.08284, over 4270622.74 frames. ], batch size: 703, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:41:16,572 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.422e+02 5.095e+02 8.161e+02 1.017e+03 2.071e+03, threshold=1.632e+03, percent-clipped=17.0 2023-06-23 12:41:49,085 INFO [train.py:996] (0/4) Epoch 9, batch 4000, loss[loss=0.2565, simple_loss=0.3481, pruned_loss=0.08244, over 20764.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.304, pruned_loss=0.08006, over 4270436.71 frames. ], batch size: 608, lr: 3.36e-03, grad_scale: 32.0 2023-06-23 12:42:14,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1487802.0, ans=0.125 2023-06-23 12:42:52,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1487922.0, ans=0.125 2023-06-23 12:43:11,545 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-248000.pt 2023-06-23 12:43:29,600 INFO [train.py:996] (0/4) Epoch 9, batch 4050, loss[loss=0.2601, simple_loss=0.354, pruned_loss=0.08315, over 21611.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3033, pruned_loss=0.07853, over 4277239.21 frames. ], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:43:42,371 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.36 vs. limit=10.0 2023-06-23 12:43:45,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1488042.0, ans=0.125 2023-06-23 12:44:00,029 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:44:07,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1488162.0, ans=0.125 2023-06-23 12:44:21,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1488222.0, ans=0.0 2023-06-23 12:44:37,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1488222.0, ans=0.5 2023-06-23 12:44:37,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1488222.0, ans=0.2 2023-06-23 12:44:48,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-23 12:44:48,945 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.596e+02 4.877e+02 6.874e+02 9.034e+02 2.185e+03, threshold=1.375e+03, percent-clipped=7.0 2023-06-23 12:44:54,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-23 12:45:09,757 INFO [train.py:996] (0/4) Epoch 9, batch 4100, loss[loss=0.2502, simple_loss=0.331, pruned_loss=0.08473, over 19922.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3048, pruned_loss=0.07916, over 4278853.38 frames. 
], batch size: 702, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:46:36,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1488582.0, ans=0.1 2023-06-23 12:46:54,915 INFO [train.py:996] (0/4) Epoch 9, batch 4150, loss[loss=0.1881, simple_loss=0.2907, pruned_loss=0.0427, over 21656.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3058, pruned_loss=0.07702, over 4276049.76 frames. ], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:46:56,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1488642.0, ans=0.1 2023-06-23 12:46:59,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1488642.0, ans=0.125 2023-06-23 12:47:00,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1488642.0, ans=0.1 2023-06-23 12:47:03,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1488642.0, ans=0.125 2023-06-23 12:47:06,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1488642.0, ans=0.0 2023-06-23 12:47:13,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-06-23 12:47:30,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1488762.0, ans=0.125 2023-06-23 12:48:10,697 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.593e+02 5.949e+02 7.437e+02 1.328e+03 3.049e+03, threshold=1.487e+03, percent-clipped=21.0 2023-06-23 12:48:19,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1488882.0, ans=0.125 2023-06-23 12:48:32,532 INFO [train.py:996] (0/4) Epoch 9, batch 4200, loss[loss=0.2468, simple_loss=0.3422, pruned_loss=0.07566, over 21700.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3067, pruned_loss=0.07777, over 4280265.53 frames. ], batch size: 332, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:48:57,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1489002.0, ans=0.0 2023-06-23 12:49:38,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1489122.0, ans=0.2 2023-06-23 12:50:14,814 INFO [train.py:996] (0/4) Epoch 9, batch 4250, loss[loss=0.2712, simple_loss=0.3444, pruned_loss=0.09896, over 21808.00 frames. ], tot_loss[loss=0.237, simple_loss=0.314, pruned_loss=0.07998, over 4282744.68 frames. ], batch size: 298, lr: 3.36e-03, grad_scale: 8.0 2023-06-23 12:50:47,533 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-23 12:51:43,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.972e+02 6.108e+02 8.578e+02 1.174e+03 2.664e+03, threshold=1.716e+03, percent-clipped=12.0 2023-06-23 12:51:58,839 INFO [train.py:996] (0/4) Epoch 9, batch 4300, loss[loss=0.2805, simple_loss=0.397, pruned_loss=0.08203, over 21264.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3186, pruned_loss=0.08095, over 4277263.67 frames. 
], batch size: 549, lr: 3.36e-03, grad_scale: 8.0 2023-06-23 12:52:06,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1489542.0, ans=0.125 2023-06-23 12:53:10,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1489722.0, ans=0.0 2023-06-23 12:53:40,726 INFO [train.py:996] (0/4) Epoch 9, batch 4350, loss[loss=0.2414, simple_loss=0.3066, pruned_loss=0.08813, over 21790.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3182, pruned_loss=0.08046, over 4271137.21 frames. ], batch size: 107, lr: 3.36e-03, grad_scale: 8.0 2023-06-23 12:54:01,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1489842.0, ans=0.125 2023-06-23 12:54:07,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1489902.0, ans=0.0 2023-06-23 12:54:29,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1489902.0, ans=0.0 2023-06-23 12:54:31,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1489962.0, ans=0.125 2023-06-23 12:54:49,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1490022.0, ans=0.125 2023-06-23 12:55:05,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.397e+02 5.616e+02 8.582e+02 1.448e+03 3.184e+03, threshold=1.716e+03, percent-clipped=15.0 2023-06-23 12:55:07,132 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.83 vs. limit=10.0 2023-06-23 12:55:30,738 INFO [train.py:996] (0/4) Epoch 9, batch 4400, loss[loss=0.2888, simple_loss=0.3637, pruned_loss=0.1069, over 21479.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3148, pruned_loss=0.08004, over 4265464.45 frames. ], batch size: 508, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:56:39,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1490322.0, ans=0.0 2023-06-23 12:57:02,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-23 12:57:12,859 INFO [train.py:996] (0/4) Epoch 9, batch 4450, loss[loss=0.288, simple_loss=0.3837, pruned_loss=0.09613, over 21698.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3239, pruned_loss=0.08228, over 4267608.62 frames. 
], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:57:31,096 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:58:05,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1490562.0, ans=0.125 2023-06-23 12:58:32,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1490682.0, ans=0.1 2023-06-23 12:58:39,560 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.863e+02 6.461e+02 1.015e+03 1.659e+03 5.524e+03, threshold=2.029e+03, percent-clipped=20.0 2023-06-23 12:58:45,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1490682.0, ans=0.025 2023-06-23 12:58:54,109 INFO [train.py:996] (0/4) Epoch 9, batch 4500, loss[loss=0.21, simple_loss=0.2911, pruned_loss=0.06447, over 21480.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3255, pruned_loss=0.0844, over 4269029.34 frames. ], batch size: 194, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:58:59,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1490742.0, ans=0.125 2023-06-23 12:59:06,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1490742.0, ans=0.04949747468305833 2023-06-23 12:59:21,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1490802.0, ans=15.0 2023-06-23 12:59:24,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1490802.0, ans=0.125 2023-06-23 12:59:34,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1490802.0, ans=0.2 2023-06-23 12:59:55,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1490922.0, ans=0.1 2023-06-23 12:59:56,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1490922.0, ans=0.0 2023-06-23 13:00:42,566 INFO [train.py:996] (0/4) Epoch 9, batch 4550, loss[loss=0.3393, simple_loss=0.3941, pruned_loss=0.1422, over 21361.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3278, pruned_loss=0.08519, over 4274276.98 frames. 
], batch size: 507, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 13:00:43,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1491042.0, ans=0.0 2023-06-23 13:00:56,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1491042.0, ans=0.125 2023-06-23 13:01:17,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1491102.0, ans=0.1 2023-06-23 13:01:46,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1491222.0, ans=0.2 2023-06-23 13:01:53,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1491222.0, ans=0.04949747468305833 2023-06-23 13:02:03,741 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 4.940e+02 6.297e+02 8.588e+02 1.962e+03, threshold=1.259e+03, percent-clipped=0.0 2023-06-23 13:02:27,884 INFO [train.py:996] (0/4) Epoch 9, batch 4600, loss[loss=0.2302, simple_loss=0.3028, pruned_loss=0.07876, over 21744.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3288, pruned_loss=0.08708, over 4270970.90 frames. ], batch size: 247, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:02:58,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.09 vs. limit=22.5 2023-06-23 13:03:00,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1491462.0, ans=0.125 2023-06-23 13:03:00,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1491462.0, ans=0.0 2023-06-23 13:03:59,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1491582.0, ans=0.1 2023-06-23 13:04:01,951 INFO [train.py:996] (0/4) Epoch 9, batch 4650, loss[loss=0.1836, simple_loss=0.2517, pruned_loss=0.05774, over 21262.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3223, pruned_loss=0.08566, over 4277269.88 frames. ], batch size: 143, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:04:29,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1491702.0, ans=0.125 2023-06-23 13:05:16,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.483e+02 4.910e+02 6.100e+02 8.336e+02 1.525e+03, threshold=1.220e+03, percent-clipped=3.0 2023-06-23 13:05:18,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1491882.0, ans=0.125 2023-06-23 13:05:34,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1491942.0, ans=0.04949747468305833 2023-06-23 13:05:35,276 INFO [train.py:996] (0/4) Epoch 9, batch 4700, loss[loss=0.2248, simple_loss=0.2853, pruned_loss=0.08214, over 21708.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3118, pruned_loss=0.08255, over 4277076.63 frames. 
], batch size: 333, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:06:28,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1492062.0, ans=0.0 2023-06-23 13:06:35,317 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.74 vs. limit=5.0 2023-06-23 13:06:55,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1492182.0, ans=0.0 2023-06-23 13:06:56,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1492182.0, ans=0.09899494936611666 2023-06-23 13:07:13,921 INFO [train.py:996] (0/4) Epoch 9, batch 4750, loss[loss=0.193, simple_loss=0.2622, pruned_loss=0.06191, over 21651.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3055, pruned_loss=0.08208, over 4275313.89 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:07:23,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1492242.0, ans=10.0 2023-06-23 13:07:23,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1492242.0, ans=0.125 2023-06-23 13:08:17,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1492422.0, ans=0.0 2023-06-23 13:08:34,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.263e+02 4.594e+02 6.434e+02 8.978e+02 1.748e+03, threshold=1.287e+03, percent-clipped=12.0 2023-06-23 13:08:46,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1492482.0, ans=0.125 2023-06-23 13:08:50,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1492482.0, ans=0.125 2023-06-23 13:08:51,991 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-23 13:08:54,125 INFO [train.py:996] (0/4) Epoch 9, batch 4800, loss[loss=0.2362, simple_loss=0.3138, pruned_loss=0.07929, over 21454.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.305, pruned_loss=0.08195, over 4283017.05 frames. ], batch size: 194, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:09:08,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1492602.0, ans=0.0 2023-06-23 13:09:19,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1492602.0, ans=0.125 2023-06-23 13:09:20,484 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.95 vs. 
limit=22.5 2023-06-23 13:09:59,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1492722.0, ans=0.0 2023-06-23 13:10:00,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1492722.0, ans=10.0 2023-06-23 13:10:16,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1492782.0, ans=0.125 2023-06-23 13:10:32,164 INFO [train.py:996] (0/4) Epoch 9, batch 4850, loss[loss=0.2966, simple_loss=0.3532, pruned_loss=0.12, over 21739.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3053, pruned_loss=0.082, over 4276572.83 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:10:50,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1492902.0, ans=0.125 2023-06-23 13:11:52,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.403e+02 5.446e+02 6.914e+02 1.031e+03 2.241e+03, threshold=1.383e+03, percent-clipped=12.0 2023-06-23 13:11:55,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1493082.0, ans=0.125 2023-06-23 13:12:10,730 INFO [train.py:996] (0/4) Epoch 9, batch 4900, loss[loss=0.2256, simple_loss=0.3186, pruned_loss=0.06636, over 21587.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3053, pruned_loss=0.08228, over 4278650.99 frames. ], batch size: 230, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:12:20,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1493142.0, ans=0.125 2023-06-23 13:13:07,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.91 vs. limit=22.5 2023-06-23 13:13:50,341 INFO [train.py:996] (0/4) Epoch 9, batch 4950, loss[loss=0.2257, simple_loss=0.3214, pruned_loss=0.06498, over 21560.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3072, pruned_loss=0.07946, over 4273122.68 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:15:14,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-23 13:15:16,664 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.074e+02 4.885e+02 7.048e+02 1.090e+03 2.586e+03, threshold=1.410e+03, percent-clipped=12.0 2023-06-23 13:15:29,192 INFO [train.py:996] (0/4) Epoch 9, batch 5000, loss[loss=0.2588, simple_loss=0.3281, pruned_loss=0.09473, over 21850.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3075, pruned_loss=0.07616, over 4271772.47 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:15:51,997 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.65 vs. 
limit=15.0 2023-06-23 13:16:13,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1493862.0, ans=0.125 2023-06-23 13:16:15,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1493862.0, ans=0.0 2023-06-23 13:17:07,608 INFO [train.py:996] (0/4) Epoch 9, batch 5050, loss[loss=0.2302, simple_loss=0.2952, pruned_loss=0.08262, over 21349.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3081, pruned_loss=0.0784, over 4285392.81 frames. ], batch size: 176, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:17:08,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1494042.0, ans=0.05 2023-06-23 13:17:12,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1494042.0, ans=0.125 2023-06-23 13:17:19,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-23 13:17:53,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1494162.0, ans=0.125 2023-06-23 13:17:59,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1494162.0, ans=0.125 2023-06-23 13:17:59,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1494162.0, ans=0.125 2023-06-23 13:18:20,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1494222.0, ans=0.125 2023-06-23 13:18:20,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1494222.0, ans=0.125 2023-06-23 13:18:28,091 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.800e+02 5.012e+02 6.560e+02 1.020e+03 2.026e+03, threshold=1.312e+03, percent-clipped=12.0 2023-06-23 13:18:28,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1494282.0, ans=0.0 2023-06-23 13:18:44,895 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-23 13:18:45,162 INFO [train.py:996] (0/4) Epoch 9, batch 5100, loss[loss=0.2335, simple_loss=0.3064, pruned_loss=0.08032, over 21593.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.308, pruned_loss=0.07913, over 4283818.63 frames. 
], batch size: 471, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:18:53,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1494342.0, ans=0.0 2023-06-23 13:19:05,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1494402.0, ans=0.2 2023-06-23 13:19:08,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1494402.0, ans=0.035 2023-06-23 13:19:38,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1494462.0, ans=0.125 2023-06-23 13:19:41,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1494462.0, ans=0.125 2023-06-23 13:20:22,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1494582.0, ans=0.0 2023-06-23 13:20:25,860 INFO [train.py:996] (0/4) Epoch 9, batch 5150, loss[loss=0.2435, simple_loss=0.2984, pruned_loss=0.09431, over 21457.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3066, pruned_loss=0.07983, over 4292341.15 frames. ], batch size: 194, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:21:28,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1494822.0, ans=0.0 2023-06-23 13:21:41,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1494822.0, ans=0.125 2023-06-23 13:21:54,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.473e+02 5.169e+02 7.852e+02 1.261e+03 2.554e+03, threshold=1.570e+03, percent-clipped=23.0 2023-06-23 13:22:00,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1494882.0, ans=0.0 2023-06-23 13:22:07,048 INFO [train.py:996] (0/4) Epoch 9, batch 5200, loss[loss=0.2309, simple_loss=0.3088, pruned_loss=0.07652, over 21204.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3097, pruned_loss=0.08088, over 4291218.55 frames. ], batch size: 144, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:22:11,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1494942.0, ans=0.125 2023-06-23 13:22:25,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0 2023-06-23 13:22:39,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1495002.0, ans=0.5 2023-06-23 13:23:18,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1495122.0, ans=0.125 2023-06-23 13:23:23,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1495122.0, ans=0.1 2023-06-23 13:23:39,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1495182.0, ans=0.2 2023-06-23 13:23:45,425 INFO [train.py:996] (0/4) Epoch 9, batch 5250, loss[loss=0.2055, simple_loss=0.2903, pruned_loss=0.06042, over 21581.00 frames. 
], tot_loss[loss=0.236, simple_loss=0.3146, pruned_loss=0.07868, over 4275506.27 frames. ], batch size: 230, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:24:19,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1495302.0, ans=0.125 2023-06-23 13:25:10,475 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.544e+02 5.638e+02 7.794e+02 1.189e+03 2.542e+03, threshold=1.559e+03, percent-clipped=12.0 2023-06-23 13:25:20,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1495482.0, ans=0.0 2023-06-23 13:25:27,938 INFO [train.py:996] (0/4) Epoch 9, batch 5300, loss[loss=0.2444, simple_loss=0.3063, pruned_loss=0.09124, over 21687.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3158, pruned_loss=0.07991, over 4275161.36 frames. ], batch size: 230, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:26:22,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1495662.0, ans=0.125 2023-06-23 13:26:37,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1495722.0, ans=0.125 2023-06-23 13:26:46,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1495782.0, ans=0.5 2023-06-23 13:27:01,903 INFO [train.py:996] (0/4) Epoch 9, batch 5350, loss[loss=0.2383, simple_loss=0.308, pruned_loss=0.0843, over 21911.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3145, pruned_loss=0.08098, over 4279495.53 frames. ], batch size: 351, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:27:36,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-23 13:28:10,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1496022.0, ans=0.125 2023-06-23 13:28:15,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1496022.0, ans=0.2 2023-06-23 13:28:28,215 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.718e+02 6.687e+02 9.729e+02 1.336e+03 3.211e+03, threshold=1.946e+03, percent-clipped=15.0 2023-06-23 13:28:46,046 INFO [train.py:996] (0/4) Epoch 9, batch 5400, loss[loss=0.2747, simple_loss=0.3584, pruned_loss=0.09547, over 20981.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3144, pruned_loss=0.0822, over 4280271.11 frames. ], batch size: 607, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:28:49,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1496142.0, ans=0.0 2023-06-23 13:29:00,138 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.34 vs. limit=12.0 2023-06-23 13:29:01,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1496142.0, ans=0.2 2023-06-23 13:29:20,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1496202.0, ans=0.2 2023-06-23 13:29:22,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.29 vs. 
limit=12.0 2023-06-23 13:29:42,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1496262.0, ans=0.0 2023-06-23 13:30:30,284 INFO [train.py:996] (0/4) Epoch 9, batch 5450, loss[loss=0.2903, simple_loss=0.3761, pruned_loss=0.1022, over 21734.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3159, pruned_loss=0.08129, over 4281339.24 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:30:42,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1496442.0, ans=0.0 2023-06-23 13:30:50,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1496502.0, ans=0.2 2023-06-23 13:31:29,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1496622.0, ans=0.025 2023-06-23 13:31:46,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1496622.0, ans=0.0 2023-06-23 13:31:48,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1496682.0, ans=0.125 2023-06-23 13:31:53,806 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 5.097e+02 7.378e+02 1.209e+03 3.523e+03, threshold=1.476e+03, percent-clipped=4.0 2023-06-23 13:32:04,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-23 13:32:09,886 INFO [train.py:996] (0/4) Epoch 9, batch 5500, loss[loss=0.2506, simple_loss=0.3494, pruned_loss=0.07594, over 21641.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3217, pruned_loss=0.07862, over 4283614.78 frames. ], batch size: 389, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:32:30,371 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.594e-03 2023-06-23 13:32:52,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1496862.0, ans=0.125 2023-06-23 13:33:00,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1496862.0, ans=0.0 2023-06-23 13:33:21,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1496922.0, ans=0.125 2023-06-23 13:33:50,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-23 13:33:50,898 INFO [train.py:996] (0/4) Epoch 9, batch 5550, loss[loss=0.2066, simple_loss=0.2822, pruned_loss=0.06551, over 21297.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3172, pruned_loss=0.07513, over 4276681.79 frames. 
], batch size: 176, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:34:47,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1497162.0, ans=0.0 2023-06-23 13:35:02,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1497222.0, ans=0.1 2023-06-23 13:35:17,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.428e+02 4.782e+02 7.322e+02 1.097e+03 2.363e+03, threshold=1.464e+03, percent-clipped=11.0 2023-06-23 13:35:30,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1497282.0, ans=0.1 2023-06-23 13:35:38,005 INFO [train.py:996] (0/4) Epoch 9, batch 5600, loss[loss=0.2316, simple_loss=0.2867, pruned_loss=0.08828, over 20750.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.314, pruned_loss=0.07317, over 4274490.41 frames. ], batch size: 609, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:35:54,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1497402.0, ans=0.125 2023-06-23 13:36:13,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=15.0 2023-06-23 13:36:44,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1497522.0, ans=0.2 2023-06-23 13:37:10,937 INFO [train.py:996] (0/4) Epoch 9, batch 5650, loss[loss=0.2363, simple_loss=0.3078, pruned_loss=0.08238, over 21749.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3186, pruned_loss=0.07696, over 4281123.71 frames. ], batch size: 112, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:37:16,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1497642.0, ans=0.0 2023-06-23 13:37:43,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1497702.0, ans=0.125 2023-06-23 13:37:57,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1497762.0, ans=0.2 2023-06-23 13:38:29,815 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.16 vs. limit=6.0 2023-06-23 13:38:34,977 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.803e+02 5.906e+02 7.888e+02 1.290e+03 2.997e+03, threshold=1.578e+03, percent-clipped=20.0 2023-06-23 13:38:46,245 INFO [train.py:996] (0/4) Epoch 9, batch 5700, loss[loss=0.219, simple_loss=0.3179, pruned_loss=0.06007, over 21645.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3183, pruned_loss=0.07772, over 4281255.75 frames. ], batch size: 389, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:38:54,195 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5 2023-06-23 13:38:57,150 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-23 13:39:04,936 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.40 vs. 
limit=15.0 2023-06-23 13:39:59,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1498122.0, ans=0.125 2023-06-23 13:40:20,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1498182.0, ans=0.1 2023-06-23 13:40:31,248 INFO [train.py:996] (0/4) Epoch 9, batch 5750, loss[loss=0.176, simple_loss=0.2278, pruned_loss=0.06208, over 19997.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3134, pruned_loss=0.07509, over 4280439.88 frames. ], batch size: 703, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:40:35,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1498242.0, ans=0.09899494936611666 2023-06-23 13:41:39,648 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:41:52,312 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.170e+02 4.651e+02 7.542e+02 1.104e+03 3.145e+03, threshold=1.508e+03, percent-clipped=9.0 2023-06-23 13:41:52,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1498482.0, ans=0.1 2023-06-23 13:42:00,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1498482.0, ans=0.125 2023-06-23 13:42:06,760 INFO [train.py:996] (0/4) Epoch 9, batch 5800, loss[loss=0.222, simple_loss=0.3183, pruned_loss=0.06289, over 21685.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3135, pruned_loss=0.07404, over 4278748.96 frames. ], batch size: 263, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:42:36,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1498602.0, ans=0.2 2023-06-23 13:43:00,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1498662.0, ans=0.2 2023-06-23 13:43:03,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1498662.0, ans=0.125 2023-06-23 13:43:37,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1498782.0, ans=0.0 2023-06-23 13:43:42,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.99 vs. limit=8.0 2023-06-23 13:43:52,458 INFO [train.py:996] (0/4) Epoch 9, batch 5850, loss[loss=0.1656, simple_loss=0.2603, pruned_loss=0.03543, over 21345.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3101, pruned_loss=0.0689, over 4281816.18 frames. ], batch size: 194, lr: 3.35e-03, grad_scale: 8.0 2023-06-23 13:43:57,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1498842.0, ans=0.125 2023-06-23 13:43:58,245 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.21 vs. 
limit=15.0 2023-06-23 13:44:29,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1498902.0, ans=0.2 2023-06-23 13:45:06,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1499022.0, ans=0.125 2023-06-23 13:45:18,563 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.120e+02 4.169e+02 5.972e+02 8.890e+02 1.873e+03, threshold=1.194e+03, percent-clipped=6.0 2023-06-23 13:45:31,297 INFO [train.py:996] (0/4) Epoch 9, batch 5900, loss[loss=0.2203, simple_loss=0.2988, pruned_loss=0.07086, over 21882.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.3053, pruned_loss=0.06514, over 4288072.35 frames. ], batch size: 316, lr: 3.35e-03, grad_scale: 8.0 2023-06-23 13:45:46,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1499202.0, ans=0.125 2023-06-23 13:45:47,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2023-06-23 13:46:03,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-23 13:47:06,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1499382.0, ans=0.0 2023-06-23 13:47:10,309 INFO [train.py:996] (0/4) Epoch 9, batch 5950, loss[loss=0.1694, simple_loss=0.26, pruned_loss=0.03943, over 21303.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3048, pruned_loss=0.06762, over 4290041.72 frames. ], batch size: 176, lr: 3.35e-03, grad_scale: 8.0 2023-06-23 13:47:38,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1499502.0, ans=0.125 2023-06-23 13:47:41,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1499502.0, ans=0.1 2023-06-23 13:47:48,584 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.68 vs. limit=22.5 2023-06-23 13:48:19,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1499622.0, ans=0.125 2023-06-23 13:48:40,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.364e+02 5.592e+02 7.986e+02 1.183e+03 2.385e+03, threshold=1.597e+03, percent-clipped=25.0 2023-06-23 13:48:48,927 INFO [train.py:996] (0/4) Epoch 9, batch 6000, loss[loss=0.2128, simple_loss=0.2707, pruned_loss=0.07743, over 21493.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3004, pruned_loss=0.07078, over 4286503.26 frames. ], batch size: 212, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:48:48,928 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 13:49:10,222 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2648, simple_loss=0.3557, pruned_loss=0.08691, over 1796401.00 frames. 
2023-06-23 13:49:10,223 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 13:49:10,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1499742.0, ans=0.125 2023-06-23 13:50:18,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1499922.0, ans=0.0 2023-06-23 13:50:49,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1500042.0, ans=0.0 2023-06-23 13:50:51,010 INFO [train.py:996] (0/4) Epoch 9, batch 6050, loss[loss=0.2168, simple_loss=0.2824, pruned_loss=0.07558, over 21597.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2969, pruned_loss=0.07322, over 4271030.88 frames. ], batch size: 415, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:51:07,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1500102.0, ans=0.1 2023-06-23 13:51:57,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1500222.0, ans=0.125 2023-06-23 13:52:15,746 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.157e+02 5.140e+02 6.887e+02 9.775e+02 3.553e+03, threshold=1.377e+03, percent-clipped=5.0 2023-06-23 13:52:19,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1500282.0, ans=0.125 2023-06-23 13:52:28,832 INFO [train.py:996] (0/4) Epoch 9, batch 6100, loss[loss=0.2266, simple_loss=0.3016, pruned_loss=0.07582, over 21462.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2963, pruned_loss=0.0723, over 4277353.59 frames. ], batch size: 194, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:52:37,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1500342.0, ans=0.0 2023-06-23 13:52:43,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-23 13:52:44,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1500402.0, ans=15.0 2023-06-23 13:53:07,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1500462.0, ans=0.04949747468305833 2023-06-23 13:53:11,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1500462.0, ans=0.0 2023-06-23 13:53:35,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1500522.0, ans=0.0 2023-06-23 13:53:40,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1500582.0, ans=0.2 2023-06-23 13:54:01,359 INFO [train.py:996] (0/4) Epoch 9, batch 6150, loss[loss=0.2454, simple_loss=0.3033, pruned_loss=0.09374, over 15977.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2987, pruned_loss=0.07478, over 4276671.16 frames. 
], batch size: 62, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:54:07,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1500642.0, ans=0.125 2023-06-23 13:54:23,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1500702.0, ans=0.125 2023-06-23 13:54:35,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1500702.0, ans=0.2 2023-06-23 13:54:41,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1500762.0, ans=0.1 2023-06-23 13:55:02,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1500822.0, ans=0.125 2023-06-23 13:55:32,772 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.545e+02 5.300e+02 7.269e+02 1.178e+03 2.947e+03, threshold=1.454e+03, percent-clipped=13.0 2023-06-23 13:55:33,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1500882.0, ans=0.0 2023-06-23 13:55:45,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1500942.0, ans=0.1 2023-06-23 13:55:46,206 INFO [train.py:996] (0/4) Epoch 9, batch 6200, loss[loss=0.2497, simple_loss=0.3193, pruned_loss=0.09005, over 21374.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3014, pruned_loss=0.07521, over 4280539.82 frames. ], batch size: 176, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:56:09,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1501002.0, ans=0.125 2023-06-23 13:56:28,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1501062.0, ans=0.125 2023-06-23 13:56:30,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1501062.0, ans=0.125 2023-06-23 13:56:43,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1501122.0, ans=0.125 2023-06-23 13:57:25,631 INFO [train.py:996] (0/4) Epoch 9, batch 6250, loss[loss=0.2793, simple_loss=0.3738, pruned_loss=0.09238, over 21495.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3056, pruned_loss=0.07466, over 4277854.10 frames. 
], batch size: 507, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:57:26,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1501242.0, ans=0.125 2023-06-23 13:57:32,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1501242.0, ans=0.04949747468305833 2023-06-23 13:57:51,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1501302.0, ans=0.1 2023-06-23 13:58:56,452 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.492e+02 5.754e+02 9.579e+02 1.636e+03 2.645e+03, threshold=1.916e+03, percent-clipped=27.0 2023-06-23 13:59:03,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1501542.0, ans=0.125 2023-06-23 13:59:04,191 INFO [train.py:996] (0/4) Epoch 9, batch 6300, loss[loss=0.2227, simple_loss=0.2884, pruned_loss=0.07855, over 21461.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3083, pruned_loss=0.07357, over 4280284.82 frames. ], batch size: 194, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:59:05,219 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=22.5 2023-06-23 13:59:06,133 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:59:11,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-23 13:59:30,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1501602.0, ans=0.1 2023-06-23 13:59:32,501 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-23 13:59:38,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1501602.0, ans=0.0 2023-06-23 13:59:51,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1501662.0, ans=0.035 2023-06-23 14:00:49,583 INFO [train.py:996] (0/4) Epoch 9, batch 6350, loss[loss=0.2904, simple_loss=0.3596, pruned_loss=0.1106, over 21826.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3132, pruned_loss=0.07813, over 4286878.77 frames. ], batch size: 118, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:01:00,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1501842.0, ans=0.1 2023-06-23 14:01:03,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1501842.0, ans=0.0 2023-06-23 14:02:24,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.408e+02 6.300e+02 8.863e+02 1.224e+03 2.908e+03, threshold=1.773e+03, percent-clipped=5.0 2023-06-23 14:02:26,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1502082.0, ans=0.1 2023-06-23 14:02:32,232 INFO [train.py:996] (0/4) Epoch 9, batch 6400, loss[loss=0.253, simple_loss=0.3252, pruned_loss=0.09038, over 21374.00 frames. 
], tot_loss[loss=0.2425, simple_loss=0.32, pruned_loss=0.08253, over 4289947.28 frames. ], batch size: 548, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:02:37,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1502142.0, ans=0.125 2023-06-23 14:03:27,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1502262.0, ans=0.0 2023-06-23 14:04:05,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1502382.0, ans=0.0 2023-06-23 14:04:06,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1502382.0, ans=0.125 2023-06-23 14:04:10,691 INFO [train.py:996] (0/4) Epoch 9, batch 6450, loss[loss=0.1954, simple_loss=0.2728, pruned_loss=0.05905, over 21658.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3215, pruned_loss=0.08212, over 4288163.59 frames. ], batch size: 247, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:04:43,970 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.86 vs. limit=22.5 2023-06-23 14:05:28,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1502622.0, ans=0.125 2023-06-23 14:05:42,803 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.914e+02 5.544e+02 7.373e+02 1.174e+03 2.232e+03, threshold=1.475e+03, percent-clipped=4.0 2023-06-23 14:05:45,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1502682.0, ans=0.125 2023-06-23 14:05:51,142 INFO [train.py:996] (0/4) Epoch 9, batch 6500, loss[loss=0.2099, simple_loss=0.2672, pruned_loss=0.07627, over 21302.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3169, pruned_loss=0.08141, over 4276245.82 frames. ], batch size: 177, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:06:14,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1502802.0, ans=0.1 2023-06-23 14:07:12,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1502982.0, ans=0.125 2023-06-23 14:07:12,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1502982.0, ans=0.125 2023-06-23 14:07:35,328 INFO [train.py:996] (0/4) Epoch 9, batch 6550, loss[loss=0.2747, simple_loss=0.341, pruned_loss=0.1042, over 21754.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3154, pruned_loss=0.08062, over 4282065.47 frames. 
], batch size: 441, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:07:58,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1503102.0, ans=0.125 2023-06-23 14:08:02,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1503102.0, ans=0.0 2023-06-23 14:08:48,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1503222.0, ans=0.2 2023-06-23 14:09:02,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.667e+02 5.627e+02 7.547e+02 1.040e+03 2.189e+03, threshold=1.509e+03, percent-clipped=8.0 2023-06-23 14:09:15,141 INFO [train.py:996] (0/4) Epoch 9, batch 6600, loss[loss=0.2296, simple_loss=0.2833, pruned_loss=0.08789, over 21402.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3107, pruned_loss=0.08061, over 4272803.95 frames. ], batch size: 509, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:09:49,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1503402.0, ans=0.1 2023-06-23 14:09:50,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1503462.0, ans=0.125 2023-06-23 14:10:03,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1503462.0, ans=0.1 2023-06-23 14:10:05,744 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=22.5 2023-06-23 14:10:17,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1503522.0, ans=0.0 2023-06-23 14:10:35,571 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-23 14:10:55,771 INFO [train.py:996] (0/4) Epoch 9, batch 6650, loss[loss=0.1995, simple_loss=0.2791, pruned_loss=0.05997, over 21782.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3034, pruned_loss=0.07713, over 4262144.56 frames. ], batch size: 352, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:11:36,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1503762.0, ans=0.2 2023-06-23 14:12:29,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.782e+02 5.915e+02 8.562e+02 1.227e+03 3.234e+03, threshold=1.712e+03, percent-clipped=18.0 2023-06-23 14:12:36,202 INFO [train.py:996] (0/4) Epoch 9, batch 6700, loss[loss=0.2304, simple_loss=0.351, pruned_loss=0.05493, over 19780.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2998, pruned_loss=0.07578, over 4260988.10 frames. ], batch size: 702, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:12:39,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1503942.0, ans=0.0 2023-06-23 14:12:58,462 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. 
limit=10.0 2023-06-23 14:13:31,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1504122.0, ans=0.0 2023-06-23 14:14:14,348 INFO [train.py:996] (0/4) Epoch 9, batch 6750, loss[loss=0.2492, simple_loss=0.3143, pruned_loss=0.09204, over 21774.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2977, pruned_loss=0.07646, over 4267250.56 frames. ], batch size: 112, lr: 3.34e-03, grad_scale: 8.0 2023-06-23 14:14:26,535 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:14:43,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1504302.0, ans=0.0 2023-06-23 14:15:09,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1504362.0, ans=0.0 2023-06-23 14:15:25,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1504422.0, ans=0.125 2023-06-23 14:15:29,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1504422.0, ans=0.2 2023-06-23 14:15:36,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1504482.0, ans=0.04949747468305833 2023-06-23 14:15:48,512 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.896e+02 6.568e+02 9.733e+02 1.340e+03 2.605e+03, threshold=1.947e+03, percent-clipped=12.0 2023-06-23 14:15:50,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1504482.0, ans=0.125 2023-06-23 14:15:53,581 INFO [train.py:996] (0/4) Epoch 9, batch 6800, loss[loss=0.2782, simple_loss=0.3112, pruned_loss=0.1226, over 21434.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3029, pruned_loss=0.07916, over 4270457.45 frames. ], batch size: 508, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:16:51,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1504722.0, ans=0.05 2023-06-23 14:17:08,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1504722.0, ans=0.125 2023-06-23 14:17:32,355 INFO [train.py:996] (0/4) Epoch 9, batch 6850, loss[loss=0.224, simple_loss=0.2836, pruned_loss=0.0822, over 21745.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3001, pruned_loss=0.07928, over 4270587.72 frames. 
], batch size: 351, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:17:32,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1504842.0, ans=0.125 2023-06-23 14:17:45,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1504842.0, ans=0.07 2023-06-23 14:18:20,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1504962.0, ans=0.125 2023-06-23 14:18:27,851 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:19:07,213 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.795e+02 4.770e+02 6.261e+02 9.211e+02 1.923e+03, threshold=1.252e+03, percent-clipped=0.0 2023-06-23 14:19:12,191 INFO [train.py:996] (0/4) Epoch 9, batch 6900, loss[loss=0.2636, simple_loss=0.3249, pruned_loss=0.1011, over 21853.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2993, pruned_loss=0.07883, over 4281341.46 frames. ], batch size: 351, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:19:25,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1505142.0, ans=0.1 2023-06-23 14:19:36,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1505202.0, ans=0.125 2023-06-23 14:20:02,517 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:20:13,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1505322.0, ans=0.0 2023-06-23 14:20:44,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1505382.0, ans=0.125 2023-06-23 14:20:51,649 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-23 14:20:51,941 INFO [train.py:996] (0/4) Epoch 9, batch 6950, loss[loss=0.1996, simple_loss=0.2794, pruned_loss=0.05992, over 21282.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.299, pruned_loss=0.07582, over 4278493.71 frames. ], batch size: 176, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:21:45,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1505562.0, ans=0.07 2023-06-23 14:21:50,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1505562.0, ans=0.2 2023-06-23 14:21:52,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1505562.0, ans=0.0 2023-06-23 14:22:14,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1505682.0, ans=0.125 2023-06-23 14:22:26,509 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 5.568e+02 8.095e+02 1.122e+03 2.896e+03, threshold=1.619e+03, percent-clipped=20.0 2023-06-23 14:22:31,452 INFO [train.py:996] (0/4) Epoch 9, batch 7000, loss[loss=0.1997, simple_loss=0.2655, pruned_loss=0.06697, over 21557.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3011, pruned_loss=0.07792, over 4280326.92 frames. 
], batch size: 263, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:23:51,212 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.74 vs. limit=15.0 2023-06-23 14:24:16,410 INFO [train.py:996] (0/4) Epoch 9, batch 7050, loss[loss=0.2279, simple_loss=0.2976, pruned_loss=0.07909, over 21829.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2989, pruned_loss=0.07675, over 4269554.90 frames. ], batch size: 118, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:24:36,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1506102.0, ans=0.015 2023-06-23 14:25:05,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1506162.0, ans=0.125 2023-06-23 14:25:08,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1506162.0, ans=0.125 2023-06-23 14:25:50,184 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.514e+02 5.176e+02 7.948e+02 1.176e+03 2.286e+03, threshold=1.590e+03, percent-clipped=9.0 2023-06-23 14:25:55,068 INFO [train.py:996] (0/4) Epoch 9, batch 7100, loss[loss=0.2166, simple_loss=0.2973, pruned_loss=0.06791, over 21746.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3038, pruned_loss=0.07867, over 4276454.78 frames. ], batch size: 247, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:27:21,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1506582.0, ans=0.125 2023-06-23 14:27:26,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1506582.0, ans=0.125 2023-06-23 14:27:35,258 INFO [train.py:996] (0/4) Epoch 9, batch 7150, loss[loss=0.3214, simple_loss=0.3734, pruned_loss=0.1347, over 21345.00 frames. ], tot_loss[loss=0.228, simple_loss=0.302, pruned_loss=0.077, over 4269121.55 frames. ], batch size: 507, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:27:37,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1506642.0, ans=0.1 2023-06-23 14:27:45,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1506642.0, ans=0.5 2023-06-23 14:28:16,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1506762.0, ans=0.04949747468305833 2023-06-23 14:28:26,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1506762.0, ans=0.1 2023-06-23 14:28:58,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1506882.0, ans=0.1 2023-06-23 14:29:07,060 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:29:10,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.602e+02 5.826e+02 7.818e+02 1.087e+03 2.405e+03, threshold=1.564e+03, percent-clipped=10.0 2023-06-23 14:29:21,009 INFO [train.py:996] (0/4) Epoch 9, batch 7200, loss[loss=0.2023, simple_loss=0.2681, pruned_loss=0.06822, over 21567.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3044, pruned_loss=0.07896, over 4272424.61 frames. 
], batch size: 263, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:29:42,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1507002.0, ans=0.1 2023-06-23 14:30:00,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1507002.0, ans=0.0 2023-06-23 14:30:06,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1507062.0, ans=0.0 2023-06-23 14:30:59,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1507242.0, ans=0.125 2023-06-23 14:31:00,727 INFO [train.py:996] (0/4) Epoch 9, batch 7250, loss[loss=0.2083, simple_loss=0.2806, pruned_loss=0.06804, over 21769.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2984, pruned_loss=0.07808, over 4275067.48 frames. ], batch size: 118, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:31:04,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=12.0 2023-06-23 14:31:13,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1507242.0, ans=0.2 2023-06-23 14:31:28,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1507302.0, ans=0.2 2023-06-23 14:31:41,963 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=12.0 2023-06-23 14:32:24,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1507482.0, ans=0.0 2023-06-23 14:32:37,018 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.651e+02 4.870e+02 5.610e+02 7.177e+02 1.494e+03, threshold=1.122e+03, percent-clipped=0.0 2023-06-23 14:32:39,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1507542.0, ans=0.1 2023-06-23 14:32:44,948 INFO [train.py:996] (0/4) Epoch 9, batch 7300, loss[loss=0.2051, simple_loss=0.2647, pruned_loss=0.07274, over 21752.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2933, pruned_loss=0.07781, over 4268773.88 frames. ], batch size: 300, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:33:28,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1507662.0, ans=0.125 2023-06-23 14:33:43,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1507722.0, ans=6.0 2023-06-23 14:33:48,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1507722.0, ans=15.0 2023-06-23 14:34:25,797 INFO [train.py:996] (0/4) Epoch 9, batch 7350, loss[loss=0.2578, simple_loss=0.3119, pruned_loss=0.1018, over 21855.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.293, pruned_loss=0.07964, over 4269769.64 frames. 
], batch size: 98, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:34:26,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1507842.0, ans=0.0 2023-06-23 14:35:18,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1507962.0, ans=0.2 2023-06-23 14:35:21,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-23 14:35:21,881 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-23 14:36:02,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.713e+02 6.185e+02 8.335e+02 1.224e+03 2.285e+03, threshold=1.667e+03, percent-clipped=37.0 2023-06-23 14:36:06,158 INFO [train.py:996] (0/4) Epoch 9, batch 7400, loss[loss=0.2447, simple_loss=0.3416, pruned_loss=0.07391, over 21572.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2986, pruned_loss=0.08044, over 4272675.96 frames. ], batch size: 441, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:36:18,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1508142.0, ans=0.125 2023-06-23 14:36:37,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1508202.0, ans=0.07 2023-06-23 14:36:50,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1508262.0, ans=0.125 2023-06-23 14:37:13,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1508322.0, ans=0.2 2023-06-23 14:37:35,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-23 14:37:36,212 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:37:39,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1508382.0, ans=0.125 2023-06-23 14:37:46,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.29 vs. limit=10.0 2023-06-23 14:37:47,571 INFO [train.py:996] (0/4) Epoch 9, batch 7450, loss[loss=0.253, simple_loss=0.3356, pruned_loss=0.08522, over 19966.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.298, pruned_loss=0.07974, over 4266315.66 frames. ], batch size: 703, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:37:49,561 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:38:05,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.71 vs. 
limit=12.0 2023-06-23 14:38:09,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1508502.0, ans=0.0 2023-06-23 14:39:26,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.876e+02 5.448e+02 8.462e+02 1.438e+03 2.608e+03, threshold=1.692e+03, percent-clipped=12.0 2023-06-23 14:39:35,283 INFO [train.py:996] (0/4) Epoch 9, batch 7500, loss[loss=0.2261, simple_loss=0.2914, pruned_loss=0.08042, over 21436.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3052, pruned_loss=0.08109, over 4267146.72 frames. ], batch size: 211, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:40:00,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.92 vs. limit=15.0 2023-06-23 14:40:26,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1508862.0, ans=0.125 2023-06-23 14:40:28,605 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:40:57,269 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-23 14:41:16,346 INFO [train.py:996] (0/4) Epoch 9, batch 7550, loss[loss=0.21, simple_loss=0.2917, pruned_loss=0.06412, over 21421.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3104, pruned_loss=0.07954, over 4266892.40 frames. ], batch size: 194, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:41:45,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1509102.0, ans=0.0 2023-06-23 14:41:50,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1509102.0, ans=0.125 2023-06-23 14:42:35,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1509282.0, ans=0.0 2023-06-23 14:42:52,866 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.575e+02 5.410e+02 7.103e+02 1.048e+03 2.085e+03, threshold=1.421e+03, percent-clipped=3.0 2023-06-23 14:42:56,199 INFO [train.py:996] (0/4) Epoch 9, batch 7600, loss[loss=0.2148, simple_loss=0.2774, pruned_loss=0.07604, over 21318.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3085, pruned_loss=0.07804, over 4268326.12 frames. 
], batch size: 176, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:42:58,612 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:43:01,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1509342.0, ans=0.2 2023-06-23 14:43:31,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1509402.0, ans=0.1 2023-06-23 14:43:38,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1509462.0, ans=0.0 2023-06-23 14:43:51,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1509462.0, ans=0.0 2023-06-23 14:44:34,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1509582.0, ans=0.1 2023-06-23 14:44:37,249 INFO [train.py:996] (0/4) Epoch 9, batch 7650, loss[loss=0.2829, simple_loss=0.3263, pruned_loss=0.1197, over 21811.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3075, pruned_loss=0.07929, over 4272727.74 frames. ], batch size: 508, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:44:57,304 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0 2023-06-23 14:45:04,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1509702.0, ans=0.0 2023-06-23 14:45:05,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-23 14:45:49,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1509822.0, ans=0.1 2023-06-23 14:46:15,411 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.771e+02 5.554e+02 6.899e+02 1.039e+03 2.407e+03, threshold=1.380e+03, percent-clipped=12.0 2023-06-23 14:46:18,631 INFO [train.py:996] (0/4) Epoch 9, batch 7700, loss[loss=0.2817, simple_loss=0.3484, pruned_loss=0.1075, over 21478.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3114, pruned_loss=0.08286, over 4280568.19 frames. ], batch size: 194, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:46:22,500 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:46:35,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1509942.0, ans=0.0 2023-06-23 14:47:19,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0 2023-06-23 14:47:43,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1510182.0, ans=0.0 2023-06-23 14:48:05,192 INFO [train.py:996] (0/4) Epoch 9, batch 7750, loss[loss=0.2264, simple_loss=0.2798, pruned_loss=0.08652, over 20894.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3191, pruned_loss=0.08436, over 4276752.76 frames. 
], batch size: 608, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:48:18,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.36 vs. limit=15.0 2023-06-23 14:48:35,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1510302.0, ans=0.125 2023-06-23 14:48:43,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1510302.0, ans=0.125 2023-06-23 14:48:58,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1510362.0, ans=0.0 2023-06-23 14:49:43,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1510482.0, ans=0.125 2023-06-23 14:49:44,545 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.753e+02 6.023e+02 8.792e+02 1.462e+03 2.647e+03, threshold=1.758e+03, percent-clipped=26.0 2023-06-23 14:49:46,186 INFO [train.py:996] (0/4) Epoch 9, batch 7800, loss[loss=0.1934, simple_loss=0.2373, pruned_loss=0.07475, over 21859.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3191, pruned_loss=0.08433, over 4277399.18 frames. ], batch size: 98, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:49:50,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=22.5 2023-06-23 14:50:23,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1510602.0, ans=0.2 2023-06-23 14:51:13,979 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5 2023-06-23 14:51:22,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1510782.0, ans=0.125 2023-06-23 14:51:25,391 INFO [train.py:996] (0/4) Epoch 9, batch 7850, loss[loss=0.218, simple_loss=0.2792, pruned_loss=0.07842, over 21228.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3101, pruned_loss=0.08293, over 4257784.11 frames. ], batch size: 177, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:52:07,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1510962.0, ans=0.2 2023-06-23 14:52:21,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1510962.0, ans=0.0 2023-06-23 14:53:05,533 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.699e+02 5.401e+02 8.491e+02 1.335e+03 3.211e+03, threshold=1.698e+03, percent-clipped=14.0 2023-06-23 14:53:07,031 INFO [train.py:996] (0/4) Epoch 9, batch 7900, loss[loss=0.1997, simple_loss=0.2605, pruned_loss=0.06946, over 21428.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3079, pruned_loss=0.08243, over 4262605.90 frames. 
], batch size: 212, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:53:24,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1511142.0, ans=0.07 2023-06-23 14:54:05,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1511262.0, ans=0.0 2023-06-23 14:54:49,001 INFO [train.py:996] (0/4) Epoch 9, batch 7950, loss[loss=0.3176, simple_loss=0.3865, pruned_loss=0.1244, over 21611.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3105, pruned_loss=0.08182, over 4260675.49 frames. ], batch size: 507, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:55:56,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-23 14:56:12,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=22.5 2023-06-23 14:56:13,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1511682.0, ans=0.125 2023-06-23 14:56:40,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.215e+02 6.156e+02 9.056e+02 1.636e+03 2.892e+03, threshold=1.811e+03, percent-clipped=22.0 2023-06-23 14:56:41,921 INFO [train.py:996] (0/4) Epoch 9, batch 8000, loss[loss=0.229, simple_loss=0.3158, pruned_loss=0.07107, over 21783.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3147, pruned_loss=0.08263, over 4253349.98 frames. ], batch size: 282, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:56:54,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1511742.0, ans=0.125 2023-06-23 14:57:06,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.51 vs. limit=8.0 2023-06-23 14:58:13,463 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-252000.pt 2023-06-23 14:58:32,841 INFO [train.py:996] (0/4) Epoch 9, batch 8050, loss[loss=0.217, simple_loss=0.2818, pruned_loss=0.07612, over 21449.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3191, pruned_loss=0.08341, over 4252271.48 frames. ], batch size: 194, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:59:07,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1512102.0, ans=0.05 2023-06-23 14:59:10,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1512162.0, ans=0.0 2023-06-23 14:59:12,161 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=12.0 2023-06-23 14:59:33,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1512222.0, ans=0.125 2023-06-23 15:00:10,916 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.139e+02 6.350e+02 8.777e+02 1.241e+03 2.449e+03, threshold=1.755e+03, percent-clipped=9.0 2023-06-23 15:00:12,609 INFO [train.py:996] (0/4) Epoch 9, batch 8100, loss[loss=0.2387, simple_loss=0.3135, pruned_loss=0.08193, over 21557.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3168, pruned_loss=0.08433, over 4261341.14 frames. 
], batch size: 131, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:00:18,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1512342.0, ans=0.0 2023-06-23 15:00:42,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1512402.0, ans=0.125 2023-06-23 15:01:57,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1512582.0, ans=0.2 2023-06-23 15:02:01,808 INFO [train.py:996] (0/4) Epoch 9, batch 8150, loss[loss=0.1955, simple_loss=0.2673, pruned_loss=0.06188, over 21247.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3239, pruned_loss=0.08524, over 4262822.86 frames. ], batch size: 159, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:02:36,399 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-23 15:02:58,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1512822.0, ans=0.2 2023-06-23 15:03:40,782 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.771e+02 6.551e+02 1.054e+03 1.725e+03 4.751e+03, threshold=2.109e+03, percent-clipped=24.0 2023-06-23 15:03:40,813 INFO [train.py:996] (0/4) Epoch 9, batch 8200, loss[loss=0.2048, simple_loss=0.2689, pruned_loss=0.07034, over 21321.00 frames. ], tot_loss[loss=0.241, simple_loss=0.316, pruned_loss=0.08297, over 4249470.76 frames. ], batch size: 551, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:04:44,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1513122.0, ans=0.125 2023-06-23 15:05:08,592 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:05:22,556 INFO [train.py:996] (0/4) Epoch 9, batch 8250, loss[loss=0.2859, simple_loss=0.3702, pruned_loss=0.1008, over 21619.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3168, pruned_loss=0.08426, over 4251921.83 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:07:04,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.470e+02 6.606e+02 8.935e+02 1.467e+03 2.616e+03, threshold=1.787e+03, percent-clipped=8.0 2023-06-23 15:07:04,508 INFO [train.py:996] (0/4) Epoch 9, batch 8300, loss[loss=0.2136, simple_loss=0.2886, pruned_loss=0.06932, over 21438.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3147, pruned_loss=0.08185, over 4255615.66 frames. ], batch size: 195, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:07:36,320 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-23 15:08:22,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1513722.0, ans=0.125 2023-06-23 15:08:49,812 INFO [train.py:996] (0/4) Epoch 9, batch 8350, loss[loss=0.2055, simple_loss=0.2879, pruned_loss=0.06149, over 21801.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3141, pruned_loss=0.08084, over 4252333.87 frames. 
], batch size: 317, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:09:01,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1513842.0, ans=0.2 2023-06-23 15:09:11,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1513902.0, ans=0.125 2023-06-23 15:09:23,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1513902.0, ans=0.125 2023-06-23 15:10:30,573 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.376e+02 4.465e+02 5.586e+02 8.616e+02 2.675e+03, threshold=1.117e+03, percent-clipped=3.0 2023-06-23 15:10:30,605 INFO [train.py:996] (0/4) Epoch 9, batch 8400, loss[loss=0.1979, simple_loss=0.2785, pruned_loss=0.05858, over 21276.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3102, pruned_loss=0.07791, over 4253834.56 frames. ], batch size: 176, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:10:40,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1514142.0, ans=0.1 2023-06-23 15:11:24,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1514262.0, ans=0.0 2023-06-23 15:11:50,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1514382.0, ans=0.125 2023-06-23 15:11:54,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-06-23 15:12:09,848 INFO [train.py:996] (0/4) Epoch 9, batch 8450, loss[loss=0.2273, simple_loss=0.2989, pruned_loss=0.07779, over 21862.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3094, pruned_loss=0.07757, over 4260954.47 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:12:31,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1514502.0, ans=0.125 2023-06-23 15:12:47,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1514562.0, ans=0.125 2023-06-23 15:12:49,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1514562.0, ans=0.125 2023-06-23 15:13:01,582 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0 2023-06-23 15:13:05,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1514622.0, ans=0.125 2023-06-23 15:13:49,166 INFO [train.py:996] (0/4) Epoch 9, batch 8500, loss[loss=0.2012, simple_loss=0.2539, pruned_loss=0.0743, over 21255.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3067, pruned_loss=0.07854, over 4246792.36 frames. 
], batch size: 548, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:13:50,609 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.348e+02 5.860e+02 7.972e+02 1.284e+03 3.475e+03, threshold=1.594e+03, percent-clipped=30.0 2023-06-23 15:14:01,054 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:14:08,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1514802.0, ans=0.125 2023-06-23 15:14:21,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-23 15:15:07,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1514982.0, ans=0.2 2023-06-23 15:15:09,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1514982.0, ans=15.0 2023-06-23 15:15:12,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1514982.0, ans=0.2 2023-06-23 15:15:25,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1514982.0, ans=0.0 2023-06-23 15:15:26,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1514982.0, ans=0.125 2023-06-23 15:15:29,030 INFO [train.py:996] (0/4) Epoch 9, batch 8550, loss[loss=0.2802, simple_loss=0.369, pruned_loss=0.09576, over 21830.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3113, pruned_loss=0.08099, over 4257629.17 frames. ], batch size: 316, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:15:55,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1515102.0, ans=0.0 2023-06-23 15:15:56,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1515102.0, ans=0.125 2023-06-23 15:16:32,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1515222.0, ans=0.1 2023-06-23 15:17:12,371 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=12.0 2023-06-23 15:17:16,090 INFO [train.py:996] (0/4) Epoch 9, batch 8600, loss[loss=0.2604, simple_loss=0.3373, pruned_loss=0.09177, over 21416.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3191, pruned_loss=0.08376, over 4268687.31 frames. ], batch size: 131, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:17:17,702 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.589e+02 6.156e+02 8.850e+02 1.190e+03 2.823e+03, threshold=1.770e+03, percent-clipped=15.0 2023-06-23 15:18:14,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1515522.0, ans=0.125 2023-06-23 15:18:31,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1515522.0, ans=0.04949747468305833 2023-06-23 15:18:53,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. 
limit=6.0 2023-06-23 15:18:58,338 INFO [train.py:996] (0/4) Epoch 9, batch 8650, loss[loss=0.1774, simple_loss=0.2619, pruned_loss=0.0465, over 21162.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3249, pruned_loss=0.08368, over 4269635.27 frames. ], batch size: 143, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:19:00,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1515642.0, ans=0.1 2023-06-23 15:19:00,679 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-23 15:20:37,574 INFO [train.py:996] (0/4) Epoch 9, batch 8700, loss[loss=0.2066, simple_loss=0.2711, pruned_loss=0.07111, over 21311.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3159, pruned_loss=0.07931, over 4265122.52 frames. ], batch size: 160, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:20:39,033 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 5.219e+02 7.580e+02 1.289e+03 2.063e+03, threshold=1.516e+03, percent-clipped=5.0 2023-06-23 15:20:42,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1515942.0, ans=0.2 2023-06-23 15:20:50,015 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=12.0 2023-06-23 15:21:41,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1516122.0, ans=0.2 2023-06-23 15:22:16,400 INFO [train.py:996] (0/4) Epoch 9, batch 8750, loss[loss=0.2384, simple_loss=0.3065, pruned_loss=0.08519, over 21379.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3114, pruned_loss=0.08019, over 4272555.82 frames. ], batch size: 144, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:23:06,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1516362.0, ans=0.0 2023-06-23 15:23:33,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1516422.0, ans=0.05 2023-06-23 15:23:59,323 INFO [train.py:996] (0/4) Epoch 9, batch 8800, loss[loss=0.3013, simple_loss=0.375, pruned_loss=0.1138, over 21759.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3196, pruned_loss=0.08282, over 4274522.70 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:24:00,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.75 vs. 
limit=6.0 2023-06-23 15:24:00,936 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.692e+02 5.630e+02 7.362e+02 1.054e+03 2.858e+03, threshold=1.472e+03, percent-clipped=8.0 2023-06-23 15:25:18,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1516722.0, ans=0.125 2023-06-23 15:25:31,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1516782.0, ans=0.125 2023-06-23 15:25:37,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1516842.0, ans=0.2 2023-06-23 15:25:43,312 INFO [train.py:996] (0/4) Epoch 9, batch 8850, loss[loss=0.2492, simple_loss=0.3386, pruned_loss=0.07987, over 21298.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3282, pruned_loss=0.08627, over 4270583.42 frames. ], batch size: 548, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:26:46,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1517022.0, ans=0.0 2023-06-23 15:27:23,389 INFO [train.py:996] (0/4) Epoch 9, batch 8900, loss[loss=0.2658, simple_loss=0.3558, pruned_loss=0.08792, over 21433.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3231, pruned_loss=0.08425, over 4256123.56 frames. ], batch size: 471, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:27:30,224 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.986e+02 5.861e+02 8.789e+02 1.394e+03 2.613e+03, threshold=1.758e+03, percent-clipped=19.0 2023-06-23 15:27:45,150 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-23 15:28:06,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1517262.0, ans=0.1 2023-06-23 15:28:24,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1517262.0, ans=0.1 2023-06-23 15:29:10,583 INFO [train.py:996] (0/4) Epoch 9, batch 8950, loss[loss=0.1935, simple_loss=0.2551, pruned_loss=0.06591, over 21258.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3224, pruned_loss=0.08358, over 4255031.77 frames. ], batch size: 176, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:29:55,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1517562.0, ans=0.125 2023-06-23 15:29:55,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1517562.0, ans=0.125 2023-06-23 15:30:14,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1517622.0, ans=0.0 2023-06-23 15:30:15,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=22.5 2023-06-23 15:30:17,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.99 vs. 
limit=15.0 2023-06-23 15:30:20,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1517622.0, ans=0.0 2023-06-23 15:30:46,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1517682.0, ans=0.125 2023-06-23 15:30:49,424 INFO [train.py:996] (0/4) Epoch 9, batch 9000, loss[loss=0.2204, simple_loss=0.2726, pruned_loss=0.08405, over 21352.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.316, pruned_loss=0.08292, over 4251803.98 frames. ], batch size: 211, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:30:49,425 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 15:31:06,586 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.258, simple_loss=0.3541, pruned_loss=0.08091, over 1796401.00 frames. 2023-06-23 15:31:06,587 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 15:31:08,196 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.792e+02 6.929e+02 1.126e+03 1.882e+03 3.988e+03, threshold=2.252e+03, percent-clipped=24.0 2023-06-23 15:31:25,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1517742.0, ans=0.1 2023-06-23 15:32:06,881 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.09 vs. limit=22.5 2023-06-23 15:32:14,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1517922.0, ans=0.125 2023-06-23 15:32:53,773 INFO [train.py:996] (0/4) Epoch 9, batch 9050, loss[loss=0.3061, simple_loss=0.3758, pruned_loss=0.1181, over 21830.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3137, pruned_loss=0.07989, over 4252054.64 frames. ], batch size: 118, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:33:56,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1518222.0, ans=0.125 2023-06-23 15:34:32,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1518282.0, ans=0.2 2023-06-23 15:34:39,773 INFO [train.py:996] (0/4) Epoch 9, batch 9100, loss[loss=0.2203, simple_loss=0.3073, pruned_loss=0.06661, over 21745.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.317, pruned_loss=0.08101, over 4256393.31 frames. ], batch size: 298, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:34:42,882 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.536e+02 5.248e+02 7.167e+02 1.150e+03 2.223e+03, threshold=1.433e+03, percent-clipped=0.0 2023-06-23 15:34:53,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1518342.0, ans=0.015 2023-06-23 15:35:46,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1518522.0, ans=0.1 2023-06-23 15:36:21,762 INFO [train.py:996] (0/4) Epoch 9, batch 9150, loss[loss=0.2388, simple_loss=0.3322, pruned_loss=0.07266, over 21758.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3189, pruned_loss=0.07888, over 4258282.75 frames. 
], batch size: 332, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:36:30,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1518642.0, ans=0.125 2023-06-23 15:37:50,779 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-06-23 15:37:57,539 INFO [train.py:996] (0/4) Epoch 9, batch 9200, loss[loss=0.3234, simple_loss=0.3829, pruned_loss=0.1319, over 21453.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3187, pruned_loss=0.0774, over 4265020.78 frames. ], batch size: 471, lr: 3.32e-03, grad_scale: 32.0 2023-06-23 15:38:01,473 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 6.542e+02 9.064e+02 1.359e+03 2.938e+03, threshold=1.813e+03, percent-clipped=21.0 2023-06-23 15:38:01,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1518942.0, ans=0.07 2023-06-23 15:38:31,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1519002.0, ans=0.1 2023-06-23 15:38:39,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1519062.0, ans=0.1 2023-06-23 15:38:42,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1519062.0, ans=0.0 2023-06-23 15:39:30,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1519182.0, ans=0.125 2023-06-23 15:39:33,635 INFO [train.py:996] (0/4) Epoch 9, batch 9250, loss[loss=0.2065, simple_loss=0.2806, pruned_loss=0.06625, over 21770.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3237, pruned_loss=0.08106, over 4261339.02 frames. ], batch size: 102, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:40:05,637 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-06-23 15:40:24,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1519362.0, ans=0.1 2023-06-23 15:40:42,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1519422.0, ans=0.2 2023-06-23 15:41:00,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1519482.0, ans=0.125 2023-06-23 15:41:03,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1519482.0, ans=0.125 2023-06-23 15:41:16,035 INFO [train.py:996] (0/4) Epoch 9, batch 9300, loss[loss=0.2053, simple_loss=0.2669, pruned_loss=0.07185, over 21194.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3189, pruned_loss=0.08164, over 4251918.30 frames. 
], batch size: 176, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:41:20,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.633e+02 6.631e+02 9.639e+02 1.652e+03 4.303e+03, threshold=1.928e+03, percent-clipped=19.0 2023-06-23 15:41:29,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1519542.0, ans=0.0 2023-06-23 15:41:30,280 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.92 vs. limit=15.0 2023-06-23 15:41:36,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1519542.0, ans=0.125 2023-06-23 15:41:57,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1519662.0, ans=0.125 2023-06-23 15:42:04,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1519662.0, ans=0.0 2023-06-23 15:43:03,539 INFO [train.py:996] (0/4) Epoch 9, batch 9350, loss[loss=0.2556, simple_loss=0.3387, pruned_loss=0.08619, over 21711.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3252, pruned_loss=0.08293, over 4258282.68 frames. ], batch size: 332, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:43:22,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1519842.0, ans=0.125 2023-06-23 15:43:24,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1519902.0, ans=0.0 2023-06-23 15:44:11,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1520022.0, ans=0.0 2023-06-23 15:44:18,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1520022.0, ans=0.125 2023-06-23 15:44:25,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1520082.0, ans=0.1 2023-06-23 15:44:50,888 INFO [train.py:996] (0/4) Epoch 9, batch 9400, loss[loss=0.2563, simple_loss=0.3127, pruned_loss=0.09998, over 21498.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3248, pruned_loss=0.08336, over 4263251.37 frames. ], batch size: 441, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:44:57,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.099e+02 5.147e+02 6.319e+02 1.049e+03 2.062e+03, threshold=1.264e+03, percent-clipped=1.0 2023-06-23 15:45:07,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0 2023-06-23 15:45:22,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1520202.0, ans=0.125 2023-06-23 15:46:30,675 INFO [train.py:996] (0/4) Epoch 9, batch 9450, loss[loss=0.2521, simple_loss=0.3459, pruned_loss=0.07916, over 20723.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.317, pruned_loss=0.08153, over 4256400.04 frames. 
], batch size: 607, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:46:55,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1520502.0, ans=0.0 2023-06-23 15:47:04,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1520502.0, ans=0.05 2023-06-23 15:47:38,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1520622.0, ans=0.0 2023-06-23 15:48:02,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1520682.0, ans=0.125 2023-06-23 15:48:06,761 INFO [train.py:996] (0/4) Epoch 9, batch 9500, loss[loss=0.2278, simple_loss=0.3026, pruned_loss=0.0765, over 21376.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3096, pruned_loss=0.07926, over 4257985.64 frames. ], batch size: 194, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:48:13,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.451e+02 6.669e+02 1.059e+03 1.542e+03 2.765e+03, threshold=2.119e+03, percent-clipped=38.0 2023-06-23 15:48:53,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1520862.0, ans=0.125 2023-06-23 15:48:58,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1520862.0, ans=0.1 2023-06-23 15:49:01,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1520862.0, ans=0.2 2023-06-23 15:49:06,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1520922.0, ans=0.125 2023-06-23 15:49:35,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1520982.0, ans=0.125 2023-06-23 15:49:37,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1520982.0, ans=0.025 2023-06-23 15:49:48,343 INFO [train.py:996] (0/4) Epoch 9, batch 9550, loss[loss=0.2568, simple_loss=0.3518, pruned_loss=0.08095, over 21801.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3141, pruned_loss=0.08184, over 4264003.65 frames. ], batch size: 282, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:50:02,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1521102.0, ans=0.0 2023-06-23 15:50:27,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1521102.0, ans=0.0 2023-06-23 15:50:33,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1521162.0, ans=0.125 2023-06-23 15:51:04,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1521222.0, ans=0.035 2023-06-23 15:51:28,290 INFO [train.py:996] (0/4) Epoch 9, batch 9600, loss[loss=0.2356, simple_loss=0.309, pruned_loss=0.08109, over 21915.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.316, pruned_loss=0.08338, over 4270160.22 frames. 
], batch size: 316, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:51:35,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.691e+02 5.650e+02 7.031e+02 8.940e+02 1.543e+03, threshold=1.406e+03, percent-clipped=0.0 2023-06-23 15:51:42,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1521342.0, ans=0.125 2023-06-23 15:51:47,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1521402.0, ans=0.1 2023-06-23 15:51:57,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1521402.0, ans=0.125 2023-06-23 15:52:41,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1521522.0, ans=0.0 2023-06-23 15:52:43,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1521522.0, ans=0.1 2023-06-23 15:53:03,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1521582.0, ans=0.0 2023-06-23 15:53:09,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1521642.0, ans=0.125 2023-06-23 15:53:10,514 INFO [train.py:996] (0/4) Epoch 9, batch 9650, loss[loss=0.2658, simple_loss=0.3921, pruned_loss=0.06976, over 20761.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3154, pruned_loss=0.08244, over 4277572.82 frames. ], batch size: 607, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:53:14,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1521642.0, ans=0.125 2023-06-23 15:54:03,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1521762.0, ans=0.125 2023-06-23 15:54:23,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-23 15:54:51,611 INFO [train.py:996] (0/4) Epoch 9, batch 9700, loss[loss=0.2566, simple_loss=0.3223, pruned_loss=0.09546, over 16353.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3164, pruned_loss=0.08228, over 4268886.49 frames. ], batch size: 60, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:55:00,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1521942.0, ans=0.0 2023-06-23 15:55:02,835 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.621e+02 5.546e+02 7.387e+02 1.131e+03 2.841e+03, threshold=1.477e+03, percent-clipped=15.0 2023-06-23 15:55:23,633 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.49 vs. 
limit=15.0 2023-06-23 15:55:59,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522122.0, ans=0.1 2023-06-23 15:56:10,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1522122.0, ans=0.125 2023-06-23 15:56:20,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1522182.0, ans=0.125 2023-06-23 15:56:32,931 INFO [train.py:996] (0/4) Epoch 9, batch 9750, loss[loss=0.2547, simple_loss=0.2959, pruned_loss=0.1067, over 21380.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3112, pruned_loss=0.08133, over 4259507.24 frames. ], batch size: 508, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:57:13,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1522362.0, ans=0.125 2023-06-23 15:57:21,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1522362.0, ans=0.0 2023-06-23 15:57:40,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1522422.0, ans=0.2 2023-06-23 15:57:50,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1522482.0, ans=0.025 2023-06-23 15:58:11,394 INFO [train.py:996] (0/4) Epoch 9, batch 9800, loss[loss=0.2483, simple_loss=0.3121, pruned_loss=0.09226, over 21906.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3128, pruned_loss=0.08244, over 4257237.00 frames. ], batch size: 118, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:58:18,333 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.522e+02 5.907e+02 7.792e+02 1.093e+03 2.144e+03, threshold=1.558e+03, percent-clipped=9.0 2023-06-23 15:58:43,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522602.0, ans=0.1 2023-06-23 15:59:05,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522662.0, ans=0.1 2023-06-23 15:59:23,055 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:59:39,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1522782.0, ans=0.1 2023-06-23 15:59:40,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1522782.0, ans=0.0 2023-06-23 15:59:49,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. limit=10.0 2023-06-23 15:59:53,363 INFO [train.py:996] (0/4) Epoch 9, batch 9850, loss[loss=0.2393, simple_loss=0.3065, pruned_loss=0.086, over 20184.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3098, pruned_loss=0.08206, over 4253880.53 frames. ], batch size: 707, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:01:34,838 INFO [train.py:996] (0/4) Epoch 9, batch 9900, loss[loss=0.2027, simple_loss=0.2809, pruned_loss=0.06227, over 15365.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3086, pruned_loss=0.08145, over 4243213.57 frames. 
], batch size: 60, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:01:45,575 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.718e+02 5.699e+02 7.870e+02 1.232e+03 3.104e+03, threshold=1.574e+03, percent-clipped=11.0 2023-06-23 16:01:59,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-23 16:02:10,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1523202.0, ans=0.125 2023-06-23 16:02:54,654 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.08 vs. limit=10.0 2023-06-23 16:03:15,838 INFO [train.py:996] (0/4) Epoch 9, batch 9950, loss[loss=0.2998, simple_loss=0.3388, pruned_loss=0.1304, over 21393.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3105, pruned_loss=0.08381, over 4237835.40 frames. ], batch size: 509, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:03:34,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1523442.0, ans=0.125 2023-06-23 16:04:22,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1523622.0, ans=0.2 2023-06-23 16:05:02,527 INFO [train.py:996] (0/4) Epoch 9, batch 10000, loss[loss=0.2176, simple_loss=0.2817, pruned_loss=0.07672, over 21434.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3048, pruned_loss=0.08269, over 4248894.38 frames. ], batch size: 131, lr: 3.32e-03, grad_scale: 32.0 2023-06-23 16:05:14,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.772e+02 5.460e+02 7.211e+02 1.053e+03 2.107e+03, threshold=1.442e+03, percent-clipped=5.0 2023-06-23 16:05:35,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1523802.0, ans=0.125 2023-06-23 16:06:19,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1523922.0, ans=0.2 2023-06-23 16:06:30,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1523982.0, ans=0.125 2023-06-23 16:06:40,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1523982.0, ans=0.2 2023-06-23 16:06:50,050 INFO [train.py:996] (0/4) Epoch 9, batch 10050, loss[loss=0.1794, simple_loss=0.2556, pruned_loss=0.05156, over 21376.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3079, pruned_loss=0.08329, over 4252024.35 frames. ], batch size: 211, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:06:50,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1524042.0, ans=0.95 2023-06-23 16:06:57,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1524042.0, ans=0.1 2023-06-23 16:07:11,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.77 vs. 
limit=15.0 2023-06-23 16:07:55,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1524222.0, ans=0.125 2023-06-23 16:08:03,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1524222.0, ans=0.125 2023-06-23 16:08:33,196 INFO [train.py:996] (0/4) Epoch 9, batch 10100, loss[loss=0.2476, simple_loss=0.3298, pruned_loss=0.08272, over 21642.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3067, pruned_loss=0.08122, over 4256453.48 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:08:41,447 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.724e+02 5.845e+02 8.901e+02 1.389e+03 2.930e+03, threshold=1.780e+03, percent-clipped=23.0 2023-06-23 16:09:33,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1524522.0, ans=0.0 2023-06-23 16:09:35,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-23 16:10:01,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1524582.0, ans=10.0 2023-06-23 16:10:08,499 INFO [train.py:996] (0/4) Epoch 9, batch 10150, loss[loss=0.2201, simple_loss=0.3057, pruned_loss=0.0673, over 21704.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3116, pruned_loss=0.08352, over 4264040.58 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:10:35,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1524702.0, ans=0.125 2023-06-23 16:10:54,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1524762.0, ans=0.125 2023-06-23 16:11:46,012 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:11:48,563 INFO [train.py:996] (0/4) Epoch 9, batch 10200, loss[loss=0.2081, simple_loss=0.2963, pruned_loss=0.05995, over 21650.00 frames. ], tot_loss[loss=0.234, simple_loss=0.308, pruned_loss=0.07999, over 4259285.16 frames. 
], batch size: 247, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:11:55,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1524942.0, ans=0.125 2023-06-23 16:12:03,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.213e+02 5.208e+02 7.016e+02 1.136e+03 3.363e+03, threshold=1.403e+03, percent-clipped=6.0 2023-06-23 16:12:08,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1525002.0, ans=0.1 2023-06-23 16:12:08,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1525002.0, ans=0.1 2023-06-23 16:12:44,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1525062.0, ans=0.1 2023-06-23 16:12:52,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1525122.0, ans=0.1 2023-06-23 16:13:24,901 INFO [train.py:996] (0/4) Epoch 9, batch 10250, loss[loss=0.2844, simple_loss=0.3581, pruned_loss=0.1053, over 21348.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3046, pruned_loss=0.07534, over 4262635.86 frames. ], batch size: 507, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:13:26,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-23 16:13:56,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1525302.0, ans=0.125 2023-06-23 16:15:13,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1525542.0, ans=0.2 2023-06-23 16:15:14,148 INFO [train.py:996] (0/4) Epoch 9, batch 10300, loss[loss=0.2428, simple_loss=0.3126, pruned_loss=0.08647, over 21303.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3077, pruned_loss=0.07656, over 4258303.77 frames. ], batch size: 176, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:15:23,731 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.44 vs. limit=22.5 2023-06-23 16:15:24,057 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.918e+02 5.852e+02 8.943e+02 1.203e+03 2.933e+03, threshold=1.789e+03, percent-clipped=17.0 2023-06-23 16:16:43,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-23 16:16:56,877 INFO [train.py:996] (0/4) Epoch 9, batch 10350, loss[loss=0.2568, simple_loss=0.3343, pruned_loss=0.08969, over 21461.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3106, pruned_loss=0.07653, over 4269471.47 frames. 
], batch size: 471, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:17:28,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1525902.0, ans=0.125 2023-06-23 16:17:52,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1525962.0, ans=0.1 2023-06-23 16:18:15,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1526022.0, ans=0.125 2023-06-23 16:18:45,382 INFO [train.py:996] (0/4) Epoch 9, batch 10400, loss[loss=0.211, simple_loss=0.2811, pruned_loss=0.07043, over 21712.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3049, pruned_loss=0.0758, over 4262546.76 frames. ], batch size: 298, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:18:55,290 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.672e+02 5.585e+02 9.781e+02 1.543e+03 3.065e+03, threshold=1.956e+03, percent-clipped=20.0 2023-06-23 16:19:30,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1526262.0, ans=0.0 2023-06-23 16:20:28,491 INFO [train.py:996] (0/4) Epoch 9, batch 10450, loss[loss=0.3771, simple_loss=0.433, pruned_loss=0.1606, over 21379.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3094, pruned_loss=0.07935, over 4262500.54 frames. ], batch size: 507, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:20:41,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-23 16:20:51,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1526442.0, ans=0.025 2023-06-23 16:21:24,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1526562.0, ans=0.125 2023-06-23 16:21:42,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1526622.0, ans=0.125 2023-06-23 16:22:01,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1526682.0, ans=0.5 2023-06-23 16:22:09,078 INFO [train.py:996] (0/4) Epoch 9, batch 10500, loss[loss=0.2022, simple_loss=0.2729, pruned_loss=0.06575, over 21664.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.308, pruned_loss=0.0779, over 4269137.54 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:22:23,399 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.525e+02 6.343e+02 8.149e+02 1.174e+03 2.736e+03, threshold=1.630e+03, percent-clipped=6.0 2023-06-23 16:22:29,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1526742.0, ans=0.05 2023-06-23 16:23:31,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1526982.0, ans=0.2 2023-06-23 16:23:53,915 INFO [train.py:996] (0/4) Epoch 9, batch 10550, loss[loss=0.1909, simple_loss=0.2508, pruned_loss=0.06552, over 21868.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3025, pruned_loss=0.07738, over 4267345.19 frames. 
], batch size: 98, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:24:08,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1527042.0, ans=0.125 2023-06-23 16:24:40,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1527162.0, ans=0.2 2023-06-23 16:24:40,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1527162.0, ans=0.0 2023-06-23 16:24:57,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2023-06-23 16:25:35,500 INFO [train.py:996] (0/4) Epoch 9, batch 10600, loss[loss=0.1917, simple_loss=0.2871, pruned_loss=0.04814, over 21716.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2988, pruned_loss=0.07606, over 4276189.45 frames. ], batch size: 298, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:25:50,378 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.855e+02 5.123e+02 6.754e+02 9.468e+02 2.113e+03, threshold=1.351e+03, percent-clipped=4.0 2023-06-23 16:25:52,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1527342.0, ans=0.125 2023-06-23 16:26:12,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-06-23 16:26:34,704 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2023-06-23 16:26:50,172 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-06-23 16:26:51,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1527522.0, ans=0.0 2023-06-23 16:27:22,979 INFO [train.py:996] (0/4) Epoch 9, batch 10650, loss[loss=0.1896, simple_loss=0.2754, pruned_loss=0.05192, over 21713.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.301, pruned_loss=0.07543, over 4253504.64 frames. ], batch size: 351, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:27:27,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1527642.0, ans=0.125 2023-06-23 16:27:33,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1527642.0, ans=0.0 2023-06-23 16:27:52,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1527702.0, ans=0.0 2023-06-23 16:28:27,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1527822.0, ans=0.2 2023-06-23 16:29:02,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1527942.0, ans=0.0 2023-06-23 16:29:03,477 INFO [train.py:996] (0/4) Epoch 9, batch 10700, loss[loss=0.2746, simple_loss=0.3486, pruned_loss=0.1003, over 21764.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2992, pruned_loss=0.07548, over 4259031.58 frames. 
], batch size: 441, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:29:12,820 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.925e+02 6.514e+02 1.117e+03 1.445e+03 3.043e+03, threshold=2.235e+03, percent-clipped=29.0 2023-06-23 16:29:44,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-23 16:30:07,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1528122.0, ans=0.0 2023-06-23 16:30:47,163 INFO [train.py:996] (0/4) Epoch 9, batch 10750, loss[loss=0.2295, simple_loss=0.3067, pruned_loss=0.07619, over 21421.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3115, pruned_loss=0.08058, over 4269438.96 frames. ], batch size: 194, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:30:50,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1528242.0, ans=0.0 2023-06-23 16:31:02,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-23 16:32:18,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1528482.0, ans=0.2 2023-06-23 16:32:33,893 INFO [train.py:996] (0/4) Epoch 9, batch 10800, loss[loss=0.2715, simple_loss=0.3382, pruned_loss=0.1024, over 21788.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3159, pruned_loss=0.08158, over 4266049.60 frames. ], batch size: 247, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 16:32:37,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-06-23 16:32:40,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1528542.0, ans=0.0 2023-06-23 16:32:43,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.788e+02 5.066e+02 7.349e+02 1.067e+03 2.269e+03, threshold=1.470e+03, percent-clipped=1.0 2023-06-23 16:32:44,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1528542.0, ans=22.5 2023-06-23 16:33:20,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1528662.0, ans=0.125 2023-06-23 16:33:42,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1528722.0, ans=0.125 2023-06-23 16:34:10,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1528782.0, ans=0.0 2023-06-23 16:34:14,725 INFO [train.py:996] (0/4) Epoch 9, batch 10850, loss[loss=0.2259, simple_loss=0.2923, pruned_loss=0.07977, over 21548.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3161, pruned_loss=0.08097, over 4265925.96 frames. ], batch size: 391, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:34:59,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1528962.0, ans=0.125 2023-06-23 16:35:15,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. 
limit=15.0 2023-06-23 16:35:21,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1529022.0, ans=0.04949747468305833 2023-06-23 16:35:56,333 INFO [train.py:996] (0/4) Epoch 9, batch 10900, loss[loss=0.2308, simple_loss=0.3125, pruned_loss=0.0745, over 21808.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3107, pruned_loss=0.07894, over 4267752.64 frames. ], batch size: 371, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:36:12,640 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.519e+02 5.159e+02 7.524e+02 1.150e+03 2.135e+03, threshold=1.505e+03, percent-clipped=11.0 2023-06-23 16:36:16,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1529202.0, ans=0.125 2023-06-23 16:36:35,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1529202.0, ans=0.0 2023-06-23 16:36:51,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1529262.0, ans=0.125 2023-06-23 16:37:36,171 INFO [train.py:996] (0/4) Epoch 9, batch 10950, loss[loss=0.2284, simple_loss=0.2831, pruned_loss=0.08684, over 21310.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3046, pruned_loss=0.077, over 4259862.49 frames. ], batch size: 144, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:37:36,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1529442.0, ans=0.125 2023-06-23 16:37:54,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1529442.0, ans=0.2 2023-06-23 16:37:55,822 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:38:24,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1529562.0, ans=0.125 2023-06-23 16:38:25,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1529562.0, ans=0.1 2023-06-23 16:38:44,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1529622.0, ans=0.125 2023-06-23 16:39:07,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1529682.0, ans=0.1 2023-06-23 16:39:16,279 INFO [train.py:996] (0/4) Epoch 9, batch 11000, loss[loss=0.2344, simple_loss=0.3061, pruned_loss=0.08135, over 21471.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3033, pruned_loss=0.07794, over 4261571.48 frames. ], batch size: 131, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:39:32,536 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.640e+02 5.350e+02 8.050e+02 1.212e+03 3.028e+03, threshold=1.610e+03, percent-clipped=11.0 2023-06-23 16:39:52,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1529802.0, ans=0.07 2023-06-23 16:40:54,215 INFO [train.py:996] (0/4) Epoch 9, batch 11050, loss[loss=0.2135, simple_loss=0.2708, pruned_loss=0.07808, over 21272.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3005, pruned_loss=0.07931, over 4270588.63 frames. 
], batch size: 176, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:41:09,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1530042.0, ans=0.125 2023-06-23 16:41:29,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.86 vs. limit=15.0 2023-06-23 16:41:37,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1530162.0, ans=0.0 2023-06-23 16:41:40,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1530162.0, ans=0.05 2023-06-23 16:42:38,866 INFO [train.py:996] (0/4) Epoch 9, batch 11100, loss[loss=0.252, simple_loss=0.3188, pruned_loss=0.09265, over 21362.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.299, pruned_loss=0.07921, over 4268047.35 frames. ], batch size: 211, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:42:39,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1530342.0, ans=0.07 2023-06-23 16:42:51,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.448e+02 5.087e+02 6.616e+02 8.877e+02 2.244e+03, threshold=1.323e+03, percent-clipped=5.0 2023-06-23 16:43:12,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1530402.0, ans=0.05 2023-06-23 16:44:18,624 INFO [train.py:996] (0/4) Epoch 9, batch 11150, loss[loss=0.2205, simple_loss=0.286, pruned_loss=0.07755, over 21531.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2972, pruned_loss=0.07891, over 4268952.81 frames. ], batch size: 441, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:44:19,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1530642.0, ans=0.125 2023-06-23 16:44:20,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1530642.0, ans=0.125 2023-06-23 16:45:33,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1530822.0, ans=0.2 2023-06-23 16:45:41,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1530882.0, ans=0.1 2023-06-23 16:45:42,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1530882.0, ans=0.125 2023-06-23 16:45:58,175 INFO [train.py:996] (0/4) Epoch 9, batch 11200, loss[loss=0.1866, simple_loss=0.2519, pruned_loss=0.06062, over 21473.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2974, pruned_loss=0.07806, over 4273865.89 frames. ], batch size: 230, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:45:59,240 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.34 vs. 
limit=12.0 2023-06-23 16:46:03,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1530942.0, ans=0.0 2023-06-23 16:46:10,783 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.780e+02 5.536e+02 7.546e+02 1.213e+03 2.221e+03, threshold=1.509e+03, percent-clipped=19.0 2023-06-23 16:46:43,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1531062.0, ans=0.125 2023-06-23 16:47:32,763 INFO [train.py:996] (0/4) Epoch 9, batch 11250, loss[loss=0.2964, simple_loss=0.3438, pruned_loss=0.1245, over 21659.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2965, pruned_loss=0.07787, over 4273724.78 frames. ], batch size: 508, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:47:47,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1531242.0, ans=0.015 2023-06-23 16:48:26,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1531362.0, ans=0.125 2023-06-23 16:48:30,923 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=12.0 2023-06-23 16:48:32,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-23 16:48:38,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1531422.0, ans=0.035 2023-06-23 16:49:11,440 INFO [train.py:996] (0/4) Epoch 9, batch 11300, loss[loss=0.1961, simple_loss=0.2844, pruned_loss=0.05392, over 21811.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2977, pruned_loss=0.07817, over 4284818.16 frames. ], batch size: 351, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:49:28,321 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.911e+02 5.299e+02 7.050e+02 1.034e+03 1.810e+03, threshold=1.410e+03, percent-clipped=1.0 2023-06-23 16:49:46,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1531602.0, ans=0.125 2023-06-23 16:50:26,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1531722.0, ans=0.0 2023-06-23 16:50:56,828 INFO [train.py:996] (0/4) Epoch 9, batch 11350, loss[loss=0.2341, simple_loss=0.3058, pruned_loss=0.08117, over 21172.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3004, pruned_loss=0.07783, over 4280977.18 frames. ], batch size: 143, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:51:09,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-23 16:51:48,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1531962.0, ans=0.0 2023-06-23 16:52:11,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=22.5 2023-06-23 16:52:36,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.91 vs. 
limit=5.0 2023-06-23 16:52:39,066 INFO [train.py:996] (0/4) Epoch 9, batch 11400, loss[loss=0.2518, simple_loss=0.3252, pruned_loss=0.08926, over 21448.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3056, pruned_loss=0.07974, over 4279112.52 frames. ], batch size: 211, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:52:56,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.822e+02 6.591e+02 8.859e+02 1.390e+03 3.018e+03, threshold=1.772e+03, percent-clipped=23.0 2023-06-23 16:53:00,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1532202.0, ans=0.125 2023-06-23 16:53:37,327 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-06-23 16:53:56,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1532322.0, ans=0.04949747468305833 2023-06-23 16:53:57,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1532382.0, ans=0.125 2023-06-23 16:54:10,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.29 vs. limit=22.5 2023-06-23 16:54:20,222 INFO [train.py:996] (0/4) Epoch 9, batch 11450, loss[loss=0.2749, simple_loss=0.3452, pruned_loss=0.1023, over 21705.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3069, pruned_loss=0.07872, over 4281499.90 frames. ], batch size: 351, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:54:35,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1532442.0, ans=0.125 2023-06-23 16:55:00,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1532502.0, ans=0.125 2023-06-23 16:56:02,594 INFO [train.py:996] (0/4) Epoch 9, batch 11500, loss[loss=0.2397, simple_loss=0.3374, pruned_loss=0.07105, over 21630.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3108, pruned_loss=0.0798, over 4285761.85 frames. ], batch size: 414, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:56:19,893 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.487e+02 5.481e+02 7.380e+02 1.202e+03 2.850e+03, threshold=1.476e+03, percent-clipped=9.0 2023-06-23 16:56:25,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1532802.0, ans=0.125 2023-06-23 16:56:43,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1532862.0, ans=0.125 2023-06-23 16:57:04,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1532922.0, ans=0.0 2023-06-23 16:57:49,167 INFO [train.py:996] (0/4) Epoch 9, batch 11550, loss[loss=0.267, simple_loss=0.3599, pruned_loss=0.08709, over 21628.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3166, pruned_loss=0.07962, over 4286356.25 frames. 
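The recurring "Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..." entries summarise the distribution of recent gradient norms, the clipping threshold in force, and how often batches exceeded it. The sketch below shows one plausible way to keep such statistics; the window size and the use of a scaled median as the threshold are assumptions, not the optimizer's exact rule.

from collections import deque
import torch

class GradNormTracker:
    def __init__(self, clipping_scale: float = 2.0, window: int = 200):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)   # recent per-batch gradient norms
        self.num_clipped = 0

    def update(self, parameters) -> float:
        # Total gradient norm for this batch.
        norm = torch.norm(torch.stack(
            [p.grad.detach().norm() for p in parameters if p.grad is not None]))
        self.norms.append(norm.item())
        # Assumed rule: clip against a multiple of the median recent norm.
        s = sorted(self.norms)
        threshold = self.clipping_scale * s[len(s) // 2]
        if norm.item() > threshold:
            self.num_clipped += 1
        return threshold

    def quartiles(self):
        # Five summary values, analogous to the quartiles printed in the log.
        s = sorted(self.norms)
        idx = [0, len(s) // 4, len(s) // 2, (3 * len(s)) // 4, len(s) - 1]
        return [s[i] for i in idx]
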
], batch size: 263, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:57:51,475 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:57:54,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1533042.0, ans=0.0 2023-06-23 16:57:56,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1533042.0, ans=0.025 2023-06-23 16:58:36,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1533162.0, ans=0.0 2023-06-23 16:58:50,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1533162.0, ans=0.125 2023-06-23 16:59:01,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1533222.0, ans=0.125 2023-06-23 16:59:31,706 INFO [train.py:996] (0/4) Epoch 9, batch 11600, loss[loss=0.2492, simple_loss=0.3475, pruned_loss=0.07541, over 21693.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3283, pruned_loss=0.08117, over 4281701.65 frames. ], batch size: 263, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 16:59:34,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-23 16:59:50,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.192e+02 7.053e+02 9.279e+02 1.499e+03 3.190e+03, threshold=1.856e+03, percent-clipped=25.0 2023-06-23 16:59:58,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.63 vs. limit=12.0 2023-06-23 17:00:10,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1533462.0, ans=0.125 2023-06-23 17:00:47,209 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.546e-03 2023-06-23 17:01:12,933 INFO [train.py:996] (0/4) Epoch 9, batch 11650, loss[loss=0.2382, simple_loss=0.3236, pruned_loss=0.07638, over 21426.00 frames. ], tot_loss[loss=0.251, simple_loss=0.336, pruned_loss=0.08301, over 4275968.78 frames. ], batch size: 211, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:02:35,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1533882.0, ans=0.1 2023-06-23 17:02:52,860 INFO [train.py:996] (0/4) Epoch 9, batch 11700, loss[loss=0.2088, simple_loss=0.2761, pruned_loss=0.07074, over 21581.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3262, pruned_loss=0.08174, over 4277205.95 frames. 
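Most "ScheduledFloat: name=..., batch_count=..., ans=..." entries report the current value of a hyper-parameter (a dropout probability, skip rate, or balancer limit) that is scheduled against the number of training batches. A piecewise-linear schedule keyed on batch_count, as sketched below with made-up breakpoints, is one simple way to realise that behaviour.

class PiecewiseLinearSchedule:
    """Maps a batch count to a float by interpolating between breakpoints."""
    def __init__(self, *points):
        # points: (batch_count, value) pairs in increasing batch_count order.
        self.points = list(points)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

# e.g. a dropout probability that decays from 0.3 to 0.1 over the first 20k batches
dropout_p = PiecewiseLinearSchedule((0, 0.3), (20000, 0.1))
print(dropout_p(1529022))   # past the last breakpoint, so 0.1
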
], batch size: 263, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:03:06,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1533942.0, ans=0.07 2023-06-23 17:03:10,686 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.713e+02 7.514e+02 1.058e+03 1.633e+03 4.255e+03, threshold=2.116e+03, percent-clipped=16.0 2023-06-23 17:03:19,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1534002.0, ans=0.0 2023-06-23 17:03:19,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-23 17:03:32,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1534062.0, ans=0.125 2023-06-23 17:03:37,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1534062.0, ans=0.2 2023-06-23 17:03:37,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1534062.0, ans=0.1 2023-06-23 17:04:31,766 INFO [train.py:996] (0/4) Epoch 9, batch 11750, loss[loss=0.2536, simple_loss=0.3171, pruned_loss=0.09506, over 21269.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3173, pruned_loss=0.08154, over 4272862.86 frames. ], batch size: 176, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:05:21,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1534362.0, ans=0.1 2023-06-23 17:05:37,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1534422.0, ans=0.2 2023-06-23 17:06:17,874 INFO [train.py:996] (0/4) Epoch 9, batch 11800, loss[loss=0.2229, simple_loss=0.3211, pruned_loss=0.06233, over 20736.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3189, pruned_loss=0.08383, over 4271858.80 frames. ], batch size: 607, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:06:32,027 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.688e+02 5.572e+02 8.368e+02 1.434e+03 3.192e+03, threshold=1.674e+03, percent-clipped=11.0 2023-06-23 17:06:44,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1534602.0, ans=0.125 2023-06-23 17:07:35,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1534782.0, ans=0.2 2023-06-23 17:07:58,040 INFO [train.py:996] (0/4) Epoch 9, batch 11850, loss[loss=0.2365, simple_loss=0.3287, pruned_loss=0.07219, over 21437.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.32, pruned_loss=0.08349, over 4276078.29 frames. 
], batch size: 548, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:08:00,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1534842.0, ans=0.2 2023-06-23 17:08:21,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1534902.0, ans=0.125 2023-06-23 17:09:11,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1535082.0, ans=0.125 2023-06-23 17:09:39,314 INFO [train.py:996] (0/4) Epoch 9, batch 11900, loss[loss=0.2704, simple_loss=0.3429, pruned_loss=0.0989, over 21378.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3199, pruned_loss=0.08048, over 4276211.52 frames. ], batch size: 471, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:09:59,008 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.827e+02 5.472e+02 7.234e+02 9.480e+02 2.463e+03, threshold=1.447e+03, percent-clipped=1.0 2023-06-23 17:10:33,278 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-23 17:11:10,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1535382.0, ans=0.125 2023-06-23 17:11:15,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1535382.0, ans=0.0 2023-06-23 17:11:26,105 INFO [train.py:996] (0/4) Epoch 9, batch 11950, loss[loss=0.2094, simple_loss=0.3016, pruned_loss=0.05866, over 21722.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3214, pruned_loss=0.07871, over 4272078.78 frames. ], batch size: 351, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:11:28,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1535442.0, ans=0.125 2023-06-23 17:12:13,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1535562.0, ans=0.0 2023-06-23 17:12:40,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1535622.0, ans=0.125 2023-06-23 17:13:03,664 INFO [train.py:996] (0/4) Epoch 9, batch 12000, loss[loss=0.2036, simple_loss=0.267, pruned_loss=0.07009, over 21198.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3158, pruned_loss=0.07695, over 4275856.75 frames. ], batch size: 548, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:13:03,665 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 17:13:24,466 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2567, simple_loss=0.3528, pruned_loss=0.08029, over 1796401.00 frames. 2023-06-23 17:13:24,467 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 17:13:38,399 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.383e+02 5.788e+02 7.844e+02 1.305e+03 3.845e+03, threshold=1.569e+03, percent-clipped=19.0 2023-06-23 17:14:42,178 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-256000.pt 2023-06-23 17:15:03,719 INFO [train.py:996] (0/4) Epoch 9, batch 12050, loss[loss=0.2446, simple_loss=0.3061, pruned_loss=0.09158, over 21290.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3113, pruned_loss=0.07855, over 4283203.75 frames. 
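At batch 12000 the log shows the periodic validation pass ("Computing validation loss", a "validation:" summary, the running peak of allocated CUDA memory) followed by a checkpoint save into the experiment directory. A compact sketch of that sequence is given below; the helper names, the dev-set loader, and the checkpoint contents are illustrative assumptions.

import logging
import torch

def validate_and_checkpoint(model, dev_loader, compute_loss, ckpt_path):
    logging.info("Computing validation loss")
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch)  # assumed helper
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    logging.info(f"validation: loss={tot_loss / tot_frames:.4g}, "
                 f"over {tot_frames:.2f} frames.")
    logging.info(f"Maximum memory allocated so far is "
                 f"{torch.cuda.max_memory_allocated() // (1024 * 1024)}MB")
    model.train()
    logging.info(f"Saving checkpoint to {ckpt_path}")
    torch.save({"model": model.state_dict()}, ckpt_path)
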
], batch size: 159, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:15:28,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1536102.0, ans=0.015 2023-06-23 17:15:35,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-06-23 17:16:00,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-23 17:16:37,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1536282.0, ans=0.0 2023-06-23 17:16:45,373 INFO [train.py:996] (0/4) Epoch 9, batch 12100, loss[loss=0.2352, simple_loss=0.3134, pruned_loss=0.07848, over 21707.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3169, pruned_loss=0.08285, over 4283498.05 frames. ], batch size: 298, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:16:46,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1536342.0, ans=0.125 2023-06-23 17:16:58,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1536342.0, ans=0.2 2023-06-23 17:17:01,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.886e+02 6.749e+02 9.796e+02 1.461e+03 3.096e+03, threshold=1.959e+03, percent-clipped=20.0 2023-06-23 17:17:09,238 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.93 vs. limit=22.5 2023-06-23 17:17:56,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1536522.0, ans=0.125 2023-06-23 17:18:31,936 INFO [train.py:996] (0/4) Epoch 9, batch 12150, loss[loss=0.1928, simple_loss=0.2474, pruned_loss=0.06907, over 20855.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3218, pruned_loss=0.08248, over 4268461.28 frames. ], batch size: 611, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:18:32,305 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:18:42,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1536642.0, ans=0.05 2023-06-23 17:19:05,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.51 vs. 
limit=15.0 2023-06-23 17:19:35,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1536762.0, ans=0.0 2023-06-23 17:19:40,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1536822.0, ans=0.0 2023-06-23 17:19:43,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1536822.0, ans=0.125 2023-06-23 17:19:43,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1536822.0, ans=0.125 2023-06-23 17:19:48,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1536822.0, ans=0.125 2023-06-23 17:20:11,287 INFO [train.py:996] (0/4) Epoch 9, batch 12200, loss[loss=0.2132, simple_loss=0.2745, pruned_loss=0.07598, over 21177.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3163, pruned_loss=0.08156, over 4271893.03 frames. ], batch size: 160, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:20:32,019 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.485e+02 6.911e+02 1.120e+03 1.509e+03 3.105e+03, threshold=2.240e+03, percent-clipped=12.0 2023-06-23 17:21:17,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1537122.0, ans=0.0 2023-06-23 17:21:18,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1537122.0, ans=0.125 2023-06-23 17:21:39,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1537182.0, ans=0.0 2023-06-23 17:21:40,714 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.87 vs. limit=5.0 2023-06-23 17:21:41,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1537182.0, ans=0.125 2023-06-23 17:21:45,651 INFO [train.py:996] (0/4) Epoch 9, batch 12250, loss[loss=0.171, simple_loss=0.249, pruned_loss=0.04646, over 21275.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3075, pruned_loss=0.07827, over 4275168.91 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:21:52,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1537242.0, ans=0.1 2023-06-23 17:21:54,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1537242.0, ans=0.2 2023-06-23 17:22:37,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1537362.0, ans=0.1 2023-06-23 17:22:41,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1537362.0, ans=0.0 2023-06-23 17:23:24,465 INFO [train.py:996] (0/4) Epoch 9, batch 12300, loss[loss=0.1763, simple_loss=0.2544, pruned_loss=0.04907, over 21288.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2998, pruned_loss=0.07215, over 4281856.41 frames. 
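"Whitening: name=..., num_groups=..., num_channels=..., metric=X vs. limit=Y" entries compare a whiteness statistic of a module's activations with an allowed limit; values well above the limit indicate strongly correlated channels. The statistic below is an illustrative choice that equals 1.0 for perfectly whitened features, not necessarily the exact metric computed in scaling.py.

import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels). Returns 1.0 when the channel covariance
    is proportional to the identity, and larger values otherwise."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]                   # (C, C) covariance
    num_channels = cov.shape[0]
    # Mean squared covariance entry relative to the squared mean variance.
    return (cov.pow(2).mean() * num_channels / cov.diag().mean().pow(2)).item()

x = torch.randn(1000, 256)
print(f"metric={whitening_metric(x):.2f} vs. limit=15.0")  # mimics the log format
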
], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:23:33,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1537542.0, ans=0.0 2023-06-23 17:23:45,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 5.150e+02 7.519e+02 1.212e+03 3.138e+03, threshold=1.504e+03, percent-clipped=3.0 2023-06-23 17:23:51,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1537602.0, ans=0.2 2023-06-23 17:24:06,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1537662.0, ans=0.035 2023-06-23 17:24:15,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1537662.0, ans=0.125 2023-06-23 17:25:02,685 INFO [train.py:996] (0/4) Epoch 9, batch 12350, loss[loss=0.2273, simple_loss=0.3135, pruned_loss=0.0705, over 21901.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3049, pruned_loss=0.07243, over 4278414.22 frames. ], batch size: 316, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:25:19,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1537842.0, ans=0.0 2023-06-23 17:25:21,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=22.5 2023-06-23 17:25:22,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1537902.0, ans=0.125 2023-06-23 17:26:35,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-23 17:26:40,777 INFO [train.py:996] (0/4) Epoch 9, batch 12400, loss[loss=0.2259, simple_loss=0.286, pruned_loss=0.0829, over 21667.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3067, pruned_loss=0.07589, over 4282214.82 frames. ], batch size: 230, lr: 3.30e-03, grad_scale: 32.0 2023-06-23 17:27:01,908 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.060e+02 5.603e+02 7.484e+02 1.004e+03 2.626e+03, threshold=1.497e+03, percent-clipped=10.0 2023-06-23 17:28:25,799 INFO [train.py:996] (0/4) Epoch 9, batch 12450, loss[loss=0.2686, simple_loss=0.3372, pruned_loss=0.09998, over 21815.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3112, pruned_loss=0.07945, over 4287450.93 frames. ], batch size: 282, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:28:40,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. 
limit=6.0 2023-06-23 17:28:43,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1538442.0, ans=0.125 2023-06-23 17:28:46,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1538502.0, ans=0.0 2023-06-23 17:28:52,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1538502.0, ans=0.125 2023-06-23 17:29:13,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1538562.0, ans=0.2 2023-06-23 17:30:00,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1538682.0, ans=0.125 2023-06-23 17:30:10,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1538742.0, ans=0.125 2023-06-23 17:30:11,458 INFO [train.py:996] (0/4) Epoch 9, batch 12500, loss[loss=0.2687, simple_loss=0.3572, pruned_loss=0.0901, over 21775.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3214, pruned_loss=0.08305, over 4282634.27 frames. ], batch size: 124, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:30:33,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.008e+02 5.952e+02 7.744e+02 1.112e+03 2.842e+03, threshold=1.549e+03, percent-clipped=7.0 2023-06-23 17:30:45,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1538802.0, ans=0.125 2023-06-23 17:30:50,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1538802.0, ans=0.125 2023-06-23 17:30:55,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1538862.0, ans=0.125 2023-06-23 17:31:53,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1538982.0, ans=0.125 2023-06-23 17:31:58,475 INFO [train.py:996] (0/4) Epoch 9, batch 12550, loss[loss=0.2303, simple_loss=0.3174, pruned_loss=0.07162, over 21664.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3255, pruned_loss=0.08458, over 4279337.57 frames. ], batch size: 298, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:33:25,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-23 17:33:32,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1539282.0, ans=0.04949747468305833 2023-06-23 17:33:43,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1539342.0, ans=0.125 2023-06-23 17:33:45,022 INFO [train.py:996] (0/4) Epoch 9, batch 12600, loss[loss=0.2768, simple_loss=0.3588, pruned_loss=0.09741, over 21419.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3247, pruned_loss=0.08246, over 4282761.63 frames. ], batch size: 507, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:33:56,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. 
limit=6.0 2023-06-23 17:34:03,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.580e+02 5.911e+02 8.328e+02 1.277e+03 2.400e+03, threshold=1.666e+03, percent-clipped=14.0 2023-06-23 17:34:47,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1539522.0, ans=0.125 2023-06-23 17:35:25,035 INFO [train.py:996] (0/4) Epoch 9, batch 12650, loss[loss=0.2275, simple_loss=0.3009, pruned_loss=0.07702, over 21874.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3176, pruned_loss=0.0786, over 4279038.34 frames. ], batch size: 124, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:35:34,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1539642.0, ans=0.125 2023-06-23 17:37:05,738 INFO [train.py:996] (0/4) Epoch 9, batch 12700, loss[loss=0.2878, simple_loss=0.3517, pruned_loss=0.1119, over 21807.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3171, pruned_loss=0.08133, over 4285312.05 frames. ], batch size: 441, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:37:09,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1539942.0, ans=10.0 2023-06-23 17:37:22,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1540002.0, ans=0.125 2023-06-23 17:37:23,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.729e+02 5.436e+02 7.219e+02 1.107e+03 2.161e+03, threshold=1.444e+03, percent-clipped=5.0 2023-06-23 17:37:37,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1540002.0, ans=0.125 2023-06-23 17:38:12,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-23 17:38:30,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1540182.0, ans=0.1 2023-06-23 17:38:46,100 INFO [train.py:996] (0/4) Epoch 9, batch 12750, loss[loss=0.2652, simple_loss=0.3344, pruned_loss=0.098, over 19936.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3179, pruned_loss=0.08171, over 4284048.46 frames. ], batch size: 702, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:39:37,523 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2023-06-23 17:39:47,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1540422.0, ans=0.125 2023-06-23 17:39:50,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1540422.0, ans=0.125 2023-06-23 17:40:26,705 INFO [train.py:996] (0/4) Epoch 9, batch 12800, loss[loss=0.2429, simple_loss=0.3112, pruned_loss=0.08728, over 21857.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3167, pruned_loss=0.08197, over 4283154.24 frames. 
], batch size: 247, lr: 3.30e-03, grad_scale: 32.0 2023-06-23 17:40:45,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1540542.0, ans=0.0 2023-06-23 17:40:51,648 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.701e+02 5.271e+02 6.331e+02 9.056e+02 1.664e+03, threshold=1.266e+03, percent-clipped=3.0 2023-06-23 17:40:53,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1540602.0, ans=0.2 2023-06-23 17:41:09,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1540662.0, ans=0.125 2023-06-23 17:42:08,032 INFO [train.py:996] (0/4) Epoch 9, batch 12850, loss[loss=0.2183, simple_loss=0.3135, pruned_loss=0.06152, over 21831.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3196, pruned_loss=0.08387, over 4286689.42 frames. ], batch size: 371, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:43:14,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1541022.0, ans=0.0 2023-06-23 17:43:54,700 INFO [train.py:996] (0/4) Epoch 9, batch 12900, loss[loss=0.1833, simple_loss=0.259, pruned_loss=0.05379, over 21787.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3167, pruned_loss=0.08, over 4287544.12 frames. ], batch size: 118, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:44:24,426 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 5.349e+02 7.787e+02 1.135e+03 3.186e+03, threshold=1.557e+03, percent-clipped=18.0 2023-06-23 17:44:48,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1541262.0, ans=0.1 2023-06-23 17:45:41,772 INFO [train.py:996] (0/4) Epoch 9, batch 12950, loss[loss=0.1571, simple_loss=0.228, pruned_loss=0.04314, over 17043.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3152, pruned_loss=0.07824, over 4283103.07 frames. ], batch size: 63, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:45:57,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1541442.0, ans=0.04949747468305833 2023-06-23 17:46:03,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1541502.0, ans=0.0 2023-06-23 17:46:13,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1541502.0, ans=0.09899494936611666 2023-06-23 17:47:27,848 INFO [train.py:996] (0/4) Epoch 9, batch 13000, loss[loss=0.1784, simple_loss=0.2656, pruned_loss=0.04561, over 21617.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3153, pruned_loss=0.07785, over 4282664.46 frames. ], batch size: 263, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:47:46,465 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.593e+02 5.869e+02 8.632e+02 1.298e+03 2.714e+03, threshold=1.726e+03, percent-clipped=15.0 2023-06-23 17:48:17,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1541862.0, ans=0.0 2023-06-23 17:49:01,569 INFO [train.py:996] (0/4) Epoch 9, batch 13050, loss[loss=0.2138, simple_loss=0.2878, pruned_loss=0.06992, over 21705.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3128, pruned_loss=0.07586, over 4269927.22 frames. 
], batch size: 230, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:49:18,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1542042.0, ans=0.07 2023-06-23 17:49:34,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1542102.0, ans=0.125 2023-06-23 17:49:42,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1542162.0, ans=0.035 2023-06-23 17:50:46,375 INFO [train.py:996] (0/4) Epoch 9, batch 13100, loss[loss=0.2227, simple_loss=0.3125, pruned_loss=0.06647, over 21783.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3144, pruned_loss=0.07651, over 4268652.12 frames. ], batch size: 332, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:50:52,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1542342.0, ans=0.125 2023-06-23 17:50:57,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1542342.0, ans=0.125 2023-06-23 17:51:06,466 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.778e+02 5.747e+02 7.827e+02 1.039e+03 1.771e+03, threshold=1.565e+03, percent-clipped=1.0 2023-06-23 17:51:06,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1542402.0, ans=0.125 2023-06-23 17:51:14,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-23 17:51:23,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1542462.0, ans=0.125 2023-06-23 17:51:31,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1542462.0, ans=0.0 2023-06-23 17:52:22,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1542582.0, ans=0.125 2023-06-23 17:52:28,878 INFO [train.py:996] (0/4) Epoch 9, batch 13150, loss[loss=0.2147, simple_loss=0.2932, pruned_loss=0.06809, over 21718.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3177, pruned_loss=0.08, over 4273523.29 frames. ], batch size: 332, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:53:28,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1542762.0, ans=0.2 2023-06-23 17:53:30,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1542762.0, ans=0.0 2023-06-23 17:53:34,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1542822.0, ans=0.125 2023-06-23 17:53:39,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-23 17:54:10,311 INFO [train.py:996] (0/4) Epoch 9, batch 13200, loss[loss=0.2982, simple_loss=0.3614, pruned_loss=0.1175, over 21582.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3147, pruned_loss=0.08024, over 4273766.12 frames. 
], batch size: 415, lr: 3.30e-03, grad_scale: 32.0 2023-06-23 17:54:33,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.479e+02 5.951e+02 7.570e+02 1.042e+03 3.191e+03, threshold=1.514e+03, percent-clipped=13.0 2023-06-23 17:54:37,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1543002.0, ans=0.0 2023-06-23 17:55:37,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1543182.0, ans=0.125 2023-06-23 17:55:42,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1543182.0, ans=0.2 2023-06-23 17:55:50,024 INFO [train.py:996] (0/4) Epoch 9, batch 13250, loss[loss=0.2541, simple_loss=0.3375, pruned_loss=0.08537, over 20662.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3151, pruned_loss=0.08139, over 4273395.28 frames. ], batch size: 607, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:55:50,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1543242.0, ans=0.125 2023-06-23 17:56:03,820 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:56:09,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-23 17:56:25,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1543302.0, ans=0.0 2023-06-23 17:56:35,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1543302.0, ans=0.125 2023-06-23 17:57:14,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1543482.0, ans=0.0 2023-06-23 17:57:36,356 INFO [train.py:996] (0/4) Epoch 9, batch 13300, loss[loss=0.2636, simple_loss=0.3525, pruned_loss=0.0873, over 21621.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.318, pruned_loss=0.0805, over 4274193.64 frames. ], batch size: 414, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:57:38,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1543542.0, ans=0.125 2023-06-23 17:58:08,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.720e+02 5.402e+02 7.318e+02 1.029e+03 1.964e+03, threshold=1.464e+03, percent-clipped=5.0 2023-06-23 17:58:22,135 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. 
limit=15.0 2023-06-23 17:58:26,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1543662.0, ans=0.125 2023-06-23 17:58:46,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1543722.0, ans=0.125 2023-06-23 17:59:10,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1543782.0, ans=0.0 2023-06-23 17:59:15,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1543782.0, ans=0.125 2023-06-23 17:59:18,333 INFO [train.py:996] (0/4) Epoch 9, batch 13350, loss[loss=0.2443, simple_loss=0.3266, pruned_loss=0.08102, over 21718.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3225, pruned_loss=0.08305, over 4274662.30 frames. ], batch size: 247, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:59:23,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1543842.0, ans=0.125 2023-06-23 17:59:49,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1543902.0, ans=0.125 2023-06-23 18:00:18,049 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5 2023-06-23 18:00:45,231 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-23 18:00:57,225 INFO [train.py:996] (0/4) Epoch 9, batch 13400, loss[loss=0.2549, simple_loss=0.3281, pruned_loss=0.09081, over 21712.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3236, pruned_loss=0.08383, over 4277171.03 frames. ], batch size: 414, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:01:35,265 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.876e+02 6.088e+02 8.910e+02 1.105e+03 2.382e+03, threshold=1.782e+03, percent-clipped=11.0 2023-06-23 18:02:23,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-23 18:02:33,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1544382.0, ans=0.0 2023-06-23 18:02:46,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1544382.0, ans=0.0 2023-06-23 18:02:50,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=22.5 2023-06-23 18:02:50,801 INFO [train.py:996] (0/4) Epoch 9, batch 13450, loss[loss=0.26, simple_loss=0.3264, pruned_loss=0.0968, over 21645.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3253, pruned_loss=0.08688, over 4270070.34 frames. 
], batch size: 391, lr: 3.30e-03, grad_scale: 8.0 2023-06-23 18:03:07,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1544442.0, ans=0.05 2023-06-23 18:03:09,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1544442.0, ans=0.125 2023-06-23 18:03:52,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1544622.0, ans=0.1 2023-06-23 18:04:30,762 INFO [train.py:996] (0/4) Epoch 9, batch 13500, loss[loss=0.2254, simple_loss=0.3036, pruned_loss=0.07363, over 21872.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.317, pruned_loss=0.08468, over 4273402.10 frames. ], batch size: 317, lr: 3.30e-03, grad_scale: 8.0 2023-06-23 18:04:50,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1544802.0, ans=0.0 2023-06-23 18:04:54,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.791e+02 5.275e+02 7.519e+02 1.324e+03 2.778e+03, threshold=1.504e+03, percent-clipped=14.0 2023-06-23 18:05:21,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1544862.0, ans=0.125 2023-06-23 18:05:45,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1544922.0, ans=0.125 2023-06-23 18:06:13,534 INFO [train.py:996] (0/4) Epoch 9, batch 13550, loss[loss=0.246, simple_loss=0.3534, pruned_loss=0.06934, over 21708.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3205, pruned_loss=0.08406, over 4274042.57 frames. ], batch size: 298, lr: 3.30e-03, grad_scale: 8.0 2023-06-23 18:06:15,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1545042.0, ans=0.2 2023-06-23 18:06:26,526 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.19 vs. limit=10.0 2023-06-23 18:07:04,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1545162.0, ans=0.125 2023-06-23 18:07:30,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1545222.0, ans=0.0 2023-06-23 18:07:49,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1545282.0, ans=0.125 2023-06-23 18:07:55,239 INFO [train.py:996] (0/4) Epoch 9, batch 13600, loss[loss=0.2089, simple_loss=0.29, pruned_loss=0.06393, over 21414.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.322, pruned_loss=0.08453, over 4274471.64 frames. ], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:08:18,107 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.790e+02 6.429e+02 9.165e+02 1.553e+03 3.162e+03, threshold=1.833e+03, percent-clipped=25.0 2023-06-23 18:09:03,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1545522.0, ans=0.1 2023-06-23 18:09:24,913 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.89 vs. 
limit=15.0 2023-06-23 18:09:30,326 INFO [train.py:996] (0/4) Epoch 9, batch 13650, loss[loss=0.2169, simple_loss=0.2849, pruned_loss=0.07444, over 21632.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3157, pruned_loss=0.08121, over 4270449.37 frames. ], batch size: 332, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:10:32,619 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-23 18:11:05,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1545882.0, ans=0.1 2023-06-23 18:11:07,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1545882.0, ans=0.125 2023-06-23 18:11:13,790 INFO [train.py:996] (0/4) Epoch 9, batch 13700, loss[loss=0.2298, simple_loss=0.2923, pruned_loss=0.08368, over 21604.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3106, pruned_loss=0.08148, over 4270849.62 frames. ], batch size: 263, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:11:41,514 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.658e+02 5.677e+02 7.972e+02 1.070e+03 2.613e+03, threshold=1.594e+03, percent-clipped=4.0 2023-06-23 18:12:13,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-23 18:12:40,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1546182.0, ans=0.125 2023-06-23 18:12:43,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1546182.0, ans=0.2 2023-06-23 18:12:51,932 INFO [train.py:996] (0/4) Epoch 9, batch 13750, loss[loss=0.2955, simple_loss=0.3667, pruned_loss=0.1122, over 21513.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3093, pruned_loss=0.08155, over 4271458.50 frames. ], batch size: 508, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:14:02,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1546422.0, ans=0.125 2023-06-23 18:14:15,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1546422.0, ans=0.125 2023-06-23 18:14:34,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1546542.0, ans=0.0 2023-06-23 18:14:35,704 INFO [train.py:996] (0/4) Epoch 9, batch 13800, loss[loss=0.3556, simple_loss=0.4421, pruned_loss=0.1346, over 21463.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.315, pruned_loss=0.08068, over 4273795.24 frames. ], batch size: 507, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:15:13,641 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. 
limit=10.0 2023-06-23 18:15:16,355 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.059e+02 5.824e+02 9.603e+02 1.417e+03 3.093e+03, threshold=1.921e+03, percent-clipped=19.0 2023-06-23 18:15:43,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1546662.0, ans=0.125 2023-06-23 18:15:46,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=15.0 2023-06-23 18:16:23,580 INFO [train.py:996] (0/4) Epoch 9, batch 13850, loss[loss=0.2054, simple_loss=0.2801, pruned_loss=0.06538, over 21861.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3212, pruned_loss=0.08146, over 4272295.98 frames. ], batch size: 107, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:16:34,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1546842.0, ans=0.1 2023-06-23 18:17:25,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1546962.0, ans=0.1 2023-06-23 18:17:28,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1547022.0, ans=0.2 2023-06-23 18:17:52,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1547082.0, ans=0.0 2023-06-23 18:18:14,873 INFO [train.py:996] (0/4) Epoch 9, batch 13900, loss[loss=0.2548, simple_loss=0.3254, pruned_loss=0.09207, over 21413.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3253, pruned_loss=0.08406, over 4272073.54 frames. ], batch size: 211, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:18:23,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1547142.0, ans=0.0 2023-06-23 18:18:41,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.964e+02 6.028e+02 8.450e+02 1.187e+03 2.483e+03, threshold=1.690e+03, percent-clipped=4.0 2023-06-23 18:18:47,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-23 18:19:06,024 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:19:31,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1547382.0, ans=0.2 2023-06-23 18:19:49,980 INFO [train.py:996] (0/4) Epoch 9, batch 13950, loss[loss=0.2189, simple_loss=0.2929, pruned_loss=0.07242, over 21803.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3243, pruned_loss=0.08598, over 4280134.15 frames. ], batch size: 298, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:20:13,728 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-23 18:20:42,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.39 vs. 
limit=12.0 2023-06-23 18:20:48,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1547622.0, ans=0.125 2023-06-23 18:21:29,628 INFO [train.py:996] (0/4) Epoch 9, batch 14000, loss[loss=0.2108, simple_loss=0.3099, pruned_loss=0.05586, over 21399.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3218, pruned_loss=0.08414, over 4277004.40 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 32.0 2023-06-23 18:21:56,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.630e+02 5.065e+02 8.726e+02 1.240e+03 2.803e+03, threshold=1.745e+03, percent-clipped=8.0 2023-06-23 18:22:38,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1547922.0, ans=0.0 2023-06-23 18:23:03,124 INFO [train.py:996] (0/4) Epoch 9, batch 14050, loss[loss=0.2013, simple_loss=0.2807, pruned_loss=0.06089, over 21404.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3158, pruned_loss=0.07947, over 4273293.53 frames. ], batch size: 211, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:23:10,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1548042.0, ans=0.0 2023-06-23 18:23:35,574 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.56 vs. limit=15.0 2023-06-23 18:24:02,843 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=22.5 2023-06-23 18:24:11,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1548222.0, ans=0.125 2023-06-23 18:24:22,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=22.5 2023-06-23 18:24:26,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1548282.0, ans=0.1 2023-06-23 18:24:29,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1548282.0, ans=0.0 2023-06-23 18:24:42,574 INFO [train.py:996] (0/4) Epoch 9, batch 14100, loss[loss=0.2274, simple_loss=0.3033, pruned_loss=0.07576, over 21695.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3087, pruned_loss=0.07886, over 4273187.09 frames. ], batch size: 332, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:25:10,284 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.343e+02 6.298e+02 9.143e+02 1.408e+03 2.663e+03, threshold=1.829e+03, percent-clipped=10.0 2023-06-23 18:25:46,649 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5 2023-06-23 18:26:08,333 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-06-23 18:26:09,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1548642.0, ans=0.125 2023-06-23 18:26:10,621 INFO [train.py:996] (0/4) Epoch 9, batch 14150, loss[loss=0.2164, simple_loss=0.3075, pruned_loss=0.06268, over 21628.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.313, pruned_loss=0.07966, over 4259130.57 frames. 
], batch size: 263, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:27:12,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1548822.0, ans=0.2 2023-06-23 18:27:48,779 INFO [train.py:996] (0/4) Epoch 9, batch 14200, loss[loss=0.2472, simple_loss=0.3477, pruned_loss=0.07331, over 19903.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3126, pruned_loss=0.07871, over 4260254.74 frames. ], batch size: 702, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:28:00,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=22.5 2023-06-23 18:28:00,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1548942.0, ans=0.0 2023-06-23 18:28:05,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1548942.0, ans=0.125 2023-06-23 18:28:16,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1549002.0, ans=0.125 2023-06-23 18:28:22,650 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.722e+02 5.420e+02 7.650e+02 1.190e+03 2.098e+03, threshold=1.530e+03, percent-clipped=4.0 2023-06-23 18:28:26,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1549002.0, ans=0.0 2023-06-23 18:28:46,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1549062.0, ans=0.125 2023-06-23 18:29:01,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1549122.0, ans=0.125 2023-06-23 18:29:10,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1549182.0, ans=0.125 2023-06-23 18:29:15,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1549182.0, ans=0.0 2023-06-23 18:29:27,862 INFO [train.py:996] (0/4) Epoch 9, batch 14250, loss[loss=0.234, simple_loss=0.3, pruned_loss=0.08396, over 21874.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3078, pruned_loss=0.07881, over 4269665.07 frames. ], batch size: 98, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:30:00,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1549302.0, ans=0.125 2023-06-23 18:30:26,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-06-23 18:30:44,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1549422.0, ans=0.0 2023-06-23 18:30:56,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1549482.0, ans=0.125 2023-06-23 18:31:09,754 INFO [train.py:996] (0/4) Epoch 9, batch 14300, loss[loss=0.3656, simple_loss=0.4427, pruned_loss=0.1443, over 21679.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.31, pruned_loss=0.07882, over 4261346.57 frames. 
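
Each train.py entry pairs a per-batch loss[...] over a few thousand frames with a tot_loss[...] over a slowly growing, fractional frame count, which reads like a frame-weighted, decayed running average of the batch losses. A minimal sketch under that assumption; the class name and decay constant are illustrative:

```python
# Hedged sketch of how a frame-weighted running loss such as
# "tot_loss[loss=..., over N frames]" could be maintained.
class RunningLoss:
    def __init__(self, decay: float = 0.999):
        self.decay = decay       # forget old batches slowly
        self.frames = 0.0        # decayed frame count
        self.loss_sum = 0.0      # decayed sum of (loss * frames)

    def update(self, loss: float, num_frames: float) -> None:
        # Decay the history, then add the current batch, so recent batches
        # dominate while the frame count stays interpretable.
        self.frames = self.decay * self.frames + num_frames
        self.loss_sum = self.decay * self.loss_sum + loss * num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)


tracker = RunningLoss()
tracker.update(loss=0.2108, num_frames=21399.0)
print(f"tot_loss[loss={tracker.value:.4f}, over {tracker.frames:.2f} frames.]")
```
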
], batch size: 414, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:31:11,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1549542.0, ans=0.1 2023-06-23 18:31:40,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1549602.0, ans=15.0 2023-06-23 18:31:49,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.696e+02 4.681e+02 6.476e+02 1.240e+03 3.295e+03, threshold=1.295e+03, percent-clipped=18.0 2023-06-23 18:32:07,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1549662.0, ans=0.1 2023-06-23 18:32:33,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1549782.0, ans=0.0 2023-06-23 18:32:40,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1549782.0, ans=0.0 2023-06-23 18:32:49,948 INFO [train.py:996] (0/4) Epoch 9, batch 14350, loss[loss=0.1986, simple_loss=0.2909, pruned_loss=0.05311, over 21032.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3153, pruned_loss=0.07878, over 4250474.76 frames. ], batch size: 608, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:32:57,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1549842.0, ans=0.2 2023-06-23 18:33:15,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1549902.0, ans=0.0 2023-06-23 18:34:09,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1550022.0, ans=0.125 2023-06-23 18:34:14,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1550082.0, ans=0.0 2023-06-23 18:34:34,671 INFO [train.py:996] (0/4) Epoch 9, batch 14400, loss[loss=0.1796, simple_loss=0.2265, pruned_loss=0.06628, over 16907.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3121, pruned_loss=0.079, over 4248342.20 frames. ], batch size: 61, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:35:05,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1550202.0, ans=0.125 2023-06-23 18:35:09,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.631e+02 4.844e+02 6.439e+02 1.111e+03 2.671e+03, threshold=1.288e+03, percent-clipped=19.0 2023-06-23 18:35:10,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1550202.0, ans=0.0 2023-06-23 18:36:07,895 INFO [train.py:996] (0/4) Epoch 9, batch 14450, loss[loss=0.2249, simple_loss=0.2948, pruned_loss=0.07747, over 21799.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3065, pruned_loss=0.07908, over 4253537.41 frames. 
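
The ScheduledFloat entries report the current value ("ans") of a named hyperparameter, such as a dropout_p or a *_skip_rate, as a function of batch_count. Below is a hedged sketch of a piecewise-linear schedule keyed on batch count; the schedule points in the example are made up, and this is not the actual scaling.py class.

```python
# Hedged sketch of a piecewise-linear float schedule in the spirit of the
# ScheduledFloat "name=..., batch_count=..., ans=..." lines above.
from bisect import bisect_right


class ScheduledValue:
    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count.
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        frac = (batch_count - x0) / (x1 - x0)
        return y0 + frac * (y1 - y0)


# e.g. a dropout probability annealed from 0.3 to 0.1 over 20k batches
dropout_p = ScheduledValue((0.0, 0.3), (20000.0, 0.1))
print(dropout_p(0.0), dropout_p(10000.0), dropout_p(1547922.0))  # 0.3 0.2 0.1
```
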
], batch size: 112, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:36:39,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1550502.0, ans=0.0 2023-06-23 18:36:46,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1550502.0, ans=0.125 2023-06-23 18:36:47,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1550502.0, ans=0.125 2023-06-23 18:37:43,338 INFO [train.py:996] (0/4) Epoch 9, batch 14500, loss[loss=0.2064, simple_loss=0.3042, pruned_loss=0.05427, over 21264.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3022, pruned_loss=0.0787, over 4262069.51 frames. ], batch size: 176, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:37:50,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1550742.0, ans=0.0 2023-06-23 18:38:13,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1550802.0, ans=0.2 2023-06-23 18:38:23,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.792e+02 5.208e+02 6.817e+02 8.713e+02 1.535e+03, threshold=1.363e+03, percent-clipped=1.0 2023-06-23 18:38:38,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1550862.0, ans=0.2 2023-06-23 18:39:17,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1550982.0, ans=0.125 2023-06-23 18:39:29,524 INFO [train.py:996] (0/4) Epoch 9, batch 14550, loss[loss=0.3302, simple_loss=0.3898, pruned_loss=0.1353, over 21324.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.307, pruned_loss=0.08088, over 4265561.39 frames. ], batch size: 507, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:39:38,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1551042.0, ans=0.125 2023-06-23 18:39:45,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1551042.0, ans=0.2 2023-06-23 18:39:49,150 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.81 vs. limit=15.0 2023-06-23 18:39:50,529 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2023-06-23 18:41:04,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1551342.0, ans=0.125 2023-06-23 18:41:10,904 INFO [train.py:996] (0/4) Epoch 9, batch 14600, loss[loss=0.2534, simple_loss=0.3218, pruned_loss=0.09248, over 21607.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3166, pruned_loss=0.08571, over 4270212.07 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:41:28,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1551342.0, ans=0.0 2023-06-23 18:41:30,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. 
limit=15.0 2023-06-23 18:41:42,081 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.404e+02 6.083e+02 8.730e+02 1.243e+03 2.471e+03, threshold=1.746e+03, percent-clipped=17.0 2023-06-23 18:42:17,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1551522.0, ans=0.125 2023-06-23 18:42:25,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1551582.0, ans=0.125 2023-06-23 18:42:25,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-23 18:42:45,950 INFO [train.py:996] (0/4) Epoch 9, batch 14650, loss[loss=0.2208, simple_loss=0.3028, pruned_loss=0.06938, over 20112.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3186, pruned_loss=0.08445, over 4254439.39 frames. ], batch size: 702, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:43:00,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1551642.0, ans=0.125 2023-06-23 18:43:00,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1551642.0, ans=0.1 2023-06-23 18:43:04,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1551642.0, ans=0.125 2023-06-23 18:43:16,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1551702.0, ans=0.0 2023-06-23 18:43:49,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1551822.0, ans=0.125 2023-06-23 18:43:52,248 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:43:55,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1551822.0, ans=0.0 2023-06-23 18:44:05,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1551882.0, ans=0.0 2023-06-23 18:44:06,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=22.5 2023-06-23 18:44:19,292 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-23 18:44:21,217 INFO [train.py:996] (0/4) Epoch 9, batch 14700, loss[loss=0.2114, simple_loss=0.3116, pruned_loss=0.05564, over 21776.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3131, pruned_loss=0.07878, over 4256517.79 frames. 
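
The Whitening entries compare a metric against a limit for a named activation. One plausible reading is that the metric measures how far the feature covariance is from a multiple of the identity: 1.0 for fully whitened features, large when a few directions dominate. The sketch below implements that reading via traces; it is an assumption about the intent of the logged number, not the verbatim scaling.py formula.

```python
# Hedged sketch of a whitening metric: ratio of the mean squared eigenvalue of
# the feature covariance to the squared mean eigenvalue, computed via traces so
# no eigendecomposition is needed. Equals 1.0 when the covariance is
# proportional to the identity, and grows when variance concentrates.
import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (..., num_channels); flatten everything except the channel dim.
    x = x.reshape(-1, x.shape[-1])
    num_channels = x.shape[-1]
    assert num_channels % num_groups == 0
    x = x.reshape(-1, num_groups, num_channels // num_groups).transpose(0, 1)
    metrics = []
    for g in range(num_groups):
        xg = x[g] - x[g].mean(dim=0, keepdim=True)
        cov = (xg.t() @ xg) / xg.shape[0]
        c = cov.shape[0]
        mean_eig = torch.diagonal(cov).mean()        # trace(cov) / c
        mean_sq_eig = (cov * cov).sum() / c          # trace(cov @ cov) / c
        metrics.append((mean_sq_eig / (mean_eig ** 2 + 1e-20)).item())
    return sum(metrics) / len(metrics)


feats = torch.randn(100, 8, 256)   # (batch, time, channels), synthetic
print(f"metric={whitening_metric(feats):.2f} vs. limit=15.0")
```
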
], batch size: 282, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:44:39,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1551942.0, ans=0.125 2023-06-23 18:44:39,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1551942.0, ans=0.125 2023-06-23 18:44:59,155 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.773e+02 5.196e+02 7.542e+02 1.109e+03 2.941e+03, threshold=1.508e+03, percent-clipped=7.0 2023-06-23 18:45:02,020 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. limit=10.0 2023-06-23 18:45:29,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1552122.0, ans=0.125 2023-06-23 18:45:55,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=15.0 2023-06-23 18:45:56,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1552182.0, ans=0.015 2023-06-23 18:46:08,870 INFO [train.py:996] (0/4) Epoch 9, batch 14750, loss[loss=0.2989, simple_loss=0.3632, pruned_loss=0.1173, over 21487.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3157, pruned_loss=0.08072, over 4260836.47 frames. ], batch size: 131, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:46:23,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1552242.0, ans=0.1 2023-06-23 18:46:24,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1552302.0, ans=0.125 2023-06-23 18:47:45,569 INFO [train.py:996] (0/4) Epoch 9, batch 14800, loss[loss=0.2755, simple_loss=0.3456, pruned_loss=0.1027, over 21781.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3287, pruned_loss=0.08781, over 4260757.06 frames. ], batch size: 351, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:48:14,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1552602.0, ans=0.035 2023-06-23 18:48:16,740 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.287e+02 6.500e+02 8.733e+02 1.311e+03 2.731e+03, threshold=1.747e+03, percent-clipped=18.0 2023-06-23 18:48:20,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1552662.0, ans=0.0 2023-06-23 18:48:24,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.30 vs. limit=15.0 2023-06-23 18:48:30,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1552662.0, ans=0.09899494936611666 2023-06-23 18:48:55,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1552722.0, ans=0.0 2023-06-23 18:49:32,209 INFO [train.py:996] (0/4) Epoch 9, batch 14850, loss[loss=0.314, simple_loss=0.3865, pruned_loss=0.1207, over 21412.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3227, pruned_loss=0.08712, over 4263991.67 frames. 
], batch size: 471, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:49:35,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1552842.0, ans=0.125 2023-06-23 18:49:47,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1552902.0, ans=0.125 2023-06-23 18:50:17,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1552962.0, ans=0.125 2023-06-23 18:50:31,098 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-23 18:50:49,947 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:51:15,506 INFO [train.py:996] (0/4) Epoch 9, batch 14900, loss[loss=0.2645, simple_loss=0.3395, pruned_loss=0.09479, over 21270.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.325, pruned_loss=0.08811, over 4266669.26 frames. ], batch size: 143, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:51:54,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.598e+02 5.568e+02 9.380e+02 1.428e+03 3.360e+03, threshold=1.876e+03, percent-clipped=13.0 2023-06-23 18:51:55,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1553202.0, ans=0.2 2023-06-23 18:51:58,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1553262.0, ans=0.5 2023-06-23 18:52:55,915 INFO [train.py:996] (0/4) Epoch 9, batch 14950, loss[loss=0.2232, simple_loss=0.3207, pruned_loss=0.06287, over 21234.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3245, pruned_loss=0.08617, over 4262201.77 frames. ], batch size: 549, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:53:32,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1553502.0, ans=0.5 2023-06-23 18:54:37,628 INFO [train.py:996] (0/4) Epoch 9, batch 15000, loss[loss=0.241, simple_loss=0.3036, pruned_loss=0.08918, over 21325.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3289, pruned_loss=0.0892, over 4263308.40 frames. ], batch size: 159, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:54:37,629 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 18:54:58,177 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2574, simple_loss=0.352, pruned_loss=0.08137, over 1796401.00 frames. 2023-06-23 18:54:58,178 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 18:55:04,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. 
limit=15.0 2023-06-23 18:55:10,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1553742.0, ans=0.1 2023-06-23 18:55:20,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1553802.0, ans=0.0 2023-06-23 18:55:32,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.760e+02 5.829e+02 9.207e+02 1.364e+03 3.991e+03, threshold=1.841e+03, percent-clipped=17.0 2023-06-23 18:55:34,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1553862.0, ans=0.0 2023-06-23 18:55:46,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1553862.0, ans=0.2 2023-06-23 18:56:17,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1553982.0, ans=0.0 2023-06-23 18:56:39,797 INFO [train.py:996] (0/4) Epoch 9, batch 15050, loss[loss=0.2938, simple_loss=0.3878, pruned_loss=0.09993, over 20701.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3294, pruned_loss=0.08927, over 4262221.99 frames. ], batch size: 607, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:57:59,182 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-23 18:58:20,519 INFO [train.py:996] (0/4) Epoch 9, batch 15100, loss[loss=0.2299, simple_loss=0.3045, pruned_loss=0.07766, over 21813.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3304, pruned_loss=0.08806, over 4262420.14 frames. ], batch size: 247, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:58:40,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1554402.0, ans=0.125 2023-06-23 18:58:59,274 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.914e+02 5.498e+02 7.540e+02 1.313e+03 2.793e+03, threshold=1.508e+03, percent-clipped=8.0 2023-06-23 18:58:59,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1554402.0, ans=0.125 2023-06-23 18:59:39,232 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-23 18:59:48,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2023-06-23 19:00:04,651 INFO [train.py:996] (0/4) Epoch 9, batch 15150, loss[loss=0.2671, simple_loss=0.3159, pruned_loss=0.1092, over 21593.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3266, pruned_loss=0.08862, over 4267448.91 frames. 
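
At batch 15000 above, training pauses to compute a validation loss over roughly 1.8M frames and then reports the peak CUDA memory. A hedged sketch of such a pass; model, valid_loader, and compute_loss are placeholders rather than the recipe's real objects.

```python
# Hedged sketch of a periodic validation pass like the one logged at batch 15000.
# The memory line mirrors the "Maximum memory allocated" entries above.
import torch


def validate(model, valid_loader, compute_loss, device="cuda:0"):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    print(f"validation: loss={tot_loss / max(tot_frames, 1.0):.4f}, "
          f"over {tot_frames:.2f} frames.")
    if torch.cuda.is_available():
        mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"Maximum memory allocated so far is {mb}MB")
```
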
], batch size: 415, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 19:00:05,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1554642.0, ans=0.0 2023-06-23 19:00:39,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1554702.0, ans=0.125 2023-06-23 19:01:14,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1554822.0, ans=0.125 2023-06-23 19:01:17,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1554822.0, ans=0.125 2023-06-23 19:01:38,679 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-23 19:01:45,750 INFO [train.py:996] (0/4) Epoch 9, batch 15200, loss[loss=0.2329, simple_loss=0.3238, pruned_loss=0.07095, over 21679.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.319, pruned_loss=0.08423, over 4260929.35 frames. ], batch size: 415, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 19:02:10,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-06-23 19:02:16,968 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-23 19:02:19,112 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.584e+02 6.669e+02 9.281e+02 1.408e+03 4.015e+03, threshold=1.856e+03, percent-clipped=19.0 2023-06-23 19:02:26,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1555062.0, ans=0.0 2023-06-23 19:02:32,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1555062.0, ans=0.125 2023-06-23 19:02:36,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-23 19:03:27,236 INFO [train.py:996] (0/4) Epoch 9, batch 15250, loss[loss=0.1935, simple_loss=0.2576, pruned_loss=0.06472, over 21280.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3126, pruned_loss=0.08288, over 4259113.01 frames. ], batch size: 551, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 19:03:31,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.91 vs. limit=22.5 2023-06-23 19:04:00,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1555302.0, ans=0.125 2023-06-23 19:05:06,545 INFO [train.py:996] (0/4) Epoch 9, batch 15300, loss[loss=0.2729, simple_loss=0.3391, pruned_loss=0.1033, over 21444.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3149, pruned_loss=0.08542, over 4254557.87 frames. 
], batch size: 194, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 19:05:15,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1555542.0, ans=0.1 2023-06-23 19:05:41,062 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.812e+02 5.913e+02 8.241e+02 1.222e+03 2.288e+03, threshold=1.648e+03, percent-clipped=6.0 2023-06-23 19:05:52,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1555662.0, ans=0.125 2023-06-23 19:06:19,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1555722.0, ans=0.125 2023-06-23 19:06:29,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=12.0 2023-06-23 19:06:42,607 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.61 vs. limit=6.0 2023-06-23 19:06:52,451 INFO [train.py:996] (0/4) Epoch 9, batch 15350, loss[loss=0.2359, simple_loss=0.3372, pruned_loss=0.06726, over 21844.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3174, pruned_loss=0.08605, over 4258988.80 frames. ], batch size: 371, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:06:56,536 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.70 vs. limit=10.0 2023-06-23 19:07:06,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-06-23 19:07:21,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1555902.0, ans=0.125 2023-06-23 19:08:26,435 INFO [train.py:996] (0/4) Epoch 9, batch 15400, loss[loss=0.2373, simple_loss=0.3099, pruned_loss=0.08232, over 21867.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3173, pruned_loss=0.08408, over 4266264.68 frames. ], batch size: 371, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:08:29,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1556142.0, ans=0.2 2023-06-23 19:08:50,203 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-23 19:08:58,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.809e+02 6.075e+02 7.840e+02 1.049e+03 1.941e+03, threshold=1.568e+03, percent-clipped=4.0 2023-06-23 19:09:19,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1556262.0, ans=0.2 2023-06-23 19:10:04,882 INFO [train.py:996] (0/4) Epoch 9, batch 15450, loss[loss=0.2186, simple_loss=0.3084, pruned_loss=0.06437, over 21848.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3145, pruned_loss=0.08267, over 4261180.86 frames. 
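
The learning rate drifts very slowly across these entries (3.29e-03 earlier, 3.28e-03 here), which is consistent with a schedule that decays as a mild power of both the batch index and the epoch. The function below sketches one such rule; the exponent and the default constants are illustrative assumptions, not a verified copy of the recipe's scheduler.

```python
# Hedged sketch of a batch- and epoch-dependent learning-rate schedule that
# would produce a slowly drifting "lr:" like the one logged above.
def scheduled_lr(base_lr: float, batch: int, epoch: float,
                 lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor


for step in (250_000, 255_000, 260_000):
    # base_lr=0.05 is an illustrative value, not the run's actual setting.
    print(f"batch {step}: lr={scheduled_lr(0.05, step, epoch=9.0):.2e}")
```
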
], batch size: 351, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:10:23,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1556442.0, ans=0.125 2023-06-23 19:10:25,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-23 19:10:27,112 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-23 19:10:39,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1556502.0, ans=0.125 2023-06-23 19:11:37,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1556682.0, ans=0.125 2023-06-23 19:11:46,680 INFO [train.py:996] (0/4) Epoch 9, batch 15500, loss[loss=0.2888, simple_loss=0.3522, pruned_loss=0.1128, over 21281.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3178, pruned_loss=0.0828, over 4250945.75 frames. ], batch size: 159, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:12:26,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.768e+02 5.144e+02 7.122e+02 1.016e+03 2.468e+03, threshold=1.424e+03, percent-clipped=4.0 2023-06-23 19:13:34,008 INFO [train.py:996] (0/4) Epoch 9, batch 15550, loss[loss=0.2255, simple_loss=0.3051, pruned_loss=0.07292, over 21708.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3177, pruned_loss=0.08118, over 4255057.40 frames. ], batch size: 332, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:13:37,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1557042.0, ans=0.2 2023-06-23 19:13:39,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1557042.0, ans=0.2 2023-06-23 19:13:40,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1557042.0, ans=0.1 2023-06-23 19:13:46,012 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:13:46,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1557042.0, ans=0.125 2023-06-23 19:13:47,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1557042.0, ans=0.125 2023-06-23 19:13:52,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1557102.0, ans=0.125 2023-06-23 19:15:14,772 INFO [train.py:996] (0/4) Epoch 9, batch 15600, loss[loss=0.2014, simple_loss=0.2854, pruned_loss=0.05872, over 21663.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3109, pruned_loss=0.07949, over 4255288.38 frames. 
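
grad_scale in these entries moves between 8.0, 16.0 and 32.0, the signature of dynamic loss scaling for fp16 training: the scale grows after a stretch of finite gradients and is cut back when gradients overflow. A minimal sketch with torch.cuda.amp.GradScaler (requires a GPU); whether the recipe uses this class or its own scaler is an assumption, and the model, optimizer, and loss below are placeholders.

```python
# Hedged sketch of fp16 training with dynamic loss scaling, which would yield a
# "grad_scale" that doubles after stable stretches and halves on inf/nan grads.
import torch

model = torch.nn.Linear(80, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)
scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)

for _ in range(3):
    x = torch.randn(8, 80, device="cuda")
    with torch.cuda.amp.autocast():
        loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    scaler.scale(loss).backward()   # scale the loss before backward
    scaler.step(optimizer)          # unscale grads, skip the step on overflow
    scaler.update()                 # grow or shrink the scale
    print(f"grad_scale: {scaler.get_scale()}")
```
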
], batch size: 247, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:15:22,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1557342.0, ans=15.0 2023-06-23 19:15:38,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1557402.0, ans=0.125 2023-06-23 19:15:49,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.952e+02 5.285e+02 6.916e+02 1.084e+03 2.169e+03, threshold=1.383e+03, percent-clipped=9.0 2023-06-23 19:15:52,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2023-06-23 19:16:11,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.54 vs. limit=22.5 2023-06-23 19:16:19,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1557522.0, ans=0.07 2023-06-23 19:16:24,756 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-23 19:16:30,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1557522.0, ans=0.125 2023-06-23 19:16:38,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1557582.0, ans=0.1 2023-06-23 19:16:52,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1557582.0, ans=0.1 2023-06-23 19:16:55,699 INFO [train.py:996] (0/4) Epoch 9, batch 15650, loss[loss=0.2339, simple_loss=0.2918, pruned_loss=0.08795, over 21726.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.311, pruned_loss=0.07989, over 4257605.66 frames. ], batch size: 371, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:17:02,481 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:17:52,318 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-23 19:18:20,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1557882.0, ans=0.1 2023-06-23 19:18:27,274 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:18:31,501 INFO [train.py:996] (0/4) Epoch 9, batch 15700, loss[loss=0.2126, simple_loss=0.287, pruned_loss=0.0691, over 21645.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3067, pruned_loss=0.07898, over 4252729.80 frames. ], batch size: 263, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:18:32,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1557942.0, ans=0.0 2023-06-23 19:18:38,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1557942.0, ans=0.1 2023-06-23 19:18:40,726 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.59 vs. 
limit=10.0 2023-06-23 19:19:07,045 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.566e+02 5.534e+02 7.672e+02 1.120e+03 2.103e+03, threshold=1.534e+03, percent-clipped=13.0 2023-06-23 19:20:11,354 INFO [train.py:996] (0/4) Epoch 9, batch 15750, loss[loss=0.2228, simple_loss=0.2909, pruned_loss=0.07731, over 21457.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3029, pruned_loss=0.07881, over 4241957.34 frames. ], batch size: 389, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:20:20,046 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-23 19:20:34,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1558302.0, ans=0.125 2023-06-23 19:21:11,319 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:21:50,641 INFO [train.py:996] (0/4) Epoch 9, batch 15800, loss[loss=0.2077, simple_loss=0.2693, pruned_loss=0.073, over 21489.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2987, pruned_loss=0.07827, over 4239657.96 frames. ], batch size: 212, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:21:51,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-23 19:22:11,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1558602.0, ans=0.125 2023-06-23 19:22:19,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1558602.0, ans=10.0 2023-06-23 19:22:24,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1558602.0, ans=0.125 2023-06-23 19:22:26,971 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 5.286e+02 6.867e+02 8.896e+02 1.872e+03, threshold=1.373e+03, percent-clipped=1.0 2023-06-23 19:23:18,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1558782.0, ans=0.125 2023-06-23 19:23:21,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1558782.0, ans=0.0 2023-06-23 19:23:30,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-23 19:23:30,844 INFO [train.py:996] (0/4) Epoch 9, batch 15850, loss[loss=0.2524, simple_loss=0.3216, pruned_loss=0.09157, over 21959.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.2995, pruned_loss=0.07983, over 4251955.60 frames. ], batch size: 317, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:23:53,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1558902.0, ans=0.035 2023-06-23 19:24:38,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1559022.0, ans=0.0 2023-06-23 19:25:10,613 INFO [train.py:996] (0/4) Epoch 9, batch 15900, loss[loss=0.2375, simple_loss=0.3209, pruned_loss=0.07708, over 21653.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2972, pruned_loss=0.07935, over 4238176.43 frames. 
], batch size: 298, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:25:46,399 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.755e+02 5.068e+02 6.374e+02 9.133e+02 1.940e+03, threshold=1.275e+03, percent-clipped=6.0 2023-06-23 19:25:50,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1559262.0, ans=0.125 2023-06-23 19:26:04,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1559262.0, ans=0.2 2023-06-23 19:26:51,990 INFO [train.py:996] (0/4) Epoch 9, batch 15950, loss[loss=0.181, simple_loss=0.2768, pruned_loss=0.0426, over 21616.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2982, pruned_loss=0.07758, over 4233238.35 frames. ], batch size: 263, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:27:26,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1559502.0, ans=0.0 2023-06-23 19:27:26,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1559502.0, ans=0.1 2023-06-23 19:27:37,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1559562.0, ans=0.05 2023-06-23 19:28:09,294 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-23 19:28:29,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1559682.0, ans=0.125 2023-06-23 19:28:32,480 INFO [train.py:996] (0/4) Epoch 9, batch 16000, loss[loss=0.2914, simple_loss=0.3639, pruned_loss=0.1095, over 21585.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3001, pruned_loss=0.07679, over 4245520.73 frames. ], batch size: 471, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:28:52,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1559802.0, ans=0.1 2023-06-23 19:28:55,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1559802.0, ans=0.0 2023-06-23 19:29:07,946 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.766e+02 5.425e+02 7.645e+02 1.261e+03 2.910e+03, threshold=1.529e+03, percent-clipped=25.0 2023-06-23 19:29:43,847 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-23 19:29:55,381 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-260000.pt 2023-06-23 19:30:05,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.98 vs. limit=10.0 2023-06-23 19:30:09,391 INFO [train.py:996] (0/4) Epoch 9, batch 16050, loss[loss=0.3578, simple_loss=0.4813, pruned_loss=0.1171, over 19780.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3022, pruned_loss=0.07415, over 4257646.40 frames. 
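
The checkpoint.py entry above saves checkpoint-260000.pt, i.e. a file keyed by the global batch index. A hedged sketch of interval-based checkpointing; exactly what the real script stores (sampler state, grad scaler, averaged model, ...) is assumed rather than visible in the log.

```python
# Hedged sketch of interval-based checkpointing that would produce filenames
# like checkpoint-260000.pt under the experiment directory.
from pathlib import Path

import torch


def maybe_save_checkpoint(exp_dir: Path, batch_idx_train: int, every_n: int,
                          model, optimizer, scheduler=None) -> None:
    if batch_idx_train == 0 or batch_idx_train % every_n != 0:
        return
    ckpt = {
        "batch_idx_train": batch_idx_train,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict() if scheduler is not None else None,
    }
    path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    print(f"Saving checkpoint to {path}")
    torch.save(ckpt, path)
```
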
], batch size: 702, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:30:14,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1560042.0, ans=0.125 2023-06-23 19:30:48,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1560162.0, ans=0.0 2023-06-23 19:31:48,840 INFO [train.py:996] (0/4) Epoch 9, batch 16100, loss[loss=0.2379, simple_loss=0.3041, pruned_loss=0.08588, over 21541.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3081, pruned_loss=0.07614, over 4271716.39 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:31:52,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1560342.0, ans=0.125 2023-06-23 19:31:55,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1560342.0, ans=0.025 2023-06-23 19:32:01,359 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-23 19:32:20,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1560402.0, ans=0.125 2023-06-23 19:32:24,537 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.839e+02 1.039e+03 1.501e+03 2.959e+03, threshold=2.078e+03, percent-clipped=23.0 2023-06-23 19:32:37,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1560462.0, ans=0.125 2023-06-23 19:32:56,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1560522.0, ans=0.125 2023-06-23 19:33:20,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1560582.0, ans=0.0 2023-06-23 19:33:29,454 INFO [train.py:996] (0/4) Epoch 9, batch 16150, loss[loss=0.2437, simple_loss=0.3064, pruned_loss=0.09055, over 21931.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3098, pruned_loss=0.07801, over 4279433.31 frames. ], batch size: 351, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:33:56,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1560702.0, ans=0.07 2023-06-23 19:33:59,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1560702.0, ans=0.125 2023-06-23 19:34:31,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1560822.0, ans=0.0 2023-06-23 19:34:33,646 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.85 vs. limit=15.0 2023-06-23 19:34:34,892 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=12.0 2023-06-23 19:34:54,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1560882.0, ans=0.1 2023-06-23 19:35:08,335 INFO [train.py:996] (0/4) Epoch 9, batch 16200, loss[loss=0.2397, simple_loss=0.3276, pruned_loss=0.07587, over 21456.00 frames. 
], tot_loss[loss=0.2365, simple_loss=0.3138, pruned_loss=0.07956, over 4289190.73 frames. ], batch size: 211, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:35:30,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1561002.0, ans=0.0 2023-06-23 19:35:45,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.932e+02 6.489e+02 9.592e+02 1.250e+03 2.736e+03, threshold=1.918e+03, percent-clipped=7.0 2023-06-23 19:36:37,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1561182.0, ans=0.125 2023-06-23 19:36:39,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1561182.0, ans=0.125 2023-06-23 19:36:56,514 INFO [train.py:996] (0/4) Epoch 9, batch 16250, loss[loss=0.2323, simple_loss=0.3159, pruned_loss=0.07439, over 21331.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3151, pruned_loss=0.08046, over 4290245.68 frames. ], batch size: 549, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:37:03,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1561242.0, ans=0.04949747468305833 2023-06-23 19:37:27,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1561302.0, ans=0.125 2023-06-23 19:37:42,286 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-23 19:37:55,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1561362.0, ans=0.0 2023-06-23 19:37:55,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1561362.0, ans=0.0 2023-06-23 19:38:25,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1561482.0, ans=0.125 2023-06-23 19:38:36,189 INFO [train.py:996] (0/4) Epoch 9, batch 16300, loss[loss=0.1709, simple_loss=0.2439, pruned_loss=0.04899, over 21820.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3088, pruned_loss=0.07657, over 4278024.27 frames. ], batch size: 107, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:39:18,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.285e+02 4.893e+02 6.873e+02 9.809e+02 2.054e+03, threshold=1.375e+03, percent-clipped=1.0 2023-06-23 19:39:23,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1561662.0, ans=0.1 2023-06-23 19:39:57,346 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=22.5 2023-06-23 19:40:15,725 INFO [train.py:996] (0/4) Epoch 9, batch 16350, loss[loss=0.2855, simple_loss=0.3536, pruned_loss=0.1087, over 21902.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3102, pruned_loss=0.07805, over 4276367.62 frames. ], batch size: 372, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:40:28,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.20 vs. 
limit=22.5 2023-06-23 19:41:14,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1561962.0, ans=0.1 2023-06-23 19:41:16,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1562022.0, ans=0.5 2023-06-23 19:41:30,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-06-23 19:41:50,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1562082.0, ans=0.1 2023-06-23 19:41:54,455 INFO [train.py:996] (0/4) Epoch 9, batch 16400, loss[loss=0.307, simple_loss=0.3554, pruned_loss=0.1293, over 21703.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3111, pruned_loss=0.07887, over 4262673.24 frames. ], batch size: 507, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:42:35,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1562262.0, ans=0.0 2023-06-23 19:42:37,806 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.770e+02 4.704e+02 7.326e+02 1.027e+03 2.811e+03, threshold=1.465e+03, percent-clipped=10.0 2023-06-23 19:42:49,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1562262.0, ans=0.0 2023-06-23 19:42:56,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1562322.0, ans=0.07 2023-06-23 19:43:13,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1562382.0, ans=0.125 2023-06-23 19:43:14,342 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.36 vs. limit=10.0 2023-06-23 19:43:34,540 INFO [train.py:996] (0/4) Epoch 9, batch 16450, loss[loss=0.2309, simple_loss=0.3077, pruned_loss=0.07702, over 21553.00 frames. ], tot_loss[loss=0.237, simple_loss=0.313, pruned_loss=0.08049, over 4269275.25 frames. ], batch size: 131, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:44:00,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1562502.0, ans=0.125 2023-06-23 19:44:14,930 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-23 19:45:15,090 INFO [train.py:996] (0/4) Epoch 9, batch 16500, loss[loss=0.2129, simple_loss=0.2688, pruned_loss=0.07848, over 21257.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.312, pruned_loss=0.08122, over 4275775.19 frames. 
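
Every training entry splits the loss into loss, simple_loss and pruned_loss, and the logged numbers are consistent with loss = 0.5 * simple_loss + pruned_loss (for the tot_loss just above, 0.5 * 0.312 + 0.08122 ≈ 0.2372). A minimal sketch of that combination; whether the 0.5 weight is fixed or itself scheduled over training is not visible in this log.

```python
# The logged triples are consistent with loss = 0.5 * simple_loss + pruned_loss,
# the usual way a cheap "simple" transducer loss is combined with the exact
# pruned loss. The weight is passed explicitly so it can be scheduled.
def combine_losses(simple_loss: float, pruned_loss: float,
                   simple_loss_scale: float = 0.5) -> float:
    return simple_loss_scale * simple_loss + pruned_loss


# Check against the tot_loss logged just above.
assert abs(combine_losses(0.312, 0.08122) - 0.2372) < 1e-3
```
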
], batch size: 176, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:45:18,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1562742.0, ans=0.125 2023-06-23 19:45:48,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1562802.0, ans=0.2 2023-06-23 19:45:51,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1562802.0, ans=0.2 2023-06-23 19:46:03,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.321e+02 6.770e+02 9.881e+02 1.345e+03 3.319e+03, threshold=1.976e+03, percent-clipped=17.0 2023-06-23 19:46:04,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1562862.0, ans=0.125 2023-06-23 19:46:23,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1562922.0, ans=0.1 2023-06-23 19:46:56,225 INFO [train.py:996] (0/4) Epoch 9, batch 16550, loss[loss=0.2272, simple_loss=0.3189, pruned_loss=0.06773, over 21630.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3143, pruned_loss=0.08035, over 4261020.51 frames. ], batch size: 414, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:47:14,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1563042.0, ans=0.0 2023-06-23 19:47:34,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0 2023-06-23 19:47:37,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1563102.0, ans=0.0 2023-06-23 19:48:47,711 INFO [train.py:996] (0/4) Epoch 9, batch 16600, loss[loss=0.2865, simple_loss=0.4051, pruned_loss=0.08392, over 19732.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3232, pruned_loss=0.08376, over 4260234.21 frames. ], batch size: 702, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:48:53,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=22.5 2023-06-23 19:49:09,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1563402.0, ans=0.0 2023-06-23 19:49:09,781 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-23 19:49:20,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1563402.0, ans=0.125 2023-06-23 19:49:26,434 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.870e+02 6.782e+02 8.603e+02 1.169e+03 2.865e+03, threshold=1.721e+03, percent-clipped=6.0 2023-06-23 19:49:41,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1563462.0, ans=0.0 2023-06-23 19:50:19,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1563582.0, ans=0.125 2023-06-23 19:50:28,642 INFO [train.py:996] (0/4) Epoch 9, batch 16650, loss[loss=0.2464, simple_loss=0.3248, pruned_loss=0.08399, over 21787.00 frames. 
], tot_loss[loss=0.2524, simple_loss=0.332, pruned_loss=0.08639, over 4269029.58 frames. ], batch size: 298, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:51:00,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1563702.0, ans=0.0 2023-06-23 19:51:14,639 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=22.5 2023-06-23 19:51:39,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1563822.0, ans=0.1 2023-06-23 19:52:05,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1563882.0, ans=0.1 2023-06-23 19:52:07,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1563942.0, ans=0.0 2023-06-23 19:52:08,221 INFO [train.py:996] (0/4) Epoch 9, batch 16700, loss[loss=0.2197, simple_loss=0.3026, pruned_loss=0.06839, over 20714.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3322, pruned_loss=0.0867, over 4267515.00 frames. ], batch size: 607, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:52:32,658 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-23 19:52:57,952 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.483e+02 5.925e+02 7.840e+02 1.052e+03 2.518e+03, threshold=1.568e+03, percent-clipped=7.0 2023-06-23 19:52:58,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1564062.0, ans=0.125 2023-06-23 19:53:20,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1564122.0, ans=0.125 2023-06-23 19:53:32,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1564122.0, ans=0.0 2023-06-23 19:53:55,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1564182.0, ans=0.125 2023-06-23 19:53:57,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1564242.0, ans=0.5 2023-06-23 19:53:58,341 INFO [train.py:996] (0/4) Epoch 9, batch 16750, loss[loss=0.2644, simple_loss=0.3378, pruned_loss=0.09548, over 21787.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3338, pruned_loss=0.08883, over 4262734.00 frames. ], batch size: 124, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:54:45,464 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.47 vs. 
limit=12.0 2023-06-23 19:54:50,217 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:55:03,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1564422.0, ans=22.5 2023-06-23 19:55:26,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1564482.0, ans=0.1 2023-06-23 19:55:43,497 INFO [train.py:996] (0/4) Epoch 9, batch 16800, loss[loss=0.3164, simple_loss=0.3785, pruned_loss=0.1271, over 21597.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3379, pruned_loss=0.08868, over 4262398.34 frames. ], batch size: 471, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:56:26,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 6.649e+02 8.449e+02 1.121e+03 2.457e+03, threshold=1.690e+03, percent-clipped=14.0 2023-06-23 19:56:29,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1564662.0, ans=0.125 2023-06-23 19:56:29,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1564662.0, ans=0.1 2023-06-23 19:56:32,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1564662.0, ans=0.125 2023-06-23 19:56:34,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-23 19:57:23,022 INFO [train.py:996] (0/4) Epoch 9, batch 16850, loss[loss=0.238, simple_loss=0.2978, pruned_loss=0.0891, over 21565.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3325, pruned_loss=0.08844, over 4267515.80 frames. ], batch size: 195, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:58:42,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1565082.0, ans=0.2 2023-06-23 19:58:53,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1565082.0, ans=0.025 2023-06-23 19:59:07,902 INFO [train.py:996] (0/4) Epoch 9, batch 16900, loss[loss=0.2132, simple_loss=0.2854, pruned_loss=0.07047, over 21654.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3284, pruned_loss=0.08714, over 4271051.81 frames. ], batch size: 332, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 19:59:46,593 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.622e+02 4.767e+02 6.726e+02 1.260e+03 2.714e+03, threshold=1.345e+03, percent-clipped=10.0 2023-06-23 20:00:31,194 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:00:33,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.73 vs. limit=10.0 2023-06-23 20:00:45,064 INFO [train.py:996] (0/4) Epoch 9, batch 16950, loss[loss=0.2071, simple_loss=0.2765, pruned_loss=0.06889, over 21823.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3201, pruned_loss=0.08503, over 4268938.41 frames. 
], batch size: 282, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:01:01,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1565502.0, ans=0.0 2023-06-23 20:01:17,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1565502.0, ans=0.125 2023-06-23 20:01:17,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1565502.0, ans=0.125 2023-06-23 20:01:19,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1565502.0, ans=0.125 2023-06-23 20:01:30,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1565562.0, ans=0.2 2023-06-23 20:01:33,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1565562.0, ans=0.125 2023-06-23 20:01:55,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1565622.0, ans=0.05 2023-06-23 20:02:06,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-23 20:02:23,844 INFO [train.py:996] (0/4) Epoch 9, batch 17000, loss[loss=0.228, simple_loss=0.2941, pruned_loss=0.08096, over 21681.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.316, pruned_loss=0.08473, over 4274709.47 frames. ], batch size: 263, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:02:37,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1565742.0, ans=0.09899494936611666 2023-06-23 20:03:04,626 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.656e+02 4.823e+02 5.834e+02 7.335e+02 1.533e+03, threshold=1.167e+03, percent-clipped=2.0 2023-06-23 20:03:08,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1565862.0, ans=0.0 2023-06-23 20:03:32,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1565922.0, ans=0.05 2023-06-23 20:03:34,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1565922.0, ans=0.125 2023-06-23 20:04:05,386 INFO [train.py:996] (0/4) Epoch 9, batch 17050, loss[loss=0.2254, simple_loss=0.286, pruned_loss=0.08241, over 20231.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3231, pruned_loss=0.08735, over 4281211.36 frames. 
], batch size: 703, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:04:10,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1566042.0, ans=0.0 2023-06-23 20:04:51,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1566162.0, ans=0.1 2023-06-23 20:04:59,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1566162.0, ans=0.125 2023-06-23 20:05:30,132 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:05:40,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1566282.0, ans=0.125 2023-06-23 20:05:40,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.36 vs. limit=15.0 2023-06-23 20:05:44,758 INFO [train.py:996] (0/4) Epoch 9, batch 17100, loss[loss=0.1991, simple_loss=0.2693, pruned_loss=0.06449, over 21685.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3207, pruned_loss=0.08725, over 4291557.30 frames. ], batch size: 263, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:06:10,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1566402.0, ans=0.0 2023-06-23 20:06:24,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.937e+02 5.247e+02 7.699e+02 1.064e+03 2.324e+03, threshold=1.540e+03, percent-clipped=17.0 2023-06-23 20:06:27,716 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-23 20:06:46,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1566522.0, ans=0.0 2023-06-23 20:06:57,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1566522.0, ans=0.125 2023-06-23 20:07:23,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1566642.0, ans=0.125 2023-06-23 20:07:24,733 INFO [train.py:996] (0/4) Epoch 9, batch 17150, loss[loss=0.2518, simple_loss=0.3048, pruned_loss=0.09941, over 21395.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3171, pruned_loss=0.08591, over 4284268.89 frames. ], batch size: 176, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:07:47,456 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:07:50,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.99 vs. 
limit=15.0 2023-06-23 20:07:54,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1566702.0, ans=0.0 2023-06-23 20:07:56,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1566702.0, ans=0.04949747468305833 2023-06-23 20:08:15,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1566762.0, ans=0.2 2023-06-23 20:08:45,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1566822.0, ans=0.1 2023-06-23 20:08:49,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1566882.0, ans=0.0 2023-06-23 20:09:05,729 INFO [train.py:996] (0/4) Epoch 9, batch 17200, loss[loss=0.2646, simple_loss=0.3278, pruned_loss=0.1007, over 21739.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3167, pruned_loss=0.08506, over 4284348.83 frames. ], batch size: 298, lr: 3.27e-03, grad_scale: 32.0 2023-06-23 20:09:53,662 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.320e+02 8.105e+02 1.085e+03 1.487e+03 3.292e+03, threshold=2.169e+03, percent-clipped=22.0 2023-06-23 20:09:54,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1567062.0, ans=0.04949747468305833 2023-06-23 20:10:35,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1567182.0, ans=0.0 2023-06-23 20:10:49,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1567182.0, ans=0.0 2023-06-23 20:10:52,888 INFO [train.py:996] (0/4) Epoch 9, batch 17250, loss[loss=0.2952, simple_loss=0.3574, pruned_loss=0.1165, over 21805.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3197, pruned_loss=0.08692, over 4280141.30 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:11:41,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1567362.0, ans=0.035 2023-06-23 20:11:49,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1567362.0, ans=0.2 2023-06-23 20:12:35,688 INFO [train.py:996] (0/4) Epoch 9, batch 17300, loss[loss=0.2724, simple_loss=0.3413, pruned_loss=0.1017, over 21929.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3273, pruned_loss=0.09053, over 4277625.12 frames. 
], batch size: 372, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:12:52,857 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:13:05,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1567602.0, ans=0.0 2023-06-23 20:13:28,004 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.316e+02 5.597e+02 7.560e+02 1.039e+03 2.489e+03, threshold=1.512e+03, percent-clipped=1.0 2023-06-23 20:13:28,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1567662.0, ans=0.1 2023-06-23 20:14:19,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1567782.0, ans=0.0 2023-06-23 20:14:22,084 INFO [train.py:996] (0/4) Epoch 9, batch 17350, loss[loss=0.2108, simple_loss=0.2918, pruned_loss=0.06483, over 21432.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3275, pruned_loss=0.09051, over 4275268.28 frames. ], batch size: 211, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:14:34,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1567842.0, ans=0.0 2023-06-23 20:15:20,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1567962.0, ans=0.125 2023-06-23 20:15:56,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1568082.0, ans=0.0 2023-06-23 20:16:08,001 INFO [train.py:996] (0/4) Epoch 9, batch 17400, loss[loss=0.2465, simple_loss=0.3321, pruned_loss=0.08048, over 21739.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3237, pruned_loss=0.08685, over 4267178.89 frames. ], batch size: 332, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:16:35,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1568202.0, ans=0.0 2023-06-23 20:16:43,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1568202.0, ans=0.125 2023-06-23 20:16:55,294 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.466e+02 5.667e+02 9.192e+02 1.513e+03 3.310e+03, threshold=1.838e+03, percent-clipped=24.0 2023-06-23 20:16:59,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1568262.0, ans=0.125 2023-06-23 20:17:49,327 INFO [train.py:996] (0/4) Epoch 9, batch 17450, loss[loss=0.207, simple_loss=0.3007, pruned_loss=0.05668, over 21628.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3207, pruned_loss=0.08422, over 4273923.34 frames. ], batch size: 247, lr: 3.27e-03, grad_scale: 8.0 2023-06-23 20:18:07,673 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.06 vs. 
limit=22.5 2023-06-23 20:18:51,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1568622.0, ans=0.125 2023-06-23 20:18:55,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1568622.0, ans=0.125 2023-06-23 20:19:21,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1568682.0, ans=0.035 2023-06-23 20:19:21,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1568682.0, ans=0.0 2023-06-23 20:19:27,738 INFO [train.py:996] (0/4) Epoch 9, batch 17500, loss[loss=0.2229, simple_loss=0.2885, pruned_loss=0.0787, over 21671.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3166, pruned_loss=0.08108, over 4273601.11 frames. ], batch size: 230, lr: 3.27e-03, grad_scale: 8.0 2023-06-23 20:20:11,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1568862.0, ans=0.125 2023-06-23 20:20:19,175 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.530e+02 5.781e+02 8.305e+02 1.162e+03 2.249e+03, threshold=1.661e+03, percent-clipped=4.0 2023-06-23 20:20:31,946 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.82 vs. limit=10.0 2023-06-23 20:20:41,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1568922.0, ans=0.125 2023-06-23 20:20:41,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1568922.0, ans=0.125 2023-06-23 20:20:59,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1568982.0, ans=0.0 2023-06-23 20:21:04,911 INFO [train.py:996] (0/4) Epoch 9, batch 17550, loss[loss=0.2303, simple_loss=0.315, pruned_loss=0.07282, over 21395.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3166, pruned_loss=0.07969, over 4269286.11 frames. ], batch size: 131, lr: 3.27e-03, grad_scale: 8.0 2023-06-23 20:21:46,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.99 vs. limit=10.0 2023-06-23 20:21:57,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1569162.0, ans=0.1 2023-06-23 20:22:18,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1569222.0, ans=0.0 2023-06-23 20:22:32,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1569282.0, ans=0.0 2023-06-23 20:22:43,274 INFO [train.py:996] (0/4) Epoch 9, batch 17600, loss[loss=0.2615, simple_loss=0.3301, pruned_loss=0.09651, over 21724.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3181, pruned_loss=0.07935, over 4266547.84 frames. 
], batch size: 298, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:22:54,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1569342.0, ans=0.125 2023-06-23 20:23:17,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1569402.0, ans=0.125 2023-06-23 20:23:35,484 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:23:38,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.747e+02 5.641e+02 7.220e+02 1.088e+03 2.051e+03, threshold=1.444e+03, percent-clipped=1.0 2023-06-23 20:24:05,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1569522.0, ans=0.2 2023-06-23 20:24:10,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1569582.0, ans=0.2 2023-06-23 20:24:25,260 INFO [train.py:996] (0/4) Epoch 9, batch 17650, loss[loss=0.2611, simple_loss=0.3458, pruned_loss=0.08819, over 21473.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3171, pruned_loss=0.08022, over 4245652.95 frames. ], batch size: 131, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:24:25,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1569642.0, ans=0.1 2023-06-23 20:26:06,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1569882.0, ans=0.2 2023-06-23 20:26:14,546 INFO [train.py:996] (0/4) Epoch 9, batch 17700, loss[loss=0.316, simple_loss=0.3824, pruned_loss=0.1248, over 21448.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3113, pruned_loss=0.07842, over 4252536.49 frames. ], batch size: 471, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:26:24,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1569942.0, ans=0.07 2023-06-23 20:26:37,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1570002.0, ans=0.125 2023-06-23 20:26:39,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1570002.0, ans=0.1 2023-06-23 20:26:43,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=22.5 2023-06-23 20:26:59,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1570062.0, ans=0.0 2023-06-23 20:27:02,720 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.206e+02 6.405e+02 1.162e+03 1.769e+03 3.070e+03, threshold=2.325e+03, percent-clipped=36.0 2023-06-23 20:27:54,403 INFO [train.py:996] (0/4) Epoch 9, batch 17750, loss[loss=0.3361, simple_loss=0.3938, pruned_loss=0.1392, over 21470.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3184, pruned_loss=0.08201, over 4255484.81 frames. 
], batch size: 471, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:27:54,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1570242.0, ans=0.125 2023-06-23 20:27:56,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1570242.0, ans=0.125 2023-06-23 20:28:16,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1570302.0, ans=0.0 2023-06-23 20:28:19,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1570302.0, ans=0.1 2023-06-23 20:28:19,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1570302.0, ans=0.05 2023-06-23 20:28:31,384 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.53 vs. limit=22.5 2023-06-23 20:28:39,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1570362.0, ans=0.04949747468305833 2023-06-23 20:29:17,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1570422.0, ans=6.0 2023-06-23 20:29:26,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1570482.0, ans=0.125 2023-06-23 20:29:40,677 INFO [train.py:996] (0/4) Epoch 9, batch 17800, loss[loss=0.3255, simple_loss=0.3992, pruned_loss=0.1259, over 21414.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3176, pruned_loss=0.08148, over 4257813.11 frames. ], batch size: 507, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:30:11,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1570602.0, ans=0.0 2023-06-23 20:30:19,649 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5 2023-06-23 20:30:24,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.850e+02 7.073e+02 9.847e+02 1.535e+03 2.589e+03, threshold=1.969e+03, percent-clipped=1.0 2023-06-23 20:30:41,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-23 20:31:14,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1570782.0, ans=0.0 2023-06-23 20:31:17,183 INFO [train.py:996] (0/4) Epoch 9, batch 17850, loss[loss=0.3213, simple_loss=0.3838, pruned_loss=0.1293, over 21740.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.318, pruned_loss=0.08168, over 4266513.46 frames. 
], batch size: 441, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:32:04,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1570962.0, ans=0.0 2023-06-23 20:32:40,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1571082.0, ans=0.125 2023-06-23 20:32:56,166 INFO [train.py:996] (0/4) Epoch 9, batch 17900, loss[loss=0.2612, simple_loss=0.3514, pruned_loss=0.08551, over 21778.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3233, pruned_loss=0.08395, over 4269795.86 frames. ], batch size: 282, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:32:57,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.40 vs. limit=10.0 2023-06-23 20:33:29,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1571202.0, ans=0.1 2023-06-23 20:33:49,667 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.056e+02 5.757e+02 7.357e+02 9.993e+02 2.226e+03, threshold=1.471e+03, percent-clipped=2.0 2023-06-23 20:33:58,823 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=22.5 2023-06-23 20:34:08,773 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.74 vs. limit=5.0 2023-06-23 20:34:21,881 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.54 vs. limit=15.0 2023-06-23 20:34:41,353 INFO [train.py:996] (0/4) Epoch 9, batch 17950, loss[loss=0.2295, simple_loss=0.321, pruned_loss=0.069, over 21628.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.321, pruned_loss=0.08, over 4268896.79 frames. ], batch size: 389, lr: 3.27e-03, grad_scale: 8.0 2023-06-23 20:34:41,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1571442.0, ans=0.0 2023-06-23 20:34:47,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-23 20:34:48,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1571442.0, ans=0.0 2023-06-23 20:35:28,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1571562.0, ans=0.09899494936611666 2023-06-23 20:35:52,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1571622.0, ans=0.125 2023-06-23 20:35:56,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1571622.0, ans=0.125 2023-06-23 20:36:10,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=15.0 2023-06-23 20:36:19,128 INFO [train.py:996] (0/4) Epoch 9, batch 18000, loss[loss=0.2108, simple_loss=0.2768, pruned_loss=0.07242, over 21664.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.314, pruned_loss=0.07852, over 4275860.54 frames. 
], batch size: 333, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:36:19,129 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 20:36:36,001 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2626, simple_loss=0.3575, pruned_loss=0.08385, over 1796401.00 frames. 2023-06-23 20:36:36,001 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 20:37:11,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1571802.0, ans=0.0 2023-06-23 20:37:17,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1571862.0, ans=0.125 2023-06-23 20:37:28,973 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:37:29,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.324e+02 4.978e+02 7.262e+02 1.032e+03 1.973e+03, threshold=1.452e+03, percent-clipped=7.0 2023-06-23 20:38:20,230 INFO [train.py:996] (0/4) Epoch 9, batch 18050, loss[loss=0.198, simple_loss=0.2713, pruned_loss=0.06239, over 21512.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.308, pruned_loss=0.07735, over 4263702.55 frames. ], batch size: 230, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:38:24,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1572042.0, ans=0.0 2023-06-23 20:38:31,253 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5 2023-06-23 20:39:23,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1572222.0, ans=0.125 2023-06-23 20:39:58,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.08 vs. limit=10.0 2023-06-23 20:40:00,666 INFO [train.py:996] (0/4) Epoch 9, batch 18100, loss[loss=0.2829, simple_loss=0.3574, pruned_loss=0.1042, over 21829.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3127, pruned_loss=0.07999, over 4270799.47 frames. ], batch size: 118, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:40:12,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1572342.0, ans=0.125 2023-06-23 20:40:40,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1572402.0, ans=0.125 2023-06-23 20:40:55,088 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.018e+02 5.547e+02 7.703e+02 1.178e+03 2.128e+03, threshold=1.541e+03, percent-clipped=13.0 2023-06-23 20:41:38,962 INFO [train.py:996] (0/4) Epoch 9, batch 18150, loss[loss=0.18, simple_loss=0.2546, pruned_loss=0.0527, over 21513.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3144, pruned_loss=0.07969, over 4272143.59 frames. 
], batch size: 132, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:41:59,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1572702.0, ans=0.1 2023-06-23 20:42:32,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1572762.0, ans=0.125 2023-06-23 20:42:40,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1572822.0, ans=0.1 2023-06-23 20:43:05,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1572882.0, ans=0.04949747468305833 2023-06-23 20:43:11,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1572882.0, ans=0.125 2023-06-23 20:43:15,634 INFO [train.py:996] (0/4) Epoch 9, batch 18200, loss[loss=0.1953, simple_loss=0.2702, pruned_loss=0.06022, over 21818.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3091, pruned_loss=0.0799, over 4272174.63 frames. ], batch size: 118, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:44:03,318 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.079e+02 5.748e+02 7.422e+02 1.064e+03 2.381e+03, threshold=1.484e+03, percent-clipped=9.0 2023-06-23 20:44:51,697 INFO [train.py:996] (0/4) Epoch 9, batch 18250, loss[loss=0.196, simple_loss=0.2622, pruned_loss=0.06489, over 21352.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3006, pruned_loss=0.07658, over 4249600.10 frames. ], batch size: 144, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:45:03,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0 2023-06-23 20:45:09,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1573302.0, ans=0.125 2023-06-23 20:45:13,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1573302.0, ans=0.125 2023-06-23 20:45:30,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1573362.0, ans=0.1 2023-06-23 20:45:30,701 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.57 vs. limit=10.0 2023-06-23 20:46:08,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1573482.0, ans=0.125 2023-06-23 20:46:15,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1573482.0, ans=0.125 2023-06-23 20:46:18,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. limit=10.0 2023-06-23 20:46:30,617 INFO [train.py:996] (0/4) Epoch 9, batch 18300, loss[loss=0.2065, simple_loss=0.2763, pruned_loss=0.06833, over 21348.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2997, pruned_loss=0.07563, over 4251278.44 frames. ], batch size: 159, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:46:31,649 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.66 vs. 
limit=15.0 2023-06-23 20:46:38,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1573542.0, ans=0.1 2023-06-23 20:47:05,262 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:47:14,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.309e+02 5.362e+02 7.234e+02 9.653e+02 2.224e+03, threshold=1.447e+03, percent-clipped=7.0 2023-06-23 20:47:57,714 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=12.0 2023-06-23 20:48:08,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1573842.0, ans=0.125 2023-06-23 20:48:09,464 INFO [train.py:996] (0/4) Epoch 9, batch 18350, loss[loss=0.2066, simple_loss=0.2859, pruned_loss=0.06367, over 21649.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3076, pruned_loss=0.07634, over 4245168.45 frames. ], batch size: 332, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:48:16,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1573842.0, ans=0.05 2023-06-23 20:49:48,149 INFO [train.py:996] (0/4) Epoch 9, batch 18400, loss[loss=0.217, simple_loss=0.2887, pruned_loss=0.07263, over 21832.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3039, pruned_loss=0.07575, over 4249974.02 frames. ], batch size: 107, lr: 3.27e-03, grad_scale: 32.0 2023-06-23 20:50:38,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.618e+02 5.691e+02 8.490e+02 1.304e+03 3.377e+03, threshold=1.698e+03, percent-clipped=15.0 2023-06-23 20:51:24,277 INFO [train.py:996] (0/4) Epoch 9, batch 18450, loss[loss=0.1757, simple_loss=0.2643, pruned_loss=0.04358, over 21524.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3014, pruned_loss=0.07225, over 4253174.56 frames. ], batch size: 230, lr: 3.27e-03, grad_scale: 32.0 2023-06-23 20:52:00,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1574562.0, ans=0.125 2023-06-23 20:52:03,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-23 20:52:21,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.74 vs. limit=15.0 2023-06-23 20:52:54,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1574682.0, ans=0.125 2023-06-23 20:53:02,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1574742.0, ans=0.025 2023-06-23 20:53:03,443 INFO [train.py:996] (0/4) Epoch 9, batch 18500, loss[loss=0.1997, simple_loss=0.2666, pruned_loss=0.06643, over 21334.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2961, pruned_loss=0.0712, over 4250391.89 frames. 
], batch size: 131, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 20:53:34,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1574802.0, ans=0.125 2023-06-23 20:53:55,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1574862.0, ans=0.125 2023-06-23 20:53:57,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.581e+02 5.113e+02 7.633e+02 1.101e+03 4.944e+03, threshold=1.527e+03, percent-clipped=5.0 2023-06-23 20:54:02,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1574862.0, ans=0.125 2023-06-23 20:54:42,165 INFO [train.py:996] (0/4) Epoch 9, batch 18550, loss[loss=0.2046, simple_loss=0.2701, pruned_loss=0.06957, over 21761.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2939, pruned_loss=0.07053, over 4259176.60 frames. ], batch size: 124, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 20:54:59,165 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2023-06-23 20:54:59,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.74 vs. limit=15.0 2023-06-23 20:55:00,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1575102.0, ans=0.125 2023-06-23 20:55:10,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1575102.0, ans=0.2 2023-06-23 20:55:46,173 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0 2023-06-23 20:56:21,044 INFO [train.py:996] (0/4) Epoch 9, batch 18600, loss[loss=0.1993, simple_loss=0.2684, pruned_loss=0.06503, over 21109.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.291, pruned_loss=0.07091, over 4265411.34 frames. ], batch size: 143, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 20:57:04,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-23 20:57:16,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.060e+02 5.217e+02 6.797e+02 9.080e+02 2.355e+03, threshold=1.359e+03, percent-clipped=3.0 2023-06-23 20:57:20,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1575462.0, ans=0.0 2023-06-23 20:57:23,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1575522.0, ans=0.125 2023-06-23 20:57:59,923 INFO [train.py:996] (0/4) Epoch 9, batch 18650, loss[loss=0.2433, simple_loss=0.304, pruned_loss=0.0913, over 21773.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2921, pruned_loss=0.07235, over 4273379.33 frames. ], batch size: 352, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 20:58:10,540 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. 
limit=6.0 2023-06-23 20:58:24,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1575702.0, ans=0.2 2023-06-23 20:58:46,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1575762.0, ans=0.0 2023-06-23 20:58:48,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1575762.0, ans=0.95 2023-06-23 20:59:00,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1575822.0, ans=0.125 2023-06-23 20:59:04,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1575822.0, ans=0.2 2023-06-23 20:59:33,510 INFO [train.py:996] (0/4) Epoch 9, batch 18700, loss[loss=0.2069, simple_loss=0.2735, pruned_loss=0.07015, over 21865.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2903, pruned_loss=0.07386, over 4270500.37 frames. ], batch size: 107, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 20:59:49,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1575942.0, ans=0.0 2023-06-23 21:00:25,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1576062.0, ans=0.1 2023-06-23 21:00:27,898 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.989e+02 5.925e+02 8.021e+02 1.131e+03 2.066e+03, threshold=1.604e+03, percent-clipped=15.0 2023-06-23 21:00:37,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1576122.0, ans=0.0 2023-06-23 21:00:46,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-23 21:00:50,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1576122.0, ans=0.125 2023-06-23 21:01:08,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1576182.0, ans=0.125 2023-06-23 21:01:08,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1576182.0, ans=0.0 2023-06-23 21:01:10,851 INFO [train.py:996] (0/4) Epoch 9, batch 18750, loss[loss=0.2034, simple_loss=0.2666, pruned_loss=0.07015, over 21239.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2908, pruned_loss=0.07539, over 4261865.39 frames. ], batch size: 608, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:01:52,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1576362.0, ans=0.125 2023-06-23 21:02:09,769 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.78 vs. limit=12.0 2023-06-23 21:02:50,411 INFO [train.py:996] (0/4) Epoch 9, batch 18800, loss[loss=0.2555, simple_loss=0.3437, pruned_loss=0.08367, over 21508.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2983, pruned_loss=0.07793, over 4253438.95 frames. 
], batch size: 471, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:03:10,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1576602.0, ans=0.0 2023-06-23 21:03:18,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1576602.0, ans=0.125 2023-06-23 21:03:47,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1576662.0, ans=0.125 2023-06-23 21:03:48,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.728e+02 5.591e+02 7.847e+02 1.340e+03 4.014e+03, threshold=1.569e+03, percent-clipped=18.0 2023-06-23 21:04:11,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.48 vs. limit=22.5 2023-06-23 21:04:23,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1576842.0, ans=0.2 2023-06-23 21:04:24,331 INFO [train.py:996] (0/4) Epoch 9, batch 18850, loss[loss=0.1607, simple_loss=0.2547, pruned_loss=0.03331, over 21639.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2954, pruned_loss=0.0734, over 4259610.97 frames. ], batch size: 263, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:05:19,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1576962.0, ans=0.0 2023-06-23 21:05:46,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1577082.0, ans=0.0 2023-06-23 21:05:57,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1577082.0, ans=0.125 2023-06-23 21:06:01,648 INFO [train.py:996] (0/4) Epoch 9, batch 18900, loss[loss=0.2007, simple_loss=0.2714, pruned_loss=0.06497, over 21822.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2905, pruned_loss=0.0718, over 4260818.49 frames. ], batch size: 371, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:06:06,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1577142.0, ans=0.0 2023-06-23 21:06:21,885 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:06:27,359 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-23 21:07:02,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 5.570e+02 8.138e+02 1.105e+03 2.529e+03, threshold=1.628e+03, percent-clipped=6.0 2023-06-23 21:07:03,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1577262.0, ans=0.0 2023-06-23 21:07:40,592 INFO [train.py:996] (0/4) Epoch 9, batch 18950, loss[loss=0.2586, simple_loss=0.3363, pruned_loss=0.09051, over 21800.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2911, pruned_loss=0.07451, over 4269634.77 frames. 
], batch size: 414, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:08:24,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1577502.0, ans=0.125 2023-06-23 21:09:15,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1577682.0, ans=0.0 2023-06-23 21:09:25,316 INFO [train.py:996] (0/4) Epoch 9, batch 19000, loss[loss=0.2389, simple_loss=0.2991, pruned_loss=0.08936, over 20188.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3011, pruned_loss=0.07612, over 4261676.21 frames. ], batch size: 703, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:09:37,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-23 21:10:12,283 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:10:22,700 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.986e+02 5.803e+02 7.303e+02 9.676e+02 2.097e+03, threshold=1.461e+03, percent-clipped=8.0 2023-06-23 21:10:26,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1577922.0, ans=0.125 2023-06-23 21:10:26,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1577922.0, ans=0.2 2023-06-23 21:11:04,906 INFO [train.py:996] (0/4) Epoch 9, batch 19050, loss[loss=0.2652, simple_loss=0.3292, pruned_loss=0.1006, over 21842.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3075, pruned_loss=0.07948, over 4270223.97 frames. ], batch size: 124, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:11:07,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1578042.0, ans=0.125 2023-06-23 21:11:48,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1578102.0, ans=0.1 2023-06-23 21:12:10,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1578222.0, ans=0.0 2023-06-23 21:12:18,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1578222.0, ans=0.1 2023-06-23 21:12:41,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1578282.0, ans=0.125 2023-06-23 21:12:44,220 INFO [train.py:996] (0/4) Epoch 9, batch 19100, loss[loss=0.2125, simple_loss=0.2798, pruned_loss=0.07255, over 21576.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3069, pruned_loss=0.0811, over 4265768.35 frames. ], batch size: 414, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:12:59,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1578342.0, ans=0.125 2023-06-23 21:13:34,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. 
limit=15.0 2023-06-23 21:13:38,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.144e+02 5.907e+02 9.492e+02 1.356e+03 2.303e+03, threshold=1.898e+03, percent-clipped=18.0 2023-06-23 21:14:17,898 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-23 21:14:26,126 INFO [train.py:996] (0/4) Epoch 9, batch 19150, loss[loss=0.3189, simple_loss=0.408, pruned_loss=0.115, over 21485.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3081, pruned_loss=0.08192, over 4269518.93 frames. ], batch size: 471, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:14:51,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1578702.0, ans=0.1 2023-06-23 21:15:17,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1578762.0, ans=0.1 2023-06-23 21:16:04,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1578882.0, ans=0.2 2023-06-23 21:16:07,495 INFO [train.py:996] (0/4) Epoch 9, batch 19200, loss[loss=0.2551, simple_loss=0.3665, pruned_loss=0.07182, over 21823.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3193, pruned_loss=0.08271, over 4276191.15 frames. ], batch size: 371, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:16:23,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1578942.0, ans=0.125 2023-06-23 21:16:42,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1579002.0, ans=0.0 2023-06-23 21:17:01,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.099e+02 6.099e+02 9.205e+02 1.363e+03 2.424e+03, threshold=1.841e+03, percent-clipped=8.0 2023-06-23 21:17:44,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1579182.0, ans=0.125 2023-06-23 21:17:47,718 INFO [train.py:996] (0/4) Epoch 9, batch 19250, loss[loss=0.235, simple_loss=0.3498, pruned_loss=0.06006, over 19808.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3194, pruned_loss=0.07753, over 4261399.54 frames. ], batch size: 703, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:18:19,373 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-23 21:18:22,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-23 21:18:41,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1579362.0, ans=0.125 2023-06-23 21:19:01,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1579422.0, ans=0.125 2023-06-23 21:19:05,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.63 vs. 
limit=15.0 2023-06-23 21:19:06,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1579482.0, ans=0.0 2023-06-23 21:19:27,371 INFO [train.py:996] (0/4) Epoch 9, batch 19300, loss[loss=0.2207, simple_loss=0.2922, pruned_loss=0.07459, over 21625.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3167, pruned_loss=0.07636, over 4268014.15 frames. ], batch size: 263, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:20:23,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.305e+02 4.858e+02 7.673e+02 1.130e+03 2.664e+03, threshold=1.535e+03, percent-clipped=6.0 2023-06-23 21:20:25,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1579722.0, ans=0.0 2023-06-23 21:20:52,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2023-06-23 21:21:12,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1579842.0, ans=0.0 2023-06-23 21:21:18,264 INFO [train.py:996] (0/4) Epoch 9, batch 19350, loss[loss=0.1884, simple_loss=0.2588, pruned_loss=0.05898, over 21149.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3108, pruned_loss=0.07292, over 4273017.60 frames. ], batch size: 143, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:22:04,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1580022.0, ans=0.05 2023-06-23 21:22:23,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1580022.0, ans=0.1 2023-06-23 21:22:41,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1580082.0, ans=0.125 2023-06-23 21:22:46,743 INFO [train.py:996] (0/4) Epoch 9, batch 19400, loss[loss=0.2144, simple_loss=0.2929, pruned_loss=0.06796, over 21859.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3077, pruned_loss=0.07273, over 4280616.20 frames. ], batch size: 351, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:23:08,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1580142.0, ans=0.125 2023-06-23 21:23:41,604 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 5.598e+02 7.737e+02 1.076e+03 2.272e+03, threshold=1.547e+03, percent-clipped=6.0 2023-06-23 21:24:12,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=22.5 2023-06-23 21:24:19,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1580382.0, ans=0.1 2023-06-23 21:24:36,311 INFO [train.py:996] (0/4) Epoch 9, batch 19450, loss[loss=0.2215, simple_loss=0.2848, pruned_loss=0.07915, over 21897.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3053, pruned_loss=0.07499, over 4285662.14 frames. ], batch size: 373, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:24:39,030 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.63 vs. 
limit=15.0 2023-06-23 21:24:59,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1580502.0, ans=0.0 2023-06-23 21:25:15,491 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5 2023-06-23 21:26:17,336 INFO [train.py:996] (0/4) Epoch 9, batch 19500, loss[loss=0.2315, simple_loss=0.2983, pruned_loss=0.0823, over 21771.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3016, pruned_loss=0.07563, over 4285574.88 frames. ], batch size: 282, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:26:34,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1580802.0, ans=0.0 2023-06-23 21:26:49,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=1580862.0, ans=22.5 2023-06-23 21:26:55,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1580862.0, ans=0.0 2023-06-23 21:27:06,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1580862.0, ans=0.125 2023-06-23 21:27:07,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.161e+02 5.798e+02 8.156e+02 1.303e+03 2.400e+03, threshold=1.631e+03, percent-clipped=12.0 2023-06-23 21:27:13,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1580922.0, ans=0.0 2023-06-23 21:27:31,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1580982.0, ans=0.0 2023-06-23 21:27:57,889 INFO [train.py:996] (0/4) Epoch 9, batch 19550, loss[loss=0.2029, simple_loss=0.2875, pruned_loss=0.05916, over 20806.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2987, pruned_loss=0.07466, over 4280666.37 frames. ], batch size: 607, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:29:01,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1581222.0, ans=0.0 2023-06-23 21:29:09,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1581282.0, ans=0.125 2023-06-23 21:29:37,028 INFO [train.py:996] (0/4) Epoch 9, batch 19600, loss[loss=0.2628, simple_loss=0.3156, pruned_loss=0.105, over 21779.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3005, pruned_loss=0.07571, over 4289431.35 frames. 
], batch size: 508, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:29:45,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1581342.0, ans=0.125 2023-06-23 21:29:55,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1581402.0, ans=0.035 2023-06-23 21:30:25,825 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.955e+02 6.042e+02 7.862e+02 1.200e+03 2.695e+03, threshold=1.572e+03, percent-clipped=11.0 2023-06-23 21:31:06,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1581582.0, ans=0.125 2023-06-23 21:31:15,791 INFO [train.py:996] (0/4) Epoch 9, batch 19650, loss[loss=0.2369, simple_loss=0.3063, pruned_loss=0.08369, over 21864.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3065, pruned_loss=0.08028, over 4287010.38 frames. ], batch size: 371, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:31:47,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1581702.0, ans=0.125 2023-06-23 21:31:57,876 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2023-06-23 21:32:02,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1581762.0, ans=0.125 2023-06-23 21:32:25,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1581822.0, ans=0.0 2023-06-23 21:32:30,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1581822.0, ans=0.0 2023-06-23 21:32:46,504 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-23 21:32:59,051 INFO [train.py:996] (0/4) Epoch 9, batch 19700, loss[loss=0.247, simple_loss=0.3409, pruned_loss=0.07657, over 21603.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.308, pruned_loss=0.08023, over 4286974.11 frames. ], batch size: 441, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:33:15,005 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0 2023-06-23 21:34:05,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 5.626e+02 7.962e+02 1.128e+03 2.480e+03, threshold=1.592e+03, percent-clipped=10.0 2023-06-23 21:34:14,707 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.94 vs. limit=15.0 2023-06-23 21:34:33,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1582182.0, ans=0.5 2023-06-23 21:34:43,682 INFO [train.py:996] (0/4) Epoch 9, batch 19750, loss[loss=0.2413, simple_loss=0.3212, pruned_loss=0.08073, over 21463.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3178, pruned_loss=0.08187, over 4281070.50 frames. 
], batch size: 194, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:34:49,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1582242.0, ans=0.0 2023-06-23 21:35:43,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1582362.0, ans=0.125 2023-06-23 21:36:12,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1582482.0, ans=0.125 2023-06-23 21:36:22,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1582542.0, ans=0.1 2023-06-23 21:36:23,314 INFO [train.py:996] (0/4) Epoch 9, batch 19800, loss[loss=0.2387, simple_loss=0.322, pruned_loss=0.07773, over 21422.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3182, pruned_loss=0.08268, over 4285005.81 frames. ], batch size: 548, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:36:27,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1582542.0, ans=10.0 2023-06-23 21:36:30,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1582542.0, ans=0.125 2023-06-23 21:36:50,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1582602.0, ans=0.125 2023-06-23 21:37:22,460 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.62 vs. limit=15.0 2023-06-23 21:37:26,055 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.106e+02 6.339e+02 1.006e+03 1.413e+03 2.674e+03, threshold=2.011e+03, percent-clipped=18.0 2023-06-23 21:37:39,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1582722.0, ans=0.2 2023-06-23 21:37:57,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1582782.0, ans=10.0 2023-06-23 21:38:02,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1582782.0, ans=0.125 2023-06-23 21:38:05,355 INFO [train.py:996] (0/4) Epoch 9, batch 19850, loss[loss=0.2638, simple_loss=0.3518, pruned_loss=0.08787, over 21624.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3114, pruned_loss=0.07814, over 4282135.66 frames. ], batch size: 441, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:38:55,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1582962.0, ans=0.0 2023-06-23 21:38:57,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1582962.0, ans=0.125 2023-06-23 21:39:02,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1582962.0, ans=0.2 2023-06-23 21:39:43,965 INFO [train.py:996] (0/4) Epoch 9, batch 19900, loss[loss=0.2171, simple_loss=0.2891, pruned_loss=0.07255, over 21763.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3119, pruned_loss=0.07486, over 4277297.71 frames. 
], batch size: 124, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:40:35,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1583262.0, ans=0.0 2023-06-23 21:40:42,508 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.30 vs. limit=15.0 2023-06-23 21:40:45,974 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.607e+02 5.026e+02 6.181e+02 9.309e+02 2.570e+03, threshold=1.236e+03, percent-clipped=2.0 2023-06-23 21:40:53,333 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-23 21:41:28,271 INFO [train.py:996] (0/4) Epoch 9, batch 19950, loss[loss=0.201, simple_loss=0.2602, pruned_loss=0.07087, over 21429.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3061, pruned_loss=0.0746, over 4273806.51 frames. ], batch size: 195, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:42:13,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1583562.0, ans=0.0 2023-06-23 21:42:52,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1583682.0, ans=0.125 2023-06-23 21:43:07,108 INFO [train.py:996] (0/4) Epoch 9, batch 20000, loss[loss=0.27, simple_loss=0.3398, pruned_loss=0.1001, over 21857.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3068, pruned_loss=0.07509, over 4276316.28 frames. ], batch size: 351, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:43:09,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1583742.0, ans=0.0 2023-06-23 21:43:52,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1583862.0, ans=0.0 2023-06-23 21:44:03,263 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.657e+02 5.210e+02 7.571e+02 1.068e+03 2.474e+03, threshold=1.514e+03, percent-clipped=20.0 2023-06-23 21:44:06,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1583922.0, ans=0.0 2023-06-23 21:44:28,460 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-264000.pt 2023-06-23 21:44:37,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1583982.0, ans=0.1 2023-06-23 21:44:46,986 INFO [train.py:996] (0/4) Epoch 9, batch 20050, loss[loss=0.263, simple_loss=0.3246, pruned_loss=0.1007, over 21616.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3092, pruned_loss=0.07796, over 4278218.87 frames. ], batch size: 471, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:45:24,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1584102.0, ans=0.125 2023-06-23 21:45:51,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1584222.0, ans=0.125 2023-06-23 21:46:28,211 INFO [train.py:996] (0/4) Epoch 9, batch 20100, loss[loss=0.2304, simple_loss=0.3342, pruned_loss=0.0633, over 21804.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3122, pruned_loss=0.08089, over 4281646.40 frames. 
], batch size: 282, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:47:13,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1584462.0, ans=0.2 2023-06-23 21:47:32,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.946e+02 5.376e+02 7.013e+02 1.127e+03 1.999e+03, threshold=1.403e+03, percent-clipped=12.0 2023-06-23 21:48:18,566 INFO [train.py:996] (0/4) Epoch 9, batch 20150, loss[loss=0.2608, simple_loss=0.338, pruned_loss=0.09177, over 21410.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3194, pruned_loss=0.08393, over 4283032.29 frames. ], batch size: 159, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:48:19,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1584642.0, ans=0.0 2023-06-23 21:48:42,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1584702.0, ans=0.125 2023-06-23 21:48:42,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1584702.0, ans=0.125 2023-06-23 21:48:47,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1584702.0, ans=0.1 2023-06-23 21:49:19,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1584822.0, ans=0.125 2023-06-23 21:50:01,821 INFO [train.py:996] (0/4) Epoch 9, batch 20200, loss[loss=0.2882, simple_loss=0.3877, pruned_loss=0.09436, over 20741.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3261, pruned_loss=0.08698, over 4273171.25 frames. ], batch size: 607, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:50:02,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1584942.0, ans=0.0 2023-06-23 21:50:08,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1584942.0, ans=0.125 2023-06-23 21:50:26,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1585002.0, ans=0.0 2023-06-23 21:50:59,410 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.884e+02 7.487e+02 1.027e+03 1.466e+03 2.661e+03, threshold=2.055e+03, percent-clipped=25.0 2023-06-23 21:51:42,591 INFO [train.py:996] (0/4) Epoch 9, batch 20250, loss[loss=0.2157, simple_loss=0.3049, pruned_loss=0.06326, over 21795.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3272, pruned_loss=0.08507, over 4271923.70 frames. 
], batch size: 332, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:52:05,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1585302.0, ans=0.2 2023-06-23 21:52:09,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1585302.0, ans=0.125 2023-06-23 21:52:17,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1585302.0, ans=0.0 2023-06-23 21:52:52,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1585422.0, ans=0.125 2023-06-23 21:53:12,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1585482.0, ans=0.0 2023-06-23 21:53:22,018 INFO [train.py:996] (0/4) Epoch 9, batch 20300, loss[loss=0.2058, simple_loss=0.2899, pruned_loss=0.06084, over 21422.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3235, pruned_loss=0.0817, over 4274894.02 frames. ], batch size: 211, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:54:05,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=12.0 2023-06-23 21:54:28,244 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.674e+02 5.917e+02 9.257e+02 1.332e+03 3.110e+03, threshold=1.851e+03, percent-clipped=6.0 2023-06-23 21:54:28,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1585722.0, ans=0.125 2023-06-23 21:54:28,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1585722.0, ans=0.0 2023-06-23 21:54:55,481 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-23 21:55:00,631 INFO [train.py:996] (0/4) Epoch 9, batch 20350, loss[loss=0.2586, simple_loss=0.3294, pruned_loss=0.09385, over 21801.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3234, pruned_loss=0.0817, over 4259272.37 frames. ], batch size: 332, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:55:32,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1585902.0, ans=0.125 2023-06-23 21:55:43,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.83 vs. limit=10.0 2023-06-23 21:56:20,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1586022.0, ans=0.0 2023-06-23 21:56:30,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1586082.0, ans=0.125 2023-06-23 21:56:34,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1586082.0, ans=0.025 2023-06-23 21:56:40,534 INFO [train.py:996] (0/4) Epoch 9, batch 20400, loss[loss=0.2941, simple_loss=0.371, pruned_loss=0.1086, over 21366.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3257, pruned_loss=0.08436, over 4260921.98 frames. 
], batch size: 548, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 21:57:42,775 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.128e+02 5.797e+02 8.307e+02 1.162e+03 2.401e+03, threshold=1.661e+03, percent-clipped=4.0 2023-06-23 21:58:04,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1586382.0, ans=0.125 2023-06-23 21:58:15,262 INFO [train.py:996] (0/4) Epoch 9, batch 20450, loss[loss=0.3076, simple_loss=0.3594, pruned_loss=0.1279, over 21553.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3263, pruned_loss=0.08727, over 4254599.17 frames. ], batch size: 471, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 21:58:50,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1586562.0, ans=0.1 2023-06-23 21:59:04,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1586562.0, ans=0.2 2023-06-23 21:59:37,640 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5 2023-06-23 21:59:49,248 INFO [train.py:996] (0/4) Epoch 9, batch 20500, loss[loss=0.2338, simple_loss=0.3031, pruned_loss=0.08225, over 21680.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3218, pruned_loss=0.08701, over 4256329.53 frames. ], batch size: 414, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:00:11,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1586802.0, ans=0.125 2023-06-23 22:00:13,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1586802.0, ans=0.125 2023-06-23 22:00:22,303 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0 2023-06-23 22:00:58,943 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.931e+02 5.832e+02 8.906e+02 1.259e+03 2.508e+03, threshold=1.781e+03, percent-clipped=16.0 2023-06-23 22:01:06,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1586922.0, ans=0.0 2023-06-23 22:01:10,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1586922.0, ans=0.1 2023-06-23 22:01:29,792 INFO [train.py:996] (0/4) Epoch 9, batch 20550, loss[loss=0.2031, simple_loss=0.2772, pruned_loss=0.06456, over 21136.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.315, pruned_loss=0.08481, over 4244191.64 frames. ], batch size: 143, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:02:42,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-23 22:03:10,659 INFO [train.py:996] (0/4) Epoch 9, batch 20600, loss[loss=0.2433, simple_loss=0.3111, pruned_loss=0.08779, over 21833.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.316, pruned_loss=0.08251, over 4236028.25 frames. 
], batch size: 282, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:03:19,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1587342.0, ans=0.1 2023-06-23 22:03:26,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1587342.0, ans=0.125 2023-06-23 22:03:33,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1587402.0, ans=0.2 2023-06-23 22:03:51,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1587462.0, ans=0.125 2023-06-23 22:04:20,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.756e+02 4.749e+02 5.700e+02 8.605e+02 1.495e+03, threshold=1.140e+03, percent-clipped=0.0 2023-06-23 22:04:35,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1587582.0, ans=0.125 2023-06-23 22:04:40,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1587582.0, ans=0.0 2023-06-23 22:04:51,890 INFO [train.py:996] (0/4) Epoch 9, batch 20650, loss[loss=0.2019, simple_loss=0.2777, pruned_loss=0.06304, over 21683.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3124, pruned_loss=0.08268, over 4237274.16 frames. ], batch size: 332, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:04:55,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1587642.0, ans=0.025 2023-06-23 22:05:37,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1587762.0, ans=0.1 2023-06-23 22:05:57,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1587822.0, ans=0.1 2023-06-23 22:06:01,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1587822.0, ans=0.0 2023-06-23 22:06:19,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1587882.0, ans=0.0 2023-06-23 22:06:32,158 INFO [train.py:996] (0/4) Epoch 9, batch 20700, loss[loss=0.2044, simple_loss=0.2915, pruned_loss=0.05861, over 21716.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.304, pruned_loss=0.07927, over 4236844.44 frames. ], batch size: 332, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:07:33,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1588062.0, ans=0.125 2023-06-23 22:07:44,048 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.364e+02 5.397e+02 7.207e+02 1.144e+03 2.870e+03, threshold=1.441e+03, percent-clipped=25.0 2023-06-23 22:07:49,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1588122.0, ans=0.025 2023-06-23 22:07:49,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1588122.0, ans=0.0 2023-06-23 22:08:20,578 INFO [train.py:996] (0/4) Epoch 9, batch 20750, loss[loss=0.2762, simple_loss=0.3728, pruned_loss=0.08987, over 21807.00 frames. 
], tot_loss[loss=0.2328, simple_loss=0.3078, pruned_loss=0.07893, over 4250428.88 frames. ], batch size: 371, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:08:22,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1588242.0, ans=0.0 2023-06-23 22:08:50,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1588302.0, ans=0.125 2023-06-23 22:09:44,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1588482.0, ans=0.0 2023-06-23 22:09:54,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-23 22:10:00,945 INFO [train.py:996] (0/4) Epoch 9, batch 20800, loss[loss=0.2187, simple_loss=0.2822, pruned_loss=0.07759, over 21720.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3107, pruned_loss=0.07953, over 4254661.05 frames. ], batch size: 316, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 22:10:04,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1588542.0, ans=0.1 2023-06-23 22:10:06,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1588542.0, ans=0.1 2023-06-23 22:10:25,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1588602.0, ans=0.125 2023-06-23 22:10:27,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1588602.0, ans=0.0 2023-06-23 22:10:53,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=14.07 vs. limit=15.0 2023-06-23 22:11:04,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1588722.0, ans=0.0 2023-06-23 22:11:06,869 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.367e+02 5.156e+02 7.961e+02 1.123e+03 3.663e+03, threshold=1.592e+03, percent-clipped=17.0 2023-06-23 22:11:15,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1588722.0, ans=0.1 2023-06-23 22:11:40,278 INFO [train.py:996] (0/4) Epoch 9, batch 20850, loss[loss=0.2084, simple_loss=0.2844, pruned_loss=0.0662, over 21827.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.302, pruned_loss=0.07728, over 4261883.34 frames. 
], batch size: 351, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:11:45,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1588842.0, ans=0.035 2023-06-23 22:12:07,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=1588902.0, ans=0.2 2023-06-23 22:12:14,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1588902.0, ans=0.2 2023-06-23 22:12:30,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1588962.0, ans=0.0 2023-06-23 22:12:37,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1588962.0, ans=0.5 2023-06-23 22:12:46,298 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-23 22:13:19,720 INFO [train.py:996] (0/4) Epoch 9, batch 20900, loss[loss=0.3264, simple_loss=0.3801, pruned_loss=0.1363, over 21646.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3036, pruned_loss=0.07907, over 4270750.92 frames. ], batch size: 508, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:13:26,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1589142.0, ans=0.0 2023-06-23 22:13:50,020 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:13:53,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1589202.0, ans=0.0 2023-06-23 22:14:21,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1589322.0, ans=0.5 2023-06-23 22:14:23,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.693e+02 5.793e+02 9.224e+02 1.777e+03 3.715e+03, threshold=1.845e+03, percent-clipped=30.0 2023-06-23 22:14:27,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1589322.0, ans=0.0 2023-06-23 22:14:31,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1589322.0, ans=0.2 2023-06-23 22:14:51,920 INFO [train.py:996] (0/4) Epoch 9, batch 20950, loss[loss=0.2036, simple_loss=0.2809, pruned_loss=0.06313, over 21855.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2992, pruned_loss=0.07544, over 4261395.41 frames. ], batch size: 102, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:15:43,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1589562.0, ans=0.125 2023-06-23 22:16:00,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1589622.0, ans=0.0 2023-06-23 22:16:29,645 INFO [train.py:996] (0/4) Epoch 9, batch 21000, loss[loss=0.1533, simple_loss=0.2264, pruned_loss=0.04013, over 15656.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2984, pruned_loss=0.076, over 4271236.44 frames. 
], batch size: 60, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:16:29,646 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 22:16:50,151 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2633, simple_loss=0.3613, pruned_loss=0.0826, over 1796401.00 frames. 2023-06-23 22:16:50,152 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 22:17:03,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0 2023-06-23 22:17:06,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1589742.0, ans=0.1 2023-06-23 22:17:45,525 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:17:50,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1589922.0, ans=0.1 2023-06-23 22:17:51,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 5.898e+02 8.113e+02 1.195e+03 2.501e+03, threshold=1.623e+03, percent-clipped=8.0 2023-06-23 22:18:08,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1589982.0, ans=0.125 2023-06-23 22:18:30,044 INFO [train.py:996] (0/4) Epoch 9, batch 21050, loss[loss=0.203, simple_loss=0.2646, pruned_loss=0.07072, over 21525.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2972, pruned_loss=0.07645, over 4275664.19 frames. ], batch size: 230, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:18:32,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1590042.0, ans=0.1 2023-06-23 22:18:39,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.44 vs. limit=10.0 2023-06-23 22:18:55,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-23 22:19:40,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1590222.0, ans=0.0 2023-06-23 22:19:56,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1590282.0, ans=0.125 2023-06-23 22:20:08,652 INFO [train.py:996] (0/4) Epoch 9, batch 21100, loss[loss=0.2574, simple_loss=0.3193, pruned_loss=0.09774, over 21527.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2942, pruned_loss=0.07641, over 4264737.22 frames. 
], batch size: 414, lr: 3.25e-03, grad_scale: 8.0 2023-06-23 22:20:10,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1590342.0, ans=0.035 2023-06-23 22:20:35,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1590402.0, ans=0.1 2023-06-23 22:21:04,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1590462.0, ans=0.125 2023-06-23 22:21:11,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.438e+02 5.136e+02 6.651e+02 8.328e+02 1.901e+03, threshold=1.330e+03, percent-clipped=2.0 2023-06-23 22:21:40,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1590582.0, ans=0.125 2023-06-23 22:21:48,087 INFO [train.py:996] (0/4) Epoch 9, batch 21150, loss[loss=0.1983, simple_loss=0.2684, pruned_loss=0.06409, over 21669.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2907, pruned_loss=0.07679, over 4265583.11 frames. ], batch size: 333, lr: 3.25e-03, grad_scale: 8.0 2023-06-23 22:21:50,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=15.0 2023-06-23 22:21:59,260 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=22.5 2023-06-23 22:22:00,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1590642.0, ans=0.125 2023-06-23 22:22:44,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1590822.0, ans=0.125 2023-06-23 22:23:26,855 INFO [train.py:996] (0/4) Epoch 9, batch 21200, loss[loss=0.2913, simple_loss=0.3278, pruned_loss=0.1274, over 21359.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2875, pruned_loss=0.07682, over 4252100.45 frames. ], batch size: 508, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:24:08,015 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-23 22:24:29,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.749e+02 4.861e+02 6.796e+02 9.543e+02 2.010e+03, threshold=1.359e+03, percent-clipped=3.0 2023-06-23 22:24:54,504 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-23 22:25:05,780 INFO [train.py:996] (0/4) Epoch 9, batch 21250, loss[loss=0.2917, simple_loss=0.3438, pruned_loss=0.1198, over 21464.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2859, pruned_loss=0.07673, over 4253971.53 frames. 
], batch size: 509, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:25:11,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1591242.0, ans=0.2 2023-06-23 22:25:39,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1591302.0, ans=0.125 2023-06-23 22:26:03,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1591422.0, ans=0.1 2023-06-23 22:26:16,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1591422.0, ans=0.1 2023-06-23 22:26:29,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1591482.0, ans=0.035 2023-06-23 22:26:29,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-23 22:26:41,385 INFO [train.py:996] (0/4) Epoch 9, batch 21300, loss[loss=0.2556, simple_loss=0.3247, pruned_loss=0.09327, over 21932.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2933, pruned_loss=0.07889, over 4253154.79 frames. ], batch size: 415, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:27:00,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1591542.0, ans=0.125 2023-06-23 22:27:45,051 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:27:49,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.81 vs. limit=5.0 2023-06-23 22:27:49,245 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.208e+02 6.806e+02 9.811e+02 1.401e+03 3.569e+03, threshold=1.962e+03, percent-clipped=29.0 2023-06-23 22:28:02,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1591782.0, ans=0.1 2023-06-23 22:28:25,313 INFO [train.py:996] (0/4) Epoch 9, batch 21350, loss[loss=0.2327, simple_loss=0.34, pruned_loss=0.06273, over 19749.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2979, pruned_loss=0.07912, over 4256303.52 frames. ], batch size: 703, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:29:13,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1591962.0, ans=0.0 2023-06-23 22:29:18,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1591962.0, ans=0.1 2023-06-23 22:30:10,711 INFO [train.py:996] (0/4) Epoch 9, batch 21400, loss[loss=0.2823, simple_loss=0.3443, pruned_loss=0.1101, over 21319.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3011, pruned_loss=0.07903, over 4254082.00 frames. 
], batch size: 176, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:30:17,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1592142.0, ans=0.2 2023-06-23 22:30:35,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1592202.0, ans=0.05 2023-06-23 22:31:08,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.672e+02 5.270e+02 6.886e+02 1.009e+03 2.109e+03, threshold=1.377e+03, percent-clipped=2.0 2023-06-23 22:31:11,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1592322.0, ans=0.0 2023-06-23 22:31:19,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1592322.0, ans=0.2 2023-06-23 22:31:48,460 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.49 vs. limit=15.0 2023-06-23 22:31:50,206 INFO [train.py:996] (0/4) Epoch 9, batch 21450, loss[loss=0.2396, simple_loss=0.3119, pruned_loss=0.08365, over 21482.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3049, pruned_loss=0.08044, over 4261875.99 frames. ], batch size: 548, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:32:11,921 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:32:25,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1592562.0, ans=0.125 2023-06-23 22:32:28,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1592562.0, ans=0.125 2023-06-23 22:32:35,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-23 22:33:12,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1592682.0, ans=0.0 2023-06-23 22:33:27,858 INFO [train.py:996] (0/4) Epoch 9, batch 21500, loss[loss=0.2029, simple_loss=0.2698, pruned_loss=0.06798, over 21729.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3029, pruned_loss=0.08142, over 4274714.75 frames. ], batch size: 333, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:33:28,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1592742.0, ans=0.1 2023-06-23 22:33:39,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1592742.0, ans=0.2 2023-06-23 22:33:40,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1592742.0, ans=10.0 2023-06-23 22:34:29,428 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.705e+02 5.721e+02 7.470e+02 9.927e+02 1.833e+03, threshold=1.494e+03, percent-clipped=12.0 2023-06-23 22:34:57,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1592982.0, ans=0.125 2023-06-23 22:35:06,823 INFO [train.py:996] (0/4) Epoch 9, batch 21550, loss[loss=0.2344, simple_loss=0.3018, pruned_loss=0.08347, over 21470.00 frames. 
], tot_loss[loss=0.2267, simple_loss=0.2956, pruned_loss=0.07887, over 4262753.57 frames. ], batch size: 211, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:35:31,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.98 vs. limit=15.0 2023-06-23 22:35:42,406 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-23 22:36:16,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1593222.0, ans=0.125 2023-06-23 22:36:46,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1593342.0, ans=0.125 2023-06-23 22:36:47,553 INFO [train.py:996] (0/4) Epoch 9, batch 21600, loss[loss=0.2214, simple_loss=0.3146, pruned_loss=0.06415, over 21573.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2907, pruned_loss=0.07698, over 4264380.57 frames. ], batch size: 389, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 22:37:11,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1593402.0, ans=0.1 2023-06-23 22:38:04,083 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.659e+02 6.338e+02 9.920e+02 1.459e+03 3.157e+03, threshold=1.984e+03, percent-clipped=22.0 2023-06-23 22:38:28,845 INFO [train.py:996] (0/4) Epoch 9, batch 21650, loss[loss=0.2279, simple_loss=0.3056, pruned_loss=0.07508, over 21239.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2953, pruned_loss=0.07534, over 4273538.86 frames. ], batch size: 143, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:39:15,473 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-23 22:40:02,181 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-06-23 22:40:07,980 INFO [train.py:996] (0/4) Epoch 9, batch 21700, loss[loss=0.2126, simple_loss=0.2722, pruned_loss=0.07651, over 21363.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2947, pruned_loss=0.07325, over 4272041.75 frames. ], batch size: 160, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:40:09,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1593942.0, ans=0.125 2023-06-23 22:40:13,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1593942.0, ans=0.2 2023-06-23 22:40:55,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1594062.0, ans=0.125 2023-06-23 22:41:10,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.363e+02 6.189e+02 8.394e+02 1.254e+03 2.013e+03, threshold=1.679e+03, percent-clipped=1.0 2023-06-23 22:41:29,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.78 vs. 
limit=6.0 2023-06-23 22:41:36,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1594182.0, ans=0.5 2023-06-23 22:41:45,167 INFO [train.py:996] (0/4) Epoch 9, batch 21750, loss[loss=0.1724, simple_loss=0.2382, pruned_loss=0.05334, over 21390.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2908, pruned_loss=0.07225, over 4277401.36 frames. ], batch size: 212, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:41:47,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1594242.0, ans=0.125 2023-06-23 22:43:25,531 INFO [train.py:996] (0/4) Epoch 9, batch 21800, loss[loss=0.2279, simple_loss=0.3172, pruned_loss=0.06923, over 21621.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2884, pruned_loss=0.07337, over 4284750.76 frames. ], batch size: 298, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:43:49,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.52 vs. limit=15.0 2023-06-23 22:44:34,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.869e+02 5.122e+02 6.776e+02 1.046e+03 2.535e+03, threshold=1.355e+03, percent-clipped=3.0 2023-06-23 22:44:49,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1594782.0, ans=0.1 2023-06-23 22:44:57,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1594782.0, ans=0.0 2023-06-23 22:45:05,013 INFO [train.py:996] (0/4) Epoch 9, batch 21850, loss[loss=0.2438, simple_loss=0.3092, pruned_loss=0.0892, over 21764.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2974, pruned_loss=0.07522, over 4277735.31 frames. ], batch size: 112, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:45:13,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1594842.0, ans=0.0 2023-06-23 22:45:37,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1594902.0, ans=0.035 2023-06-23 22:46:24,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1595022.0, ans=0.1 2023-06-23 22:46:29,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1595082.0, ans=0.2 2023-06-23 22:46:42,709 INFO [train.py:996] (0/4) Epoch 9, batch 21900, loss[loss=0.1964, simple_loss=0.2607, pruned_loss=0.06607, over 21754.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2959, pruned_loss=0.07568, over 4265583.65 frames. ], batch size: 124, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:46:43,185 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:47:17,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1595262.0, ans=0.125 2023-06-23 22:47:49,997 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.776e+02 5.560e+02 7.973e+02 1.226e+03 2.341e+03, threshold=1.595e+03, percent-clipped=19.0 2023-06-23 22:48:20,432 INFO [train.py:996] (0/4) Epoch 9, batch 21950, loss[loss=0.1911, simple_loss=0.2703, pruned_loss=0.05592, over 21891.00 frames. 
], tot_loss[loss=0.2194, simple_loss=0.2903, pruned_loss=0.07426, over 4274297.60 frames. ], batch size: 373, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:48:37,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1595502.0, ans=0.125 2023-06-23 22:49:10,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1595562.0, ans=0.125 2023-06-23 22:49:16,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1595562.0, ans=0.2 2023-06-23 22:49:16,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1595562.0, ans=0.125 2023-06-23 22:49:18,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=12.0 2023-06-23 22:49:21,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1595622.0, ans=0.125 2023-06-23 22:49:29,525 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-23 22:49:58,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1595742.0, ans=0.07 2023-06-23 22:49:59,992 INFO [train.py:996] (0/4) Epoch 9, batch 22000, loss[loss=0.2032, simple_loss=0.2725, pruned_loss=0.06698, over 21597.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.285, pruned_loss=0.07164, over 4271657.43 frames. ], batch size: 298, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 22:50:09,658 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-06-23 22:50:12,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1595742.0, ans=0.0 2023-06-23 22:50:19,273 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.59 vs. limit=15.0 2023-06-23 22:50:39,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1595802.0, ans=0.125 2023-06-23 22:51:14,077 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.248e+02 5.153e+02 7.605e+02 1.162e+03 2.837e+03, threshold=1.521e+03, percent-clipped=11.0 2023-06-23 22:51:31,871 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.40 vs. limit=15.0 2023-06-23 22:51:39,782 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-23 22:51:40,206 INFO [train.py:996] (0/4) Epoch 9, batch 22050, loss[loss=0.2853, simple_loss=0.3691, pruned_loss=0.1008, over 21622.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2908, pruned_loss=0.07344, over 4272142.39 frames. 
], batch size: 441, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 22:51:50,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1596042.0, ans=0.2 2023-06-23 22:52:01,892 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. limit=6.0 2023-06-23 22:52:27,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1596162.0, ans=0.125 2023-06-23 22:52:29,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1596162.0, ans=0.125 2023-06-23 22:52:50,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1596222.0, ans=0.125 2023-06-23 22:53:07,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1596282.0, ans=0.0 2023-06-23 22:53:19,876 INFO [train.py:996] (0/4) Epoch 9, batch 22100, loss[loss=0.2423, simple_loss=0.3123, pruned_loss=0.08618, over 21332.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3022, pruned_loss=0.07833, over 4267407.10 frames. ], batch size: 159, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:53:23,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1596342.0, ans=0.2 2023-06-23 22:54:24,319 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.86 vs. limit=22.5 2023-06-23 22:54:33,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1596522.0, ans=0.2 2023-06-23 22:54:34,353 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.056e+02 6.584e+02 8.540e+02 1.234e+03 2.755e+03, threshold=1.708e+03, percent-clipped=13.0 2023-06-23 22:54:36,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1596522.0, ans=0.2 2023-06-23 22:54:39,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1596522.0, ans=0.0 2023-06-23 22:54:57,984 INFO [train.py:996] (0/4) Epoch 9, batch 22150, loss[loss=0.2006, simple_loss=0.2807, pruned_loss=0.06025, over 21809.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3062, pruned_loss=0.08046, over 4263042.12 frames. ], batch size: 102, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:56:06,895 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.34 vs. limit=12.0 2023-06-23 22:56:22,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1596882.0, ans=0.125 2023-06-23 22:56:37,900 INFO [train.py:996] (0/4) Epoch 9, batch 22200, loss[loss=0.2557, simple_loss=0.3394, pruned_loss=0.08599, over 21879.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3075, pruned_loss=0.08155, over 4277452.67 frames. ], batch size: 316, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 22:56:58,412 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-23 22:57:10,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.68 vs. 
limit=6.0 2023-06-23 22:57:29,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1597062.0, ans=0.1 2023-06-23 22:57:48,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1597122.0, ans=0.0 2023-06-23 22:57:54,242 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.796e+02 5.424e+02 7.068e+02 9.828e+02 2.083e+03, threshold=1.414e+03, percent-clipped=7.0 2023-06-23 22:58:01,084 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:58:16,277 INFO [train.py:996] (0/4) Epoch 9, batch 22250, loss[loss=0.2694, simple_loss=0.3414, pruned_loss=0.09869, over 21205.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3144, pruned_loss=0.08341, over 4284916.37 frames. ], batch size: 143, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 22:58:34,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1597242.0, ans=0.1 2023-06-23 22:58:46,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1597302.0, ans=0.125 2023-06-23 22:58:56,533 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-23 22:59:40,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1597482.0, ans=0.1 2023-06-23 22:59:48,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1597482.0, ans=0.125 2023-06-23 22:59:54,581 INFO [train.py:996] (0/4) Epoch 9, batch 22300, loss[loss=0.2264, simple_loss=0.3268, pruned_loss=0.06294, over 19930.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3163, pruned_loss=0.08524, over 4288141.22 frames. ], batch size: 702, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 23:00:54,303 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-23 23:00:58,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1597662.0, ans=0.125 2023-06-23 23:01:10,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.760e+02 5.946e+02 8.093e+02 1.234e+03 3.372e+03, threshold=1.619e+03, percent-clipped=19.0 2023-06-23 23:01:33,326 INFO [train.py:996] (0/4) Epoch 9, batch 22350, loss[loss=0.2149, simple_loss=0.2832, pruned_loss=0.07331, over 21496.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3152, pruned_loss=0.08546, over 4293946.07 frames. ], batch size: 212, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 23:02:03,227 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:02:17,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1597902.0, ans=0.1 2023-06-23 23:02:27,972 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.85 vs. 
limit=22.5 2023-06-23 23:03:22,851 INFO [train.py:996] (0/4) Epoch 9, batch 22400, loss[loss=0.2649, simple_loss=0.3227, pruned_loss=0.1035, over 20066.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3113, pruned_loss=0.0817, over 4292571.09 frames. ], batch size: 703, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:03:32,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1598142.0, ans=0.1 2023-06-23 23:03:32,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1598142.0, ans=0.125 2023-06-23 23:03:54,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1598202.0, ans=0.0 2023-06-23 23:03:59,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1598202.0, ans=0.1 2023-06-23 23:03:59,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1598202.0, ans=0.125 2023-06-23 23:04:18,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1598322.0, ans=0.2 2023-06-23 23:04:29,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.562e+02 4.983e+02 6.853e+02 9.645e+02 2.077e+03, threshold=1.371e+03, percent-clipped=3.0 2023-06-23 23:05:00,748 INFO [train.py:996] (0/4) Epoch 9, batch 22450, loss[loss=0.1896, simple_loss=0.2488, pruned_loss=0.06523, over 21605.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3051, pruned_loss=0.07985, over 4280471.12 frames. ], batch size: 231, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:05:31,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1598502.0, ans=0.0 2023-06-23 23:05:38,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1598502.0, ans=0.07 2023-06-23 23:05:49,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1598562.0, ans=0.035 2023-06-23 23:06:08,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1598622.0, ans=0.0 2023-06-23 23:06:39,888 INFO [train.py:996] (0/4) Epoch 9, batch 22500, loss[loss=0.2119, simple_loss=0.3081, pruned_loss=0.05789, over 21193.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2985, pruned_loss=0.07883, over 4273752.99 frames. ], batch size: 176, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:07:11,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1598802.0, ans=0.02 2023-06-23 23:07:13,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1598802.0, ans=0.1 2023-06-23 23:07:47,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 5.895e+02 7.835e+02 1.248e+03 2.629e+03, threshold=1.567e+03, percent-clipped=21.0 2023-06-23 23:08:19,017 INFO [train.py:996] (0/4) Epoch 9, batch 22550, loss[loss=0.2269, simple_loss=0.3299, pruned_loss=0.06196, over 20728.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3015, pruned_loss=0.07937, over 4277545.26 frames. 
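The scaling.py:962 "Whitening" entries compare a per-module statistic against a limit (for example metric=12.86 vs. limit=22.5). The exact definition of that metric is not visible from the log itself, so the sketch below is only a plausible reading: it measures how unevenly the energy of a module's activations is spread across the eigenvalues of their covariance, so a value near 1 means "already white" and a large value means a few directions dominate.

```python
# Hedged sketch: one plausible "whitening" statistic, the largest eigenvalue of
# the channel covariance divided by the mean eigenvalue.  Perfectly white
# activations give a value near 1; a few dominant directions give a large
# value.  This is an assumption about what the logged metric means, not the
# formula used in scaling.py.
import torch


def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations collected from one module
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]            # (C, C) covariance
    eigs = torch.linalg.eigvalsh(cov)          # real, ascending
    return (eigs.max() / eigs.mean().clamp(min=1e-20)).item()


x = torch.randn(1000, 256)                     # roughly white input
print(whitening_metric(x))                     # small, around 2
y = x[:, :1] + 0.01 * x                        # strongly correlated channels
print(whitening_metric(y))                     # large, far above typical limits
```

Under this reading, an entry such as metric=4.96 vs. limit=12.0 simply records that the module stayed comfortably inside its limit over that interval.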
], batch size: 607, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:08:39,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1599042.0, ans=0.1 2023-06-23 23:08:49,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1599102.0, ans=0.125 2023-06-23 23:08:56,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1599102.0, ans=0.125 2023-06-23 23:09:49,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1599282.0, ans=0.125 2023-06-23 23:10:06,626 INFO [train.py:996] (0/4) Epoch 9, batch 22600, loss[loss=0.1394, simple_loss=0.1887, pruned_loss=0.04507, over 17076.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3064, pruned_loss=0.08125, over 4277251.08 frames. ], batch size: 65, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:10:50,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1599462.0, ans=0.0 2023-06-23 23:11:13,385 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.041e+02 6.856e+02 1.098e+03 1.547e+03 4.006e+03, threshold=2.196e+03, percent-clipped=25.0 2023-06-23 23:11:38,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1599582.0, ans=0.0 2023-06-23 23:11:45,436 INFO [train.py:996] (0/4) Epoch 9, batch 22650, loss[loss=0.2449, simple_loss=0.3187, pruned_loss=0.08555, over 21565.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.304, pruned_loss=0.08114, over 4276920.07 frames. ], batch size: 389, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:11:46,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-23 23:12:08,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1599702.0, ans=0.125 2023-06-23 23:12:59,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=22.5 2023-06-23 23:13:18,778 INFO [train.py:996] (0/4) Epoch 9, batch 22700, loss[loss=0.2167, simple_loss=0.287, pruned_loss=0.07319, over 21845.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2979, pruned_loss=0.0803, over 4261188.66 frames. ], batch size: 372, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:13:24,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1599942.0, ans=0.05 2023-06-23 23:13:37,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1600002.0, ans=0.125 2023-06-23 23:13:43,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1600002.0, ans=0.125 2023-06-23 23:14:26,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.745e+02 5.778e+02 8.238e+02 1.243e+03 2.659e+03, threshold=1.648e+03, percent-clipped=2.0 2023-06-23 23:14:42,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. 
limit=15.0 2023-06-23 23:14:43,890 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2023-06-23 23:14:58,274 INFO [train.py:996] (0/4) Epoch 9, batch 22750, loss[loss=0.2427, simple_loss=0.3032, pruned_loss=0.09106, over 20688.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.2974, pruned_loss=0.0807, over 4256980.64 frames. ], batch size: 607, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:15:05,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1600242.0, ans=0.1 2023-06-23 23:15:51,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1600362.0, ans=0.1 2023-06-23 23:16:12,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1600482.0, ans=0.125 2023-06-23 23:16:37,197 INFO [train.py:996] (0/4) Epoch 9, batch 22800, loss[loss=0.2555, simple_loss=0.3203, pruned_loss=0.09532, over 21235.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3029, pruned_loss=0.08374, over 4270116.44 frames. ], batch size: 143, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 23:16:47,755 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-06-23 23:17:45,288 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.424e+02 5.839e+02 8.746e+02 1.348e+03 2.535e+03, threshold=1.749e+03, percent-clipped=13.0 2023-06-23 23:17:46,497 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-23 23:17:47,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1600722.0, ans=0.0 2023-06-23 23:18:00,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-23 23:18:14,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1600842.0, ans=10.0 2023-06-23 23:18:15,369 INFO [train.py:996] (0/4) Epoch 9, batch 22850, loss[loss=0.2329, simple_loss=0.3081, pruned_loss=0.07892, over 21427.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3007, pruned_loss=0.08315, over 4272193.01 frames. ], batch size: 131, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:18:17,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1600842.0, ans=0.0 2023-06-23 23:18:18,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1600842.0, ans=0.2 2023-06-23 23:18:47,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1600962.0, ans=0.125 2023-06-23 23:18:57,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1600962.0, ans=0.125 2023-06-23 23:19:49,690 INFO [train.py:996] (0/4) Epoch 9, batch 22900, loss[loss=0.2211, simple_loss=0.3265, pruned_loss=0.05782, over 21809.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2999, pruned_loss=0.0815, over 4269508.34 frames. 
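The optim.py:471 entries summarize recent gradient norms as five quantile points plus a clipping threshold and a percent-clipped figure; in the entries of this section the threshold equals Clipping_scale times the middle quantile (for example 2.0 x 7.605e+02 = 1.521e+03). The sketch below follows that rule; the window length and the way percent-clipped is accumulated are assumptions, not the actual optimizer code.

```python
# Hedged sketch of quantile-based gradient clipping: keep a trailing window of
# gradient norms, set the threshold to clipping_scale times the windowed
# median, clip anything above it, and report the fraction of clipped batches.
# Window length and bookkeeping details are assumptions.
from collections import deque

import torch


class WindowedGradClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 200):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.num_batches = 0
        self.num_clipped = 0

    def clip_(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms.append(norm)
        # five summary points, in the spirit of the "grad-norm quartiles" line
        quartiles = torch.tensor(list(self.norms)).quantile(
            torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * quartiles[2].item()  # scale * median
        self.num_batches += 1
        if norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)
        return 100.0 * self.num_clipped / self.num_batches     # percent-clipped
```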
], batch size: 282, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:20:00,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1601142.0, ans=0.0 2023-06-23 23:20:20,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1601202.0, ans=0.2 2023-06-23 23:20:54,788 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-23 23:21:02,734 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.772e+02 7.597e+02 1.119e+03 1.552e+03 2.740e+03, threshold=2.237e+03, percent-clipped=15.0 2023-06-23 23:21:19,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1601382.0, ans=0.125 2023-06-23 23:21:23,739 INFO [train.py:996] (0/4) Epoch 9, batch 22950, loss[loss=0.2401, simple_loss=0.372, pruned_loss=0.05417, over 20761.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3153, pruned_loss=0.08031, over 4270580.09 frames. ], batch size: 607, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:21:30,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1601442.0, ans=0.125 2023-06-23 23:22:01,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1601502.0, ans=0.125 2023-06-23 23:22:01,699 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-23 23:22:08,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1601562.0, ans=0.0 2023-06-23 23:22:13,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1601562.0, ans=0.125 2023-06-23 23:22:52,739 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.79 vs. limit=22.5 2023-06-23 23:23:02,631 INFO [train.py:996] (0/4) Epoch 9, batch 23000, loss[loss=0.2252, simple_loss=0.2971, pruned_loss=0.07667, over 21912.00 frames. ], tot_loss[loss=0.236, simple_loss=0.316, pruned_loss=0.07802, over 4268788.33 frames. 
], batch size: 316, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:23:04,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1601742.0, ans=0.125 2023-06-23 23:23:31,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1601802.0, ans=0.125 2023-06-23 23:23:46,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1601862.0, ans=0.1 2023-06-23 23:23:57,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1601862.0, ans=0.125 2023-06-23 23:24:17,146 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.759e+02 5.434e+02 6.875e+02 9.682e+02 1.732e+03, threshold=1.375e+03, percent-clipped=0.0 2023-06-23 23:24:32,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1601982.0, ans=0.05 2023-06-23 23:24:38,064 INFO [train.py:996] (0/4) Epoch 9, batch 23050, loss[loss=0.2414, simple_loss=0.3097, pruned_loss=0.08653, over 21232.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3171, pruned_loss=0.0802, over 4273813.93 frames. ], batch size: 176, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:24:56,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1602042.0, ans=0.125 2023-06-23 23:25:11,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1602102.0, ans=0.05 2023-06-23 23:25:21,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1602162.0, ans=0.1 2023-06-23 23:25:34,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1602162.0, ans=0.0 2023-06-23 23:25:34,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1602162.0, ans=0.07 2023-06-23 23:26:13,074 INFO [train.py:996] (0/4) Epoch 9, batch 23100, loss[loss=0.2152, simple_loss=0.2712, pruned_loss=0.07962, over 21391.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.313, pruned_loss=0.08096, over 4271334.12 frames. ], batch size: 131, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:26:48,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1602402.0, ans=0.125 2023-06-23 23:27:30,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.933e+02 6.265e+02 7.988e+02 9.890e+02 1.959e+03, threshold=1.598e+03, percent-clipped=10.0 2023-06-23 23:27:32,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1602522.0, ans=0.125 2023-06-23 23:27:51,473 INFO [train.py:996] (0/4) Epoch 9, batch 23150, loss[loss=0.2168, simple_loss=0.2876, pruned_loss=0.07299, over 21927.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3065, pruned_loss=0.08069, over 4270627.72 frames. 
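Each train.py:996 entry reports a per-batch loss "over N frames" next to a tot_loss over roughly 4.27 million frames: the per-batch figures are weighted by their frame counts, and tot_loss behaves like a frame-weighted running aggregate over recent batches. A small sketch of that bookkeeping follows; the exponential decay factor is an assumption, not the value used by the recipe's tracker.

```python
# Hedged sketch: frame-weighted aggregation of per-batch losses, so that
# batches with more frames count proportionally more, as in the
# "loss=... over N frames" figures above.  The decay factor is an assumption.
class RunningLoss:
    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0   # decayed sum of loss * frames
        self.frames = 0.0     # decayed frame count

    def update(self, loss: float, num_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + loss * num_frames
        self.frames = self.decay * self.frames + num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)


tot = RunningLoss()
tot.update(0.2032, 21597.0)   # batch 22000 above: loss 0.2032 over 21597 frames
tot.update(0.2853, 21622.0)   # batch 22050 above: loss 0.2853 over 21622 frames
print(tot.value)              # ~0.244, the frame-weighted blend of the two
```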
], batch size: 316, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:28:28,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1602702.0, ans=0.125 2023-06-23 23:28:33,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1602762.0, ans=0.125 2023-06-23 23:29:05,371 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:29:12,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=15.0 2023-06-23 23:29:29,537 INFO [train.py:996] (0/4) Epoch 9, batch 23200, loss[loss=0.2097, simple_loss=0.2799, pruned_loss=0.06977, over 21675.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3057, pruned_loss=0.08154, over 4276118.07 frames. ], batch size: 263, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 23:30:13,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1603062.0, ans=0.1 2023-06-23 23:30:29,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1603122.0, ans=0.125 2023-06-23 23:30:46,170 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.636e+02 5.576e+02 6.977e+02 1.069e+03 2.508e+03, threshold=1.395e+03, percent-clipped=7.0 2023-06-23 23:30:53,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1603182.0, ans=0.125 2023-06-23 23:31:07,220 INFO [train.py:996] (0/4) Epoch 9, batch 23250, loss[loss=0.2235, simple_loss=0.2827, pruned_loss=0.08218, over 21497.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3043, pruned_loss=0.08199, over 4280927.37 frames. ], batch size: 194, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 23:31:48,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1603362.0, ans=0.0 2023-06-23 23:31:50,893 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.0 2023-06-23 23:32:34,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1603482.0, ans=0.125 2023-06-23 23:32:52,344 INFO [train.py:996] (0/4) Epoch 9, batch 23300, loss[loss=0.2592, simple_loss=0.3413, pruned_loss=0.08856, over 21336.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3123, pruned_loss=0.08387, over 4273721.62 frames. ], batch size: 548, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:33:42,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1603662.0, ans=0.1 2023-06-23 23:34:07,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1603722.0, ans=0.1 2023-06-23 23:34:08,552 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.193e+02 5.673e+02 7.444e+02 1.083e+03 2.210e+03, threshold=1.489e+03, percent-clipped=13.0 2023-06-23 23:34:16,746 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.30 vs. 
limit=15.0 2023-06-23 23:34:28,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1603782.0, ans=0.0 2023-06-23 23:34:37,346 INFO [train.py:996] (0/4) Epoch 9, batch 23350, loss[loss=0.2395, simple_loss=0.3397, pruned_loss=0.06966, over 20711.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3183, pruned_loss=0.08344, over 4265510.24 frames. ], batch size: 607, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:35:15,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1603902.0, ans=0.1 2023-06-23 23:35:28,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1603962.0, ans=0.2 2023-06-23 23:36:05,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1604082.0, ans=0.125 2023-06-23 23:36:15,594 INFO [train.py:996] (0/4) Epoch 9, batch 23400, loss[loss=0.2328, simple_loss=0.3004, pruned_loss=0.08261, over 21772.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.31, pruned_loss=0.07949, over 4267973.10 frames. ], batch size: 247, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:36:17,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1604142.0, ans=0.1 2023-06-23 23:37:32,092 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.577e+02 6.077e+02 8.548e+02 1.177e+03 1.985e+03, threshold=1.710e+03, percent-clipped=13.0 2023-06-23 23:37:38,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1604382.0, ans=0.125 2023-06-23 23:37:55,568 INFO [train.py:996] (0/4) Epoch 9, batch 23450, loss[loss=0.2231, simple_loss=0.2954, pruned_loss=0.07538, over 20771.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3106, pruned_loss=0.08186, over 4268765.56 frames. ], batch size: 608, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:38:15,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1604442.0, ans=0.2 2023-06-23 23:38:54,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1604622.0, ans=0.125 2023-06-23 23:39:10,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1604622.0, ans=0.07 2023-06-23 23:39:33,087 INFO [train.py:996] (0/4) Epoch 9, batch 23500, loss[loss=0.2197, simple_loss=0.2907, pruned_loss=0.0743, over 21884.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3117, pruned_loss=0.08441, over 4278169.82 frames. ], batch size: 371, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:40:04,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.05 vs. 
limit=22.5 2023-06-23 23:40:48,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.149e+02 5.607e+02 7.001e+02 9.691e+02 1.810e+03, threshold=1.400e+03, percent-clipped=1.0 2023-06-23 23:41:06,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1604982.0, ans=0.0 2023-06-23 23:41:06,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1604982.0, ans=0.125 2023-06-23 23:41:11,102 INFO [train.py:996] (0/4) Epoch 9, batch 23550, loss[loss=0.1826, simple_loss=0.246, pruned_loss=0.05964, over 21532.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3078, pruned_loss=0.08472, over 4273729.48 frames. ], batch size: 195, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:41:42,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1605102.0, ans=15.0 2023-06-23 23:41:44,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1605102.0, ans=0.125 2023-06-23 23:41:48,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1605102.0, ans=0.125 2023-06-23 23:42:36,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1605282.0, ans=0.025 2023-06-23 23:42:36,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1605282.0, ans=0.125 2023-06-23 23:42:54,903 INFO [train.py:996] (0/4) Epoch 9, batch 23600, loss[loss=0.2502, simple_loss=0.3213, pruned_loss=0.08953, over 21662.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3088, pruned_loss=0.08463, over 4262044.14 frames. ], batch size: 298, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:44:15,253 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.063e+02 5.548e+02 8.580e+02 1.181e+03 2.336e+03, threshold=1.716e+03, percent-clipped=15.0 2023-06-23 23:44:43,208 INFO [train.py:996] (0/4) Epoch 9, batch 23650, loss[loss=0.2521, simple_loss=0.3298, pruned_loss=0.08718, over 21377.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3086, pruned_loss=0.08306, over 4259658.27 frames. ], batch size: 143, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:46:23,159 INFO [train.py:996] (0/4) Epoch 9, batch 23700, loss[loss=0.244, simple_loss=0.3102, pruned_loss=0.08887, over 21203.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3108, pruned_loss=0.08212, over 4263946.29 frames. ], batch size: 143, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:46:28,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1605942.0, ans=0.2 2023-06-23 23:46:54,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.05 vs. 
limit=15.0 2023-06-23 23:46:55,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1606002.0, ans=0.0 2023-06-23 23:47:15,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1606062.0, ans=0.0 2023-06-23 23:47:36,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1606122.0, ans=0.125 2023-06-23 23:47:49,320 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.090e+02 6.365e+02 8.254e+02 1.222e+03 2.661e+03, threshold=1.651e+03, percent-clipped=9.0 2023-06-23 23:47:52,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1606182.0, ans=0.035 2023-06-23 23:48:05,091 INFO [train.py:996] (0/4) Epoch 9, batch 23750, loss[loss=0.2427, simple_loss=0.315, pruned_loss=0.0852, over 20157.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3144, pruned_loss=0.08328, over 4264441.72 frames. ], batch size: 702, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:48:47,835 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:48:58,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1606362.0, ans=0.125 2023-06-23 23:49:08,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1606422.0, ans=0.0 2023-06-23 23:49:45,922 INFO [train.py:996] (0/4) Epoch 9, batch 23800, loss[loss=0.2718, simple_loss=0.3491, pruned_loss=0.09726, over 21407.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3122, pruned_loss=0.0808, over 4273364.28 frames. ], batch size: 471, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:49:49,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1606542.0, ans=0.2 2023-06-23 23:50:04,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1606542.0, ans=0.125 2023-06-23 23:50:37,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1606662.0, ans=0.125 2023-06-23 23:50:48,178 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-23 23:51:11,426 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.911e+02 6.511e+02 9.622e+02 1.496e+03 3.900e+03, threshold=1.924e+03, percent-clipped=16.0 2023-06-23 23:51:15,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1606782.0, ans=0.125 2023-06-23 23:51:25,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1606782.0, ans=0.125 2023-06-23 23:51:27,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1606782.0, ans=0.125 2023-06-23 23:51:33,191 INFO [train.py:996] (0/4) Epoch 9, batch 23850, loss[loss=0.273, simple_loss=0.3417, pruned_loss=0.1022, over 21494.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3217, pruned_loss=0.08406, over 4273268.95 frames. 
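The scaling.py:1052 "WithLoss" entries (for example loss-sum=0.000e+00 for encoder.encoders.2.encoder.layers.1.self_attn_weights) appear to summarize an auxiliary penalty accumulated on a named submodule; a sum of zero over the logging interval would mean the penalty never activated. That interpretation is a guess, and the sketch below only illustrates the general attach-and-accumulate pattern, not the scaling.py mechanism.

```python
# Hedged sketch of an attach-and-accumulate auxiliary penalty on a named
# submodule.  A guess at the pattern behind the "WithLoss ... loss-sum"
# entries; the penalty term itself is a placeholder.
import torch
import torch.nn as nn


class WithAuxLoss(nn.Module):
    def __init__(self, inner: nn.Module, weight: float = 1.0):
        super().__init__()
        self.inner = inner
        self.weight = weight
        self.loss_sum = 0.0   # accumulated penalty, reset at each log interval

    def forward(self, x):
        y = self.inner(x)
        penalty = self.weight * y.pow(2).mean()   # placeholder penalty term
        if self.training:
            self.loss_sum += float(penalty.detach())
        return y, penalty                          # caller adds penalty to the loss


wrapped = WithAuxLoss(nn.Linear(256, 256))
out, pen = wrapped(torch.randn(8, 256))
print(wrapped.loss_sum)   # what a "loss-sum=..." line would report for this module
```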
], batch size: 194, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:51:49,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1606842.0, ans=0.0 2023-06-23 23:52:34,877 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:53:14,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1607082.0, ans=0.125 2023-06-23 23:53:17,374 INFO [train.py:996] (0/4) Epoch 9, batch 23900, loss[loss=0.2236, simple_loss=0.3072, pruned_loss=0.07003, over 21558.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3287, pruned_loss=0.08562, over 4269357.33 frames. ], batch size: 263, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:53:54,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1607202.0, ans=0.2 2023-06-23 23:54:30,538 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.004e+02 6.120e+02 8.437e+02 1.170e+03 2.663e+03, threshold=1.687e+03, percent-clipped=5.0 2023-06-23 23:54:56,268 INFO [train.py:996] (0/4) Epoch 9, batch 23950, loss[loss=0.2831, simple_loss=0.3409, pruned_loss=0.1126, over 21618.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3218, pruned_loss=0.08527, over 4272856.96 frames. ], batch size: 441, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:56:17,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1607682.0, ans=0.0 2023-06-23 23:56:26,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1607682.0, ans=0.125 2023-06-23 23:56:40,013 INFO [train.py:996] (0/4) Epoch 9, batch 24000, loss[loss=0.293, simple_loss=0.3587, pruned_loss=0.1137, over 21815.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3238, pruned_loss=0.08853, over 4281339.62 frames. ], batch size: 441, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:56:40,014 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 23:57:00,113 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2698, simple_loss=0.3635, pruned_loss=0.08806, over 1796401.00 frames. 2023-06-23 23:57:00,114 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-23 23:58:19,617 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.132e+02 5.711e+02 7.408e+02 1.023e+03 1.952e+03, threshold=1.482e+03, percent-clipped=3.0 2023-06-23 23:58:23,066 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-268000.pt 2023-06-23 23:58:26,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=1607982.0, ans=0.2 2023-06-23 23:58:41,989 INFO [train.py:996] (0/4) Epoch 9, batch 24050, loss[loss=0.2542, simple_loss=0.3311, pruned_loss=0.08864, over 20253.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3257, pruned_loss=0.08923, over 4275282.89 frames. ], batch size: 703, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:59:07,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1608102.0, ans=0.0 2023-06-23 23:59:16,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. 
limit=15.0 2023-06-24 00:00:17,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1608282.0, ans=0.2 2023-06-24 00:00:20,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1608342.0, ans=0.125 2023-06-24 00:00:21,952 INFO [train.py:996] (0/4) Epoch 9, batch 24100, loss[loss=0.2299, simple_loss=0.3035, pruned_loss=0.07819, over 20064.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3251, pruned_loss=0.08667, over 4272311.36 frames. ], batch size: 702, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:00:29,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1608342.0, ans=0.2 2023-06-24 00:01:24,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1608522.0, ans=0.125 2023-06-24 00:01:35,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1608522.0, ans=0.025 2023-06-24 00:01:39,804 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.035e+02 6.221e+02 8.690e+02 1.208e+03 2.210e+03, threshold=1.738e+03, percent-clipped=15.0 2023-06-24 00:02:00,852 INFO [train.py:996] (0/4) Epoch 9, batch 24150, loss[loss=0.2558, simple_loss=0.3205, pruned_loss=0.09553, over 21727.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3239, pruned_loss=0.08788, over 4281402.27 frames. ], batch size: 389, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:02:24,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608702.0, ans=0.1 2023-06-24 00:03:28,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1608882.0, ans=0.125 2023-06-24 00:03:41,747 INFO [train.py:996] (0/4) Epoch 9, batch 24200, loss[loss=0.2348, simple_loss=0.3031, pruned_loss=0.08321, over 21262.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3279, pruned_loss=0.08998, over 4287344.45 frames. ], batch size: 159, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:03:43,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-06-24 00:03:50,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-06-24 00:03:52,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5 2023-06-24 00:04:22,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1609062.0, ans=0.0 2023-06-24 00:04:28,375 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. 
limit=15.0 2023-06-24 00:04:31,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1609062.0, ans=0.1 2023-06-24 00:05:07,429 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.768e+02 7.380e+02 9.978e+02 1.387e+03 2.651e+03, threshold=1.996e+03, percent-clipped=13.0 2023-06-24 00:05:22,634 INFO [train.py:996] (0/4) Epoch 9, batch 24250, loss[loss=0.1861, simple_loss=0.2695, pruned_loss=0.05128, over 21302.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3238, pruned_loss=0.0831, over 4290633.34 frames. ], batch size: 143, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:05:42,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1609242.0, ans=0.015 2023-06-24 00:06:29,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1609422.0, ans=0.1 2023-06-24 00:06:45,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1609482.0, ans=0.2 2023-06-24 00:07:02,884 INFO [train.py:996] (0/4) Epoch 9, batch 24300, loss[loss=0.1756, simple_loss=0.2585, pruned_loss=0.04634, over 21715.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3156, pruned_loss=0.07711, over 4280560.21 frames. ], batch size: 298, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:08:01,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1609662.0, ans=0.0 2023-06-24 00:08:28,257 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.714e+02 6.136e+02 8.324e+02 1.263e+03 3.161e+03, threshold=1.665e+03, percent-clipped=12.0 2023-06-24 00:08:35,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1609782.0, ans=0.1 2023-06-24 00:08:35,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1609782.0, ans=0.125 2023-06-24 00:08:52,374 INFO [train.py:996] (0/4) Epoch 9, batch 24350, loss[loss=0.1923, simple_loss=0.2498, pruned_loss=0.06739, over 20195.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3121, pruned_loss=0.07728, over 4281779.68 frames. ], batch size: 702, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:10:34,932 INFO [train.py:996] (0/4) Epoch 9, batch 24400, loss[loss=0.246, simple_loss=0.3135, pruned_loss=0.08923, over 21472.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3162, pruned_loss=0.08093, over 4282911.46 frames. ], batch size: 194, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:10:36,980 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:10:46,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1610142.0, ans=0.125 2023-06-24 00:11:56,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.237e+02 5.410e+02 6.876e+02 9.194e+02 2.686e+03, threshold=1.375e+03, percent-clipped=10.0 2023-06-24 00:12:11,161 INFO [train.py:996] (0/4) Epoch 9, batch 24450, loss[loss=0.2309, simple_loss=0.317, pruned_loss=0.07239, over 21644.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3201, pruned_loss=0.08211, over 4274916.22 frames. 
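A little above, at batch 24000, the log briefly switches from training to evaluation: it computes a validation loss over the dev set (loss=0.2698 over 1796401.00 frames), notes the peak memory, and saves checkpoint-268000.pt under the experiment directory. A minimal sketch of that validate-then-checkpoint pattern follows; the dev-loader batch format, the loss_fn callable, and the checkpoint contents and filename scheme are illustrative assumptions, not the exact train.py / checkpoint.py behaviour.

```python
# Hedged sketch of the validate-then-checkpoint pattern seen at batch 24000.
# Loader format, loss_fn and filename scheme are illustrative assumptions.
import torch


def validate(model, dev_loader, loss_fn) -> float:
    """Frame-weighted validation loss over the whole dev set."""
    was_training = model.training
    model.eval()
    loss_sum, frames = 0.0, 0.0
    with torch.no_grad():                      # no gradients during validation
        for features, targets, num_frames in dev_loader:
            loss = loss_fn(model, features, targets)
            loss_sum += loss.item() * num_frames
            frames += num_frames
    if was_training:
        model.train()                          # resume training mode afterwards
    return loss_sum / max(frames, 1.0)


def save_checkpoint(model, optimizer, batch_idx: int, exp_dir: str) -> None:
    # e.g. exp_dir="zipformer/exp_L_small_causal", batch_idx=268000
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "batch_idx_train": batch_idx},
        f"{exp_dir}/checkpoint-{batch_idx}.pt",
    )
```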
], batch size: 247, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:12:25,344 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-24 00:12:39,942 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.72 vs. limit=15.0 2023-06-24 00:12:40,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1610502.0, ans=0.2 2023-06-24 00:12:44,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1610502.0, ans=0.125 2023-06-24 00:13:41,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1610682.0, ans=0.5 2023-06-24 00:13:51,193 INFO [train.py:996] (0/4) Epoch 9, batch 24500, loss[loss=0.2076, simple_loss=0.2891, pruned_loss=0.06305, over 21368.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3213, pruned_loss=0.08261, over 4278316.80 frames. ], batch size: 194, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:14:42,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1610862.0, ans=0.1 2023-06-24 00:15:15,971 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.883e+02 4.941e+02 6.307e+02 8.711e+02 3.165e+03, threshold=1.261e+03, percent-clipped=6.0 2023-06-24 00:15:23,540 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.75 vs. limit=10.0 2023-06-24 00:15:28,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0 2023-06-24 00:15:35,235 INFO [train.py:996] (0/4) Epoch 9, batch 24550, loss[loss=0.3138, simple_loss=0.3813, pruned_loss=0.1232, over 21213.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3228, pruned_loss=0.08457, over 4278997.69 frames. ], batch size: 143, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:15:54,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1611102.0, ans=0.2 2023-06-24 00:15:58,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1611102.0, ans=0.0 2023-06-24 00:16:01,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1611102.0, ans=0.125 2023-06-24 00:17:05,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1611282.0, ans=0.0 2023-06-24 00:17:05,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1611282.0, ans=0.0 2023-06-24 00:17:13,026 INFO [train.py:996] (0/4) Epoch 9, batch 24600, loss[loss=0.2092, simple_loss=0.2789, pruned_loss=0.06981, over 21700.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3189, pruned_loss=0.08556, over 4271048.67 frames. 
], batch size: 282, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:17:50,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1611462.0, ans=0.125 2023-06-24 00:18:33,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.043e+02 5.417e+02 8.330e+02 1.065e+03 1.781e+03, threshold=1.666e+03, percent-clipped=13.0 2023-06-24 00:18:52,542 INFO [train.py:996] (0/4) Epoch 9, batch 24650, loss[loss=0.2135, simple_loss=0.2737, pruned_loss=0.07662, over 21769.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3102, pruned_loss=0.08463, over 4268632.90 frames. ], batch size: 300, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:19:25,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1611702.0, ans=0.125 2023-06-24 00:19:49,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1611762.0, ans=0.125 2023-06-24 00:19:51,661 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-24 00:19:52,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1611762.0, ans=0.1 2023-06-24 00:20:32,034 INFO [train.py:996] (0/4) Epoch 9, batch 24700, loss[loss=0.2402, simple_loss=0.3056, pruned_loss=0.0874, over 21784.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3066, pruned_loss=0.08151, over 4271946.88 frames. ], batch size: 112, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:20:40,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1611942.0, ans=0.125 2023-06-24 00:20:44,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-24 00:20:46,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1611942.0, ans=0.125 2023-06-24 00:21:04,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1612002.0, ans=0.0 2023-06-24 00:21:53,241 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.694e+02 5.774e+02 7.890e+02 1.274e+03 2.911e+03, threshold=1.578e+03, percent-clipped=10.0 2023-06-24 00:22:03,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1612182.0, ans=0.125 2023-06-24 00:22:05,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-24 00:22:10,845 INFO [train.py:996] (0/4) Epoch 9, batch 24750, loss[loss=0.2381, simple_loss=0.2924, pruned_loss=0.09188, over 21418.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3004, pruned_loss=0.07902, over 4269382.65 frames. 
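Every batch entry splits the objective into loss, simple_loss and pruned_loss, the two components of pruned-transducer training. Checking the printed numbers in this section, 0.5 * simple_loss + pruned_loss reproduces the reported loss to the displayed precision (for example at batch 24600: 0.5 * 0.3189 + 0.08556 is about 0.2450), so the sketch below uses that fixed weighting; whether the weight is annealed earlier in training is not visible from this part of the log.

```python
# The reported per-batch "loss" as a fixed combination of the simple (linear
# lattice) loss and the pruned loss.  The 0.5 weight is inferred from the
# numbers printed in this section; the real training script may anneal it.
def combine_losses(simple_loss: float, pruned_loss: float,
                   simple_loss_scale: float = 0.5) -> float:
    return simple_loss_scale * simple_loss + pruned_loss


# Batch 24600 above: simple_loss=0.3189, pruned_loss=0.08556 -> loss ~= 0.2450
print(combine_losses(0.3189, 0.08556))
```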
], batch size: 509, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:22:13,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1612242.0, ans=0.0 2023-06-24 00:22:49,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1612362.0, ans=0.05 2023-06-24 00:23:14,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1612422.0, ans=0.125 2023-06-24 00:23:44,088 INFO [train.py:996] (0/4) Epoch 9, batch 24800, loss[loss=0.2482, simple_loss=0.2923, pruned_loss=0.102, over 21627.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2957, pruned_loss=0.07823, over 4274476.74 frames. ], batch size: 508, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:24:10,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1612602.0, ans=0.025 2023-06-24 00:24:24,020 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-24 00:24:44,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1612662.0, ans=0.0 2023-06-24 00:24:49,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1612722.0, ans=0.0 2023-06-24 00:24:49,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1612722.0, ans=0.0 2023-06-24 00:24:57,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1612722.0, ans=0.1 2023-06-24 00:25:07,477 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.818e+02 6.000e+02 9.294e+02 1.511e+03 3.142e+03, threshold=1.859e+03, percent-clipped=19.0 2023-06-24 00:25:22,855 INFO [train.py:996] (0/4) Epoch 9, batch 24850, loss[loss=0.2205, simple_loss=0.3001, pruned_loss=0.0704, over 21057.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.298, pruned_loss=0.08041, over 4282055.41 frames. ], batch size: 608, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:25:29,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1612842.0, ans=0.125 2023-06-24 00:25:55,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1612902.0, ans=0.125 2023-06-24 00:26:35,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-24 00:26:58,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1613082.0, ans=0.2 2023-06-24 00:27:06,906 INFO [train.py:996] (0/4) Epoch 9, batch 24900, loss[loss=0.2554, simple_loss=0.3223, pruned_loss=0.09429, over 21822.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2994, pruned_loss=0.08095, over 4283978.23 frames. 
], batch size: 247, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:27:21,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1613202.0, ans=0.125 2023-06-24 00:27:24,310 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=12.0 2023-06-24 00:28:06,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1613322.0, ans=0.2 2023-06-24 00:28:23,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1613322.0, ans=0.125 2023-06-24 00:28:36,560 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.759e+02 5.969e+02 8.739e+02 1.291e+03 2.372e+03, threshold=1.748e+03, percent-clipped=6.0 2023-06-24 00:28:48,016 INFO [train.py:996] (0/4) Epoch 9, batch 24950, loss[loss=0.2311, simple_loss=0.2781, pruned_loss=0.09205, over 20298.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3063, pruned_loss=0.08414, over 4285527.52 frames. ], batch size: 703, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:28:50,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1613442.0, ans=0.125 2023-06-24 00:28:52,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1613442.0, ans=0.1 2023-06-24 00:28:54,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=1613442.0, ans=12.0 2023-06-24 00:30:14,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1613682.0, ans=0.125 2023-06-24 00:30:29,803 INFO [train.py:996] (0/4) Epoch 9, batch 25000, loss[loss=0.2043, simple_loss=0.2843, pruned_loss=0.06216, over 21839.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.314, pruned_loss=0.08672, over 4281055.98 frames. ], batch size: 118, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:31:06,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1613802.0, ans=0.0 2023-06-24 00:31:16,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1613862.0, ans=0.1 2023-06-24 00:31:57,964 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.997e+02 6.297e+02 8.541e+02 1.164e+03 2.225e+03, threshold=1.708e+03, percent-clipped=6.0 2023-06-24 00:32:08,796 INFO [train.py:996] (0/4) Epoch 9, batch 25050, loss[loss=0.2118, simple_loss=0.2624, pruned_loss=0.08059, over 20295.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3068, pruned_loss=0.0852, over 4272282.35 frames. ], batch size: 703, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:32:53,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1614162.0, ans=0.1 2023-06-24 00:33:50,142 INFO [train.py:996] (0/4) Epoch 9, batch 25100, loss[loss=0.2143, simple_loss=0.3153, pruned_loss=0.05664, over 21259.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3013, pruned_loss=0.08333, over 4272193.58 frames. 
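Across this section the logged learning rate falls very slowly, from 3.24e-03 to 3.22e-03 over a few thousand batches, which is consistent with a smooth decay keyed on both the global batch count and the epoch. The functional form below would produce that behaviour, but the exponents and the default constants are placeholders and assumptions, not a quote of the recipe's scheduler.

```python
# Hedged sketch of a smooth batch/epoch-keyed learning-rate decay of the kind
# that would produce the slowly falling "lr:" values above.  Exponents and
# default constants are placeholders, not a copy of the real scheduler.
def decayed_lr(base_lr: float, batch: int, epoch: float,
               lr_batches: float = 5000.0, lr_epochs: float = 4.0) -> float:
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.5
    return base_lr * batch_factor * epoch_factor


# With placeholder constants this lands in the same few-times-1e-3 range as the
# lr values logged here, and changes only slightly between logging intervals.
print(decayed_lr(0.05, 268000, 9.0))
print(decayed_lr(0.05, 270000, 9.0))
```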
], batch size: 548, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:33:56,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1614342.0, ans=0.125 2023-06-24 00:34:04,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1614342.0, ans=0.2 2023-06-24 00:34:26,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1614402.0, ans=0.125 2023-06-24 00:35:11,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1614582.0, ans=22.5 2023-06-24 00:35:16,810 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.018e+02 5.962e+02 8.437e+02 1.206e+03 2.426e+03, threshold=1.687e+03, percent-clipped=3.0 2023-06-24 00:35:22,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1614582.0, ans=0.125 2023-06-24 00:35:27,832 INFO [train.py:996] (0/4) Epoch 9, batch 25150, loss[loss=0.276, simple_loss=0.342, pruned_loss=0.105, over 21709.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3058, pruned_loss=0.08162, over 4271970.72 frames. ], batch size: 508, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:35:42,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1614702.0, ans=0.0 2023-06-24 00:35:52,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1614702.0, ans=0.125 2023-06-24 00:36:24,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1614822.0, ans=0.0 2023-06-24 00:36:26,899 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-24 00:37:08,159 INFO [train.py:996] (0/4) Epoch 9, batch 25200, loss[loss=0.206, simple_loss=0.2977, pruned_loss=0.05717, over 21588.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3055, pruned_loss=0.07931, over 4264818.41 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:37:53,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1615062.0, ans=0.125 2023-06-24 00:38:02,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1615062.0, ans=0.0 2023-06-24 00:38:35,134 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.592e+02 5.246e+02 7.409e+02 1.372e+03 3.913e+03, threshold=1.482e+03, percent-clipped=20.0 2023-06-24 00:38:40,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1615182.0, ans=0.1 2023-06-24 00:38:46,417 INFO [train.py:996] (0/4) Epoch 9, batch 25250, loss[loss=0.1969, simple_loss=0.2691, pruned_loss=0.06233, over 21199.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.302, pruned_loss=0.07667, over 4257626.26 frames. 
], batch size: 159, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:38:46,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1615242.0, ans=0.125 2023-06-24 00:39:18,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1615302.0, ans=0.1 2023-06-24 00:39:46,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1615422.0, ans=0.1 2023-06-24 00:40:12,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1615482.0, ans=0.125 2023-06-24 00:40:14,940 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.85 vs. limit=22.5 2023-06-24 00:40:21,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-24 00:40:23,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1615542.0, ans=0.2 2023-06-24 00:40:24,976 INFO [train.py:996] (0/4) Epoch 9, batch 25300, loss[loss=0.2549, simple_loss=0.3383, pruned_loss=0.08578, over 21321.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2997, pruned_loss=0.07662, over 4239150.49 frames. ], batch size: 548, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:40:32,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-24 00:40:41,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1615602.0, ans=0.125 2023-06-24 00:41:05,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1615662.0, ans=0.0 2023-06-24 00:41:09,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1615662.0, ans=0.2 2023-06-24 00:41:38,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1615722.0, ans=0.125 2023-06-24 00:41:47,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1615782.0, ans=0.125 2023-06-24 00:41:55,188 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.408e+02 6.402e+02 8.209e+02 1.215e+03 2.497e+03, threshold=1.642e+03, percent-clipped=20.0 2023-06-24 00:41:58,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1615782.0, ans=0.125 2023-06-24 00:42:04,769 INFO [train.py:996] (0/4) Epoch 9, batch 25350, loss[loss=0.1812, simple_loss=0.2676, pruned_loss=0.04738, over 21591.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2998, pruned_loss=0.07571, over 4230709.51 frames. 
], batch size: 230, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:42:06,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1615842.0, ans=0.125 2023-06-24 00:42:23,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1615902.0, ans=0.0 2023-06-24 00:42:42,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1615902.0, ans=0.0 2023-06-24 00:43:08,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1616022.0, ans=0.95 2023-06-24 00:43:26,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1616022.0, ans=0.1 2023-06-24 00:43:29,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1616082.0, ans=0.125 2023-06-24 00:43:44,798 INFO [train.py:996] (0/4) Epoch 9, batch 25400, loss[loss=0.2033, simple_loss=0.2709, pruned_loss=0.0679, over 21523.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2959, pruned_loss=0.07444, over 4227932.47 frames. ], batch size: 441, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:43:45,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1616142.0, ans=0.0 2023-06-24 00:45:17,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.471e+02 5.867e+02 9.461e+02 1.414e+03 2.497e+03, threshold=1.892e+03, percent-clipped=14.0 2023-06-24 00:45:27,904 INFO [train.py:996] (0/4) Epoch 9, batch 25450, loss[loss=0.2443, simple_loss=0.3208, pruned_loss=0.08393, over 21284.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2971, pruned_loss=0.07637, over 4237548.68 frames. ], batch size: 143, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:46:21,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1616562.0, ans=0.0 2023-06-24 00:46:21,731 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-24 00:47:04,830 INFO [train.py:996] (0/4) Epoch 9, batch 25500, loss[loss=0.186, simple_loss=0.2708, pruned_loss=0.05056, over 21644.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.296, pruned_loss=0.07245, over 4242847.65 frames. 
], batch size: 230, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:47:07,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1616742.0, ans=0.125 2023-06-24 00:47:24,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1616742.0, ans=0.125 2023-06-24 00:47:26,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1616802.0, ans=0.1 2023-06-24 00:47:37,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1616802.0, ans=0.09899494936611666 2023-06-24 00:48:23,153 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:48:39,574 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.450e+02 4.898e+02 7.192e+02 1.024e+03 1.607e+03, threshold=1.438e+03, percent-clipped=0.0 2023-06-24 00:48:40,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1616982.0, ans=0.0 2023-06-24 00:48:49,474 INFO [train.py:996] (0/4) Epoch 9, batch 25550, loss[loss=0.1872, simple_loss=0.2557, pruned_loss=0.05937, over 15897.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3029, pruned_loss=0.07277, over 4242779.92 frames. ], batch size: 60, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:49:50,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1617162.0, ans=0.0 2023-06-24 00:50:34,409 INFO [train.py:996] (0/4) Epoch 9, batch 25600, loss[loss=0.3012, simple_loss=0.3708, pruned_loss=0.1158, over 21222.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3081, pruned_loss=0.07473, over 4257826.30 frames. ], batch size: 143, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:50:41,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1617342.0, ans=0.125 2023-06-24 00:51:16,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1617402.0, ans=0.1 2023-06-24 00:51:21,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1617462.0, ans=0.125 2023-06-24 00:51:21,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1617462.0, ans=0.0 2023-06-24 00:51:29,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1617462.0, ans=0.0 2023-06-24 00:51:59,682 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.613e+02 7.503e+02 1.087e+03 1.475e+03 2.223e+03, threshold=2.175e+03, percent-clipped=27.0 2023-06-24 00:52:10,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=22.5 2023-06-24 00:52:13,793 INFO [train.py:996] (0/4) Epoch 9, batch 25650, loss[loss=0.2416, simple_loss=0.3008, pruned_loss=0.09121, over 21275.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3098, pruned_loss=0.07817, over 4256172.75 frames. 
], batch size: 144, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:52:47,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1617702.0, ans=0.125 2023-06-24 00:52:49,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1617702.0, ans=0.125 2023-06-24 00:53:04,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1617762.0, ans=0.2 2023-06-24 00:53:54,174 INFO [train.py:996] (0/4) Epoch 9, batch 25700, loss[loss=0.2921, simple_loss=0.3495, pruned_loss=0.1173, over 21440.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3083, pruned_loss=0.07946, over 4252157.20 frames. ], batch size: 131, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:54:04,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1617942.0, ans=0.1 2023-06-24 00:54:10,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1617942.0, ans=0.2 2023-06-24 00:54:47,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1618062.0, ans=0.09899494936611666 2023-06-24 00:54:50,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1618062.0, ans=0.125 2023-06-24 00:55:10,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1618122.0, ans=0.125 2023-06-24 00:55:24,790 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.450e+02 6.568e+02 8.876e+02 1.245e+03 3.057e+03, threshold=1.775e+03, percent-clipped=5.0 2023-06-24 00:55:39,599 INFO [train.py:996] (0/4) Epoch 9, batch 25750, loss[loss=0.2207, simple_loss=0.2881, pruned_loss=0.07666, over 20018.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3139, pruned_loss=0.08246, over 4257684.13 frames. ], batch size: 702, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:56:01,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.92 vs. limit=15.0 2023-06-24 00:56:03,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1618242.0, ans=0.0 2023-06-24 00:57:02,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1618422.0, ans=0.125 2023-06-24 00:57:13,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1618482.0, ans=0.125 2023-06-24 00:57:25,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1618482.0, ans=0.05 2023-06-24 00:57:33,104 INFO [train.py:996] (0/4) Epoch 9, batch 25800, loss[loss=0.3, simple_loss=0.3711, pruned_loss=0.1145, over 21780.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3253, pruned_loss=0.08651, over 4262346.54 frames. ], batch size: 441, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:57:44,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. 
limit=15.0 2023-06-24 00:57:49,335 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:58:02,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1618602.0, ans=0.125 2023-06-24 00:58:55,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1618722.0, ans=0.125 2023-06-24 00:59:05,231 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.745e+02 6.884e+02 9.421e+02 1.458e+03 3.090e+03, threshold=1.884e+03, percent-clipped=11.0 2023-06-24 00:59:14,593 INFO [train.py:996] (0/4) Epoch 9, batch 25850, loss[loss=0.2527, simple_loss=0.3159, pruned_loss=0.0948, over 21343.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3265, pruned_loss=0.08615, over 4263708.76 frames. ], batch size: 143, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:00:17,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1619022.0, ans=0.0 2023-06-24 01:00:47,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1619082.0, ans=0.125 2023-06-24 01:00:56,730 INFO [train.py:996] (0/4) Epoch 9, batch 25900, loss[loss=0.3077, simple_loss=0.3932, pruned_loss=0.1111, over 21698.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3264, pruned_loss=0.08631, over 4270305.72 frames. ], batch size: 298, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:01:31,245 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.24 vs. limit=6.0 2023-06-24 01:01:45,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1619262.0, ans=0.125 2023-06-24 01:02:14,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-24 01:02:26,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. limit=10.0 2023-06-24 01:02:27,424 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.264e+02 6.970e+02 9.529e+02 1.427e+03 2.797e+03, threshold=1.906e+03, percent-clipped=4.0 2023-06-24 01:02:37,420 INFO [train.py:996] (0/4) Epoch 9, batch 25950, loss[loss=0.2507, simple_loss=0.3311, pruned_loss=0.08516, over 21929.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3325, pruned_loss=0.0889, over 4273560.00 frames. ], batch size: 372, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:04:04,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1619682.0, ans=0.125 2023-06-24 01:04:21,702 INFO [train.py:996] (0/4) Epoch 9, batch 26000, loss[loss=0.269, simple_loss=0.3469, pruned_loss=0.0956, over 21964.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3312, pruned_loss=0.08705, over 4274329.62 frames. 
], batch size: 317, lr: 3.22e-03, grad_scale: 32.0 2023-06-24 01:04:38,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1619742.0, ans=0.1 2023-06-24 01:05:01,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1619802.0, ans=0.2 2023-06-24 01:05:19,095 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:05:49,124 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.394e+02 5.949e+02 7.869e+02 1.155e+03 1.920e+03, threshold=1.574e+03, percent-clipped=1.0 2023-06-24 01:05:50,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1619982.0, ans=0.09899494936611666 2023-06-24 01:05:52,182 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=12.0 2023-06-24 01:06:01,896 INFO [train.py:996] (0/4) Epoch 9, batch 26050, loss[loss=0.2946, simple_loss=0.3408, pruned_loss=0.1242, over 21726.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3317, pruned_loss=0.08862, over 4271601.71 frames. ], batch size: 508, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:06:18,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1620042.0, ans=0.125 2023-06-24 01:06:20,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1620042.0, ans=0.125 2023-06-24 01:06:46,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1620102.0, ans=0.09899494936611666 2023-06-24 01:06:50,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-24 01:07:21,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1620282.0, ans=0.1 2023-06-24 01:07:31,979 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-24 01:07:42,031 INFO [train.py:996] (0/4) Epoch 9, batch 26100, loss[loss=0.2234, simple_loss=0.2957, pruned_loss=0.07553, over 21890.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3267, pruned_loss=0.08822, over 4279308.48 frames. ], batch size: 371, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:08:03,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1620402.0, ans=0.0 2023-06-24 01:08:59,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-24 01:09:14,428 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 5.600e+02 7.578e+02 1.132e+03 2.519e+03, threshold=1.516e+03, percent-clipped=12.0 2023-06-24 01:09:15,603 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.72 vs. 
limit=6.0 2023-06-24 01:09:27,695 INFO [train.py:996] (0/4) Epoch 9, batch 26150, loss[loss=0.2602, simple_loss=0.3463, pruned_loss=0.08707, over 21831.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.324, pruned_loss=0.08814, over 4282185.38 frames. ], batch size: 124, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:10:10,550 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.02 vs. limit=15.0 2023-06-24 01:10:23,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-24 01:11:08,847 INFO [train.py:996] (0/4) Epoch 9, batch 26200, loss[loss=0.221, simple_loss=0.2797, pruned_loss=0.08114, over 20064.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3235, pruned_loss=0.08568, over 4286009.54 frames. ], batch size: 702, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:12:40,879 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.187e+02 6.061e+02 7.931e+02 1.081e+03 1.881e+03, threshold=1.586e+03, percent-clipped=8.0 2023-06-24 01:12:42,030 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2023-06-24 01:12:48,754 INFO [train.py:996] (0/4) Epoch 9, batch 26250, loss[loss=0.2358, simple_loss=0.3066, pruned_loss=0.08249, over 21516.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3263, pruned_loss=0.08398, over 4278771.07 frames. ], batch size: 194, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:13:05,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1621242.0, ans=0.1 2023-06-24 01:13:11,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1621302.0, ans=0.0 2023-06-24 01:13:25,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1621362.0, ans=0.125 2023-06-24 01:13:44,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1621422.0, ans=0.1 2023-06-24 01:13:45,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-06-24 01:13:53,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1621422.0, ans=0.125 2023-06-24 01:13:54,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1621422.0, ans=0.0 2023-06-24 01:14:14,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-24 01:14:27,405 INFO [train.py:996] (0/4) Epoch 9, batch 26300, loss[loss=0.235, simple_loss=0.3014, pruned_loss=0.08434, over 21680.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3232, pruned_loss=0.08475, over 4285726.81 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:14:44,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.31 vs. 
limit=15.0 2023-06-24 01:14:46,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-24 01:14:48,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1621602.0, ans=0.0 2023-06-24 01:15:05,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1621662.0, ans=0.125 2023-06-24 01:15:49,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-24 01:15:50,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1621782.0, ans=0.0 2023-06-24 01:16:00,716 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.956e+02 5.501e+02 7.809e+02 1.118e+03 2.350e+03, threshold=1.562e+03, percent-clipped=10.0 2023-06-24 01:16:08,912 INFO [train.py:996] (0/4) Epoch 9, batch 26350, loss[loss=0.2819, simple_loss=0.348, pruned_loss=0.1079, over 21479.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3214, pruned_loss=0.08551, over 4289130.96 frames. ], batch size: 194, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:16:10,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1621842.0, ans=0.0 2023-06-24 01:17:08,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0 2023-06-24 01:17:50,277 INFO [train.py:996] (0/4) Epoch 9, batch 26400, loss[loss=0.2135, simple_loss=0.2675, pruned_loss=0.07973, over 21610.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3169, pruned_loss=0.086, over 4274492.74 frames. ], batch size: 231, lr: 3.22e-03, grad_scale: 32.0 2023-06-24 01:18:09,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1622202.0, ans=10.0 2023-06-24 01:19:21,504 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-24 01:19:27,019 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.128e+02 6.613e+02 9.874e+02 1.376e+03 2.693e+03, threshold=1.975e+03, percent-clipped=17.0 2023-06-24 01:19:29,250 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:19:33,672 INFO [train.py:996] (0/4) Epoch 9, batch 26450, loss[loss=0.2567, simple_loss=0.344, pruned_loss=0.08471, over 21585.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3171, pruned_loss=0.08538, over 4274211.96 frames. ], batch size: 230, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:19:36,893 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-24 01:19:47,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1622442.0, ans=0.1 2023-06-24 01:19:50,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.45 vs. 
limit=15.0 2023-06-24 01:19:59,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1622502.0, ans=0.1 2023-06-24 01:20:14,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1622502.0, ans=0.125 2023-06-24 01:20:46,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1622622.0, ans=0.2 2023-06-24 01:20:52,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1622622.0, ans=0.125 2023-06-24 01:20:58,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1622682.0, ans=0.0 2023-06-24 01:21:10,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1622682.0, ans=0.125 2023-06-24 01:21:16,745 INFO [train.py:996] (0/4) Epoch 9, batch 26500, loss[loss=0.2461, simple_loss=0.3123, pruned_loss=0.08995, over 21634.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3191, pruned_loss=0.08419, over 4265827.15 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:22:12,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1622862.0, ans=0.1 2023-06-24 01:22:14,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.18 vs. limit=5.0 2023-06-24 01:22:15,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1622862.0, ans=0.2 2023-06-24 01:22:18,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1622862.0, ans=0.125 2023-06-24 01:22:59,837 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.503e+02 8.000e+02 1.204e+03 2.180e+03 3.765e+03, threshold=2.409e+03, percent-clipped=29.0 2023-06-24 01:23:05,413 INFO [train.py:996] (0/4) Epoch 9, batch 26550, loss[loss=0.206, simple_loss=0.3195, pruned_loss=0.04628, over 20790.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3159, pruned_loss=0.08122, over 4270134.61 frames. ], batch size: 608, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:23:29,894 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-24 01:24:01,240 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-24 01:24:10,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1623222.0, ans=0.0 2023-06-24 01:24:28,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1623282.0, ans=0.0 2023-06-24 01:24:31,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1623282.0, ans=0.125 2023-06-24 01:24:55,381 INFO [train.py:996] (0/4) Epoch 9, batch 26600, loss[loss=0.2561, simple_loss=0.3141, pruned_loss=0.09902, over 20152.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3143, pruned_loss=0.07797, over 4273464.80 frames. 
], batch size: 702, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:25:41,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1623462.0, ans=0.1 2023-06-24 01:26:13,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1623582.0, ans=0.125 2023-06-24 01:26:33,773 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.884e+02 5.096e+02 6.459e+02 8.360e+02 2.532e+03, threshold=1.292e+03, percent-clipped=3.0 2023-06-24 01:26:38,311 INFO [train.py:996] (0/4) Epoch 9, batch 26650, loss[loss=0.201, simple_loss=0.2841, pruned_loss=0.05891, over 21871.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.307, pruned_loss=0.07671, over 4272527.85 frames. ], batch size: 373, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:26:47,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1623642.0, ans=0.125 2023-06-24 01:28:05,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1623882.0, ans=0.0 2023-06-24 01:28:17,070 INFO [train.py:996] (0/4) Epoch 9, batch 26700, loss[loss=0.2641, simple_loss=0.3176, pruned_loss=0.1053, over 21776.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2993, pruned_loss=0.07376, over 4256589.96 frames. ], batch size: 508, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:28:33,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1623942.0, ans=0.0 2023-06-24 01:29:26,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1624122.0, ans=0.0 2023-06-24 01:29:55,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1624182.0, ans=0.09899494936611666 2023-06-24 01:29:58,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.138e+02 5.000e+02 7.039e+02 1.019e+03 2.446e+03, threshold=1.408e+03, percent-clipped=9.0 2023-06-24 01:30:03,537 INFO [train.py:996] (0/4) Epoch 9, batch 26750, loss[loss=0.2401, simple_loss=0.3329, pruned_loss=0.07362, over 20768.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3005, pruned_loss=0.07345, over 4262131.71 frames. ], batch size: 607, lr: 3.21e-03, grad_scale: 8.0 2023-06-24 01:30:15,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1624242.0, ans=0.0 2023-06-24 01:30:15,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1624242.0, ans=0.125 2023-06-24 01:30:48,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1624362.0, ans=0.05 2023-06-24 01:31:44,746 INFO [train.py:996] (0/4) Epoch 9, batch 26800, loss[loss=0.2928, simple_loss=0.3604, pruned_loss=0.1127, over 21365.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3086, pruned_loss=0.07833, over 4272576.75 frames. 
], batch size: 159, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:31:52,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1624542.0, ans=0.125 2023-06-24 01:32:47,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1624722.0, ans=0.025 2023-06-24 01:33:19,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.341e+02 8.835e+02 1.201e+03 2.832e+03, threshold=1.767e+03, percent-clipped=16.0 2023-06-24 01:33:24,515 INFO [train.py:996] (0/4) Epoch 9, batch 26850, loss[loss=0.2375, simple_loss=0.2906, pruned_loss=0.09217, over 21289.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3116, pruned_loss=0.08146, over 4272151.53 frames. ], batch size: 159, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:33:41,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1624842.0, ans=0.125 2023-06-24 01:33:56,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=12.0 2023-06-24 01:34:03,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1624962.0, ans=0.2 2023-06-24 01:34:25,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1624962.0, ans=0.2 2023-06-24 01:34:51,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1625082.0, ans=0.125 2023-06-24 01:35:06,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1625142.0, ans=0.125 2023-06-24 01:35:07,323 INFO [train.py:996] (0/4) Epoch 9, batch 26900, loss[loss=0.2178, simple_loss=0.274, pruned_loss=0.0808, over 21508.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3035, pruned_loss=0.08071, over 4274360.78 frames. ], batch size: 442, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:35:26,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1625202.0, ans=0.1 2023-06-24 01:35:45,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1625262.0, ans=0.0 2023-06-24 01:36:16,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1625322.0, ans=0.125 2023-06-24 01:36:30,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1625382.0, ans=0.0 2023-06-24 01:36:37,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.966e+02 5.702e+02 7.755e+02 1.168e+03 3.900e+03, threshold=1.551e+03, percent-clipped=4.0 2023-06-24 01:36:41,778 INFO [train.py:996] (0/4) Epoch 9, batch 26950, loss[loss=0.2627, simple_loss=0.3479, pruned_loss=0.08877, over 21568.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3033, pruned_loss=0.08049, over 4272752.56 frames. 
], batch size: 441, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:38:05,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1625682.0, ans=0.125 2023-06-24 01:38:06,547 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.61 vs. limit=15.0 2023-06-24 01:38:15,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1625682.0, ans=0.125 2023-06-24 01:38:23,733 INFO [train.py:996] (0/4) Epoch 9, batch 27000, loss[loss=0.1989, simple_loss=0.2792, pruned_loss=0.05934, over 21599.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3021, pruned_loss=0.07713, over 4269399.98 frames. ], batch size: 263, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:38:23,734 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 01:38:38,296 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7450, 3.7217, 2.1309, 1.6850], device='cuda:0') 2023-06-24 01:38:43,005 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2397, simple_loss=0.3375, pruned_loss=0.07102, over 1796401.00 frames. 2023-06-24 01:38:43,005 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 01:39:09,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1625802.0, ans=0.0 2023-06-24 01:39:13,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1625802.0, ans=0.0 2023-06-24 01:39:34,170 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.83 vs. limit=10.0 2023-06-24 01:39:38,877 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.46 vs. limit=15.0 2023-06-24 01:39:48,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1625922.0, ans=0.0 2023-06-24 01:39:53,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1625922.0, ans=0.125 2023-06-24 01:40:18,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.845e+02 5.629e+02 9.070e+02 1.221e+03 2.568e+03, threshold=1.814e+03, percent-clipped=16.0 2023-06-24 01:40:23,234 INFO [train.py:996] (0/4) Epoch 9, batch 27050, loss[loss=0.2219, simple_loss=0.3101, pruned_loss=0.06679, over 21636.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3049, pruned_loss=0.07477, over 4274963.08 frames. ], batch size: 263, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:40:24,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1626042.0, ans=15.0 2023-06-24 01:41:53,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1626282.0, ans=0.125 2023-06-24 01:41:59,157 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-24 01:42:04,841 INFO [train.py:996] (0/4) Epoch 9, batch 27100, loss[loss=0.2345, simple_loss=0.3292, pruned_loss=0.06995, over 21467.00 frames. 
], tot_loss[loss=0.2304, simple_loss=0.3079, pruned_loss=0.07648, over 4281654.55 frames. ], batch size: 548, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:42:08,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1626342.0, ans=0.125 2023-06-24 01:42:47,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1626402.0, ans=0.0 2023-06-24 01:43:04,156 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-24 01:43:04,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1626462.0, ans=0.2 2023-06-24 01:43:34,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1626582.0, ans=0.0 2023-06-24 01:43:42,163 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.346e+02 6.138e+02 8.870e+02 1.301e+03 2.299e+03, threshold=1.774e+03, percent-clipped=4.0 2023-06-24 01:43:47,602 INFO [train.py:996] (0/4) Epoch 9, batch 27150, loss[loss=0.2975, simple_loss=0.3787, pruned_loss=0.1082, over 21855.00 frames. ], tot_loss[loss=0.24, simple_loss=0.32, pruned_loss=0.07996, over 4274799.56 frames. ], batch size: 316, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:43:51,888 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.50 vs. limit=8.0 2023-06-24 01:45:16,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1626882.0, ans=0.125 2023-06-24 01:45:30,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-24 01:45:32,218 INFO [train.py:996] (0/4) Epoch 9, batch 27200, loss[loss=0.2852, simple_loss=0.3564, pruned_loss=0.1069, over 21730.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3278, pruned_loss=0.08228, over 4275489.11 frames. ], batch size: 298, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 01:45:46,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1626942.0, ans=0.2 2023-06-24 01:45:58,310 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.26 vs. limit=10.0 2023-06-24 01:46:18,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1627062.0, ans=0.0 2023-06-24 01:46:33,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1627062.0, ans=0.09899494936611666 2023-06-24 01:47:17,487 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 6.892e+02 9.906e+02 1.306e+03 3.118e+03, threshold=1.981e+03, percent-clipped=15.0 2023-06-24 01:47:22,732 INFO [train.py:996] (0/4) Epoch 9, batch 27250, loss[loss=0.3162, simple_loss=0.3733, pruned_loss=0.1295, over 21801.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3299, pruned_loss=0.08589, over 4273262.10 frames. 
], batch size: 441, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 01:48:12,238 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:48:59,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1627482.0, ans=0.125 2023-06-24 01:49:03,894 INFO [train.py:996] (0/4) Epoch 9, batch 27300, loss[loss=0.173, simple_loss=0.2463, pruned_loss=0.04988, over 16626.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3313, pruned_loss=0.08736, over 4269895.37 frames. ], batch size: 60, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:49:13,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1627542.0, ans=0.1 2023-06-24 01:49:15,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1627542.0, ans=0.0 2023-06-24 01:49:33,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1627602.0, ans=0.1 2023-06-24 01:49:36,952 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:50:17,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1627722.0, ans=0.1 2023-06-24 01:50:40,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.677e+02 5.255e+02 6.671e+02 8.856e+02 1.687e+03, threshold=1.334e+03, percent-clipped=0.0 2023-06-24 01:50:43,364 INFO [train.py:996] (0/4) Epoch 9, batch 27350, loss[loss=0.2732, simple_loss=0.3969, pruned_loss=0.07469, over 19873.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3346, pruned_loss=0.08743, over 4259723.36 frames. ], batch size: 702, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:50:51,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1627842.0, ans=0.0 2023-06-24 01:51:12,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1627902.0, ans=0.125 2023-06-24 01:51:24,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1627962.0, ans=15.0 2023-06-24 01:52:21,423 INFO [train.py:996] (0/4) Epoch 9, batch 27400, loss[loss=0.253, simple_loss=0.3131, pruned_loss=0.09642, over 21803.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3297, pruned_loss=0.08704, over 4264706.79 frames. ], batch size: 371, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:52:55,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1628202.0, ans=0.0 2023-06-24 01:53:01,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1628262.0, ans=6.0 2023-06-24 01:53:09,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.00 vs. 
limit=6.0 2023-06-24 01:53:14,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1628262.0, ans=0.02 2023-06-24 01:53:38,016 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-24 01:53:58,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.992e+02 5.150e+02 6.408e+02 1.007e+03 1.892e+03, threshold=1.282e+03, percent-clipped=7.0 2023-06-24 01:54:00,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1628442.0, ans=0.1 2023-06-24 01:54:01,983 INFO [train.py:996] (0/4) Epoch 9, batch 27450, loss[loss=0.2603, simple_loss=0.3328, pruned_loss=0.09386, over 21880.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3232, pruned_loss=0.08546, over 4275239.11 frames. ], batch size: 316, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:54:11,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1628442.0, ans=0.2 2023-06-24 01:55:06,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1628622.0, ans=0.125 2023-06-24 01:55:36,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1628682.0, ans=0.0 2023-06-24 01:55:36,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1628682.0, ans=0.1 2023-06-24 01:55:39,278 INFO [train.py:996] (0/4) Epoch 9, batch 27500, loss[loss=0.2421, simple_loss=0.3058, pruned_loss=0.08918, over 21531.00 frames. ], tot_loss[loss=0.246, simple_loss=0.321, pruned_loss=0.08555, over 4275847.58 frames. ], batch size: 194, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:56:32,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1628862.0, ans=0.125 2023-06-24 01:56:52,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1628922.0, ans=0.125 2023-06-24 01:57:00,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1628982.0, ans=0.0 2023-06-24 01:57:02,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-24 01:57:11,073 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.736e+02 5.167e+02 6.606e+02 9.536e+02 1.970e+03, threshold=1.321e+03, percent-clipped=8.0 2023-06-24 01:57:18,525 INFO [train.py:996] (0/4) Epoch 9, batch 27550, loss[loss=0.2503, simple_loss=0.3111, pruned_loss=0.09475, over 21412.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3154, pruned_loss=0.08238, over 4273478.62 frames. 
], batch size: 507, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:57:23,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1629042.0, ans=0.1 2023-06-24 01:58:21,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1629222.0, ans=0.125 2023-06-24 01:58:23,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1629222.0, ans=0.125 2023-06-24 01:58:56,723 INFO [train.py:996] (0/4) Epoch 9, batch 27600, loss[loss=0.2392, simple_loss=0.2951, pruned_loss=0.09168, over 21839.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3108, pruned_loss=0.08198, over 4270337.76 frames. ], batch size: 107, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 01:59:16,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1629402.0, ans=0.2 2023-06-24 01:59:28,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1629402.0, ans=0.0 2023-06-24 01:59:45,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1629462.0, ans=0.125 2023-06-24 02:00:02,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1629522.0, ans=0.5 2023-06-24 02:00:05,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1629522.0, ans=0.0 2023-06-24 02:00:27,997 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.819e+02 6.454e+02 1.076e+03 1.748e+03 4.788e+03, threshold=2.152e+03, percent-clipped=40.0 2023-06-24 02:00:31,099 INFO [train.py:996] (0/4) Epoch 9, batch 27650, loss[loss=0.2298, simple_loss=0.2983, pruned_loss=0.08067, over 21817.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3044, pruned_loss=0.08074, over 4260124.14 frames. ], batch size: 351, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 02:00:43,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1629642.0, ans=0.125 2023-06-24 02:01:25,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1629762.0, ans=0.025 2023-06-24 02:01:33,287 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-24 02:01:49,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1629822.0, ans=0.035 2023-06-24 02:02:14,151 INFO [train.py:996] (0/4) Epoch 9, batch 27700, loss[loss=0.2172, simple_loss=0.2996, pruned_loss=0.06742, over 21611.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3052, pruned_loss=0.07908, over 4261342.07 frames. 
], batch size: 230, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:02:28,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1629942.0, ans=15.0 2023-06-24 02:02:52,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1630002.0, ans=0.125 2023-06-24 02:03:53,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.601e+02 6.018e+02 9.406e+02 1.396e+03 2.807e+03, threshold=1.881e+03, percent-clipped=3.0 2023-06-24 02:03:55,127 INFO [train.py:996] (0/4) Epoch 9, batch 27750, loss[loss=0.1976, simple_loss=0.2788, pruned_loss=0.0582, over 21161.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3069, pruned_loss=0.07828, over 4260993.03 frames. ], batch size: 159, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:04:13,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1630302.0, ans=0.1 2023-06-24 02:04:51,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1630362.0, ans=0.125 2023-06-24 02:04:52,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1630362.0, ans=0.125 2023-06-24 02:05:28,040 INFO [train.py:996] (0/4) Epoch 9, batch 27800, loss[loss=0.2157, simple_loss=0.2803, pruned_loss=0.07554, over 21586.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.305, pruned_loss=0.0779, over 4273610.33 frames. ], batch size: 195, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:05:55,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1630602.0, ans=0.0 2023-06-24 02:05:58,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1630602.0, ans=0.1 2023-06-24 02:06:14,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1630662.0, ans=0.125 2023-06-24 02:06:25,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1630662.0, ans=0.0 2023-06-24 02:06:32,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1630662.0, ans=0.0 2023-06-24 02:06:47,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1630722.0, ans=0.2 2023-06-24 02:06:57,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-24 02:07:10,676 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.786e+02 6.303e+02 8.930e+02 1.307e+03 2.305e+03, threshold=1.786e+03, percent-clipped=6.0 2023-06-24 02:07:12,469 INFO [train.py:996] (0/4) Epoch 9, batch 27850, loss[loss=0.2502, simple_loss=0.3146, pruned_loss=0.09288, over 21870.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3042, pruned_loss=0.07942, over 4285002.31 frames. 
], batch size: 371, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:07:23,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1630842.0, ans=0.125 2023-06-24 02:07:46,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1630902.0, ans=0.125 2023-06-24 02:07:57,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1630962.0, ans=0.1 2023-06-24 02:08:49,388 INFO [train.py:996] (0/4) Epoch 9, batch 27900, loss[loss=0.2128, simple_loss=0.3088, pruned_loss=0.05838, over 21644.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3141, pruned_loss=0.0811, over 4280751.46 frames. ], batch size: 263, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:09:31,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-24 02:09:36,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1631262.0, ans=0.0 2023-06-24 02:10:44,056 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.905e+02 5.823e+02 7.958e+02 1.181e+03 2.526e+03, threshold=1.592e+03, percent-clipped=8.0 2023-06-24 02:10:45,602 INFO [train.py:996] (0/4) Epoch 9, batch 27950, loss[loss=0.2404, simple_loss=0.3301, pruned_loss=0.07536, over 21737.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.314, pruned_loss=0.07812, over 4283493.00 frames. ], batch size: 351, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:10:48,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1631442.0, ans=15.0 2023-06-24 02:11:39,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1631622.0, ans=0.1 2023-06-24 02:12:18,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1631742.0, ans=0.125 2023-06-24 02:12:20,044 INFO [train.py:996] (0/4) Epoch 9, batch 28000, loss[loss=0.2167, simple_loss=0.292, pruned_loss=0.07065, over 21407.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.312, pruned_loss=0.07652, over 4283161.35 frames. ], batch size: 194, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 02:12:28,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1631742.0, ans=0.1 2023-06-24 02:12:28,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1631742.0, ans=0.0 2023-06-24 02:12:47,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1631802.0, ans=0.125 2023-06-24 02:12:59,836 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.37 vs. limit=22.5 2023-06-24 02:13:01,392 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. 
limit=6.0 2023-06-24 02:13:16,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1631922.0, ans=0.125 2023-06-24 02:13:24,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1631922.0, ans=0.0 2023-06-24 02:13:37,542 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-272000.pt 2023-06-24 02:14:05,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1632042.0, ans=0.0 2023-06-24 02:14:06,385 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.113e+02 6.367e+02 8.563e+02 1.195e+03 2.843e+03, threshold=1.713e+03, percent-clipped=10.0 2023-06-24 02:14:06,416 INFO [train.py:996] (0/4) Epoch 9, batch 28050, loss[loss=0.2094, simple_loss=0.2883, pruned_loss=0.06529, over 21858.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3113, pruned_loss=0.07824, over 4285546.57 frames. ], batch size: 316, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:15:07,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1632222.0, ans=0.0 2023-06-24 02:15:45,597 INFO [train.py:996] (0/4) Epoch 9, batch 28100, loss[loss=0.2124, simple_loss=0.2676, pruned_loss=0.07856, over 21559.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3086, pruned_loss=0.07823, over 4285092.43 frames. ], batch size: 195, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:15:46,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.85 vs. limit=22.5 2023-06-24 02:16:23,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1632462.0, ans=0.0 2023-06-24 02:17:22,319 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.058e+02 5.561e+02 8.006e+02 1.245e+03 3.714e+03, threshold=1.601e+03, percent-clipped=14.0 2023-06-24 02:17:22,349 INFO [train.py:996] (0/4) Epoch 9, batch 28150, loss[loss=0.2081, simple_loss=0.2739, pruned_loss=0.07118, over 21440.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.302, pruned_loss=0.07808, over 4283876.49 frames. ], batch size: 389, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:17:29,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1632642.0, ans=0.125 2023-06-24 02:17:46,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=22.5 2023-06-24 02:18:09,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1632762.0, ans=0.0 2023-06-24 02:18:15,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1632822.0, ans=0.125 2023-06-24 02:18:40,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1632882.0, ans=0.025 2023-06-24 02:18:56,322 INFO [train.py:996] (0/4) Epoch 9, batch 28200, loss[loss=0.2672, simple_loss=0.3324, pruned_loss=0.101, over 21571.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3022, pruned_loss=0.07941, over 4280877.71 frames. 
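The checkpoint.py line above writes zipformer/exp_L_small_causal/checkpoint-272000.pt on a fixed batch cadence, and separate per-epoch files (such as epoch-9.pt, saved later in this log) are written at epoch boundaries. A minimal sketch of that pattern follows; the save interval and the fields stored in the checkpoint dict are assumptions for illustration, not details read from the training script.

import torch
from pathlib import Path

def maybe_save_checkpoint(model, optimizer, exp_dir, batch_idx_train, save_every_n):
    """Write exp_dir/checkpoint-<batch_idx>.pt every save_every_n batches.

    Sketch only: the interval and the stored fields are assumed.
    """
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return None
    path = Path(exp_dir) / f"checkpoint-{batch_idx_train}.pt"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx_train}, path)
    return path

def save_epoch_checkpoint(model, exp_dir, epoch):
    """Write exp_dir/epoch-<n>.pt at the end of an epoch."""
    path = Path(exp_dir) / f"epoch-{epoch}.pt"
    torch.save({"model": model.state_dict(), "epoch": epoch}, path)
    return path
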
], batch size: 414, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:19:33,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1633062.0, ans=0.0 2023-06-24 02:19:44,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1633062.0, ans=0.125 2023-06-24 02:20:33,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1633182.0, ans=0.125 2023-06-24 02:20:35,756 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.373e+02 7.070e+02 1.041e+03 1.545e+03 2.791e+03, threshold=2.082e+03, percent-clipped=22.0 2023-06-24 02:20:35,778 INFO [train.py:996] (0/4) Epoch 9, batch 28250, loss[loss=0.1874, simple_loss=0.2383, pruned_loss=0.06826, over 20859.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3043, pruned_loss=0.08212, over 4277756.20 frames. ], batch size: 609, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:20:36,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1633242.0, ans=0.07 2023-06-24 02:20:43,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1633242.0, ans=0.05 2023-06-24 02:20:50,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1633302.0, ans=0.0 2023-06-24 02:22:17,414 INFO [train.py:996] (0/4) Epoch 9, batch 28300, loss[loss=0.1877, simple_loss=0.2912, pruned_loss=0.04211, over 21718.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3027, pruned_loss=0.08064, over 4273348.05 frames. ], batch size: 332, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:23:24,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1633722.0, ans=0.125 2023-06-24 02:23:40,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1633782.0, ans=0.95 2023-06-24 02:23:46,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1633782.0, ans=0.125 2023-06-24 02:23:49,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1633782.0, ans=0.125 2023-06-24 02:23:55,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=1633842.0, ans=0.02 2023-06-24 02:23:56,798 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 5.879e+02 9.596e+02 1.261e+03 3.601e+03, threshold=1.919e+03, percent-clipped=6.0 2023-06-24 02:23:56,828 INFO [train.py:996] (0/4) Epoch 9, batch 28350, loss[loss=0.2545, simple_loss=0.339, pruned_loss=0.08497, over 21413.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2993, pruned_loss=0.07514, over 4275221.53 frames. ], batch size: 507, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:23:59,512 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.95 vs. 
limit=12.0 2023-06-24 02:24:19,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1633902.0, ans=0.125 2023-06-24 02:25:20,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1634082.0, ans=0.025 2023-06-24 02:25:40,930 INFO [train.py:996] (0/4) Epoch 9, batch 28400, loss[loss=0.2436, simple_loss=0.3045, pruned_loss=0.0913, over 21640.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2966, pruned_loss=0.0748, over 4266618.19 frames. ], batch size: 263, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 02:25:47,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1634142.0, ans=0.125 2023-06-24 02:26:09,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1634202.0, ans=0.0 2023-06-24 02:26:22,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1634262.0, ans=0.125 2023-06-24 02:26:52,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1634322.0, ans=0.125 2023-06-24 02:27:18,256 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.912e+02 5.958e+02 8.832e+02 1.277e+03 2.222e+03, threshold=1.766e+03, percent-clipped=4.0 2023-06-24 02:27:18,276 INFO [train.py:996] (0/4) Epoch 9, batch 28450, loss[loss=0.2463, simple_loss=0.3137, pruned_loss=0.08944, over 21820.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3005, pruned_loss=0.07757, over 4267000.79 frames. ], batch size: 282, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:27:26,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1634442.0, ans=0.5 2023-06-24 02:27:39,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1634502.0, ans=0.125 2023-06-24 02:28:09,479 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:28:11,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1634562.0, ans=0.125 2023-06-24 02:28:37,440 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=15.0 2023-06-24 02:28:55,997 INFO [train.py:996] (0/4) Epoch 9, batch 28500, loss[loss=0.2638, simple_loss=0.323, pruned_loss=0.1024, over 21811.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.303, pruned_loss=0.0804, over 4280910.54 frames. ], batch size: 247, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:29:15,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1634802.0, ans=0.1 2023-06-24 02:29:38,030 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. 
limit=15.0 2023-06-24 02:30:24,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1634982.0, ans=0.0 2023-06-24 02:30:40,777 INFO [train.py:996] (0/4) Epoch 9, batch 28550, loss[loss=0.2455, simple_loss=0.3524, pruned_loss=0.0693, over 21882.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3117, pruned_loss=0.08342, over 4287334.48 frames. ], batch size: 372, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:30:42,308 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.957e+02 5.998e+02 7.738e+02 1.217e+03 2.112e+03, threshold=1.548e+03, percent-clipped=6.0 2023-06-24 02:30:55,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1635042.0, ans=0.0 2023-06-24 02:30:59,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1635102.0, ans=0.125 2023-06-24 02:31:11,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1635102.0, ans=0.125 2023-06-24 02:31:56,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1635222.0, ans=0.0 2023-06-24 02:31:59,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1635282.0, ans=0.125 2023-06-24 02:32:24,259 INFO [train.py:996] (0/4) Epoch 9, batch 28600, loss[loss=0.263, simple_loss=0.3354, pruned_loss=0.09529, over 21708.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3185, pruned_loss=0.08561, over 4289547.98 frames. ], batch size: 298, lr: 3.20e-03, grad_scale: 8.0 2023-06-24 02:33:01,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1635462.0, ans=0.125 2023-06-24 02:33:20,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1635522.0, ans=0.09899494936611666 2023-06-24 02:33:25,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1635522.0, ans=0.125 2023-06-24 02:34:02,765 INFO [train.py:996] (0/4) Epoch 9, batch 28650, loss[loss=0.2302, simple_loss=0.3043, pruned_loss=0.07801, over 20050.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3135, pruned_loss=0.08472, over 4282941.32 frames. ], batch size: 702, lr: 3.20e-03, grad_scale: 8.0 2023-06-24 02:34:11,199 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.746e+02 6.069e+02 8.380e+02 1.162e+03 2.307e+03, threshold=1.676e+03, percent-clipped=7.0 2023-06-24 02:34:28,822 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:35:26,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1635882.0, ans=0.0 2023-06-24 02:35:28,999 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:35:46,698 INFO [train.py:996] (0/4) Epoch 9, batch 28700, loss[loss=0.1929, simple_loss=0.2406, pruned_loss=0.0726, over 20022.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3113, pruned_loss=0.08502, over 4283512.31 frames. 
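In the Clipping_scale lines above, the five values read like a five-number summary (min, lower quartile, median, upper quartile, max) of recent gradient norms, and the reported threshold is consistently 2.0 times the logged median (for example 2 x 8.380e+02 is approximately the 1.676e+03 threshold just above). The sketch below reproduces that bookkeeping; the window length, reset behaviour, and the exact clipping rule of the real optimizer are assumptions.

from collections import deque
import torch

class QuartileGradClipper:
    """Track recent gradient norms and clip with a median-based threshold.

    Minimal sketch of the `grad-norm quartiles ... threshold ...
    percent-clipped` bookkeeping: threshold = clipping_scale * median of the
    stored norms.  Window size and clipping rule are assumed, not taken from
    the optimizer code that produced this log.
    """

    def __init__(self, clipping_scale: float = 2.0, window: int = 1000):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.seen = 0
        self.clipped = 0

    def __call__(self, parameters):
        grads = [p.grad for p in parameters if p.grad is not None]
        total_norm = float(torch.norm(torch.stack([g.detach().norm(2) for g in grads])))
        self.norms.append(total_norm)
        self.seen += 1

        s = sorted(self.norms)
        # min, 25%, 50%, 75%, max of the recent norms
        quartiles = [s[int(q * (len(s) - 1))] for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * quartiles[2]

        if total_norm > threshold:
            self.clipped += 1
            for g in grads:
                g.detach().mul_(threshold / (total_norm + 1e-6))

        percent_clipped = 100.0 * self.clipped / self.seen
        return quartiles, threshold, percent_clipped
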
], batch size: 702, lr: 3.20e-03, grad_scale: 8.0 2023-06-24 02:36:03,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1636002.0, ans=0.0 2023-06-24 02:36:42,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1636122.0, ans=0.125 2023-06-24 02:37:22,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1636182.0, ans=0.125 2023-06-24 02:37:25,546 INFO [train.py:996] (0/4) Epoch 9, batch 28750, loss[loss=0.2255, simple_loss=0.2951, pruned_loss=0.07799, over 21934.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3117, pruned_loss=0.08506, over 4279853.40 frames. ], batch size: 316, lr: 3.20e-03, grad_scale: 8.0 2023-06-24 02:37:28,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.971e+02 6.417e+02 8.454e+02 1.129e+03 2.571e+03, threshold=1.691e+03, percent-clipped=6.0 2023-06-24 02:37:29,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1636242.0, ans=0.125 2023-06-24 02:37:57,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1636302.0, ans=0.125 2023-06-24 02:38:12,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1636362.0, ans=0.0 2023-06-24 02:38:22,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1636422.0, ans=0.125 2023-06-24 02:38:50,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636482.0, ans=0.1 2023-06-24 02:39:04,500 INFO [train.py:996] (0/4) Epoch 9, batch 28800, loss[loss=0.3153, simple_loss=0.3748, pruned_loss=0.1279, over 21384.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3151, pruned_loss=0.08529, over 4277541.94 frames. ], batch size: 159, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:39:05,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1636542.0, ans=0.125 2023-06-24 02:39:07,145 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-24 02:39:43,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1636662.0, ans=0.125 2023-06-24 02:39:45,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1636662.0, ans=0.125 2023-06-24 02:39:48,556 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:40:28,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636782.0, ans=0.1 2023-06-24 02:40:43,045 INFO [train.py:996] (0/4) Epoch 9, batch 28850, loss[loss=0.2537, simple_loss=0.3104, pruned_loss=0.09844, over 21843.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3165, pruned_loss=0.08707, over 4284501.85 frames. 
], batch size: 247, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:40:45,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1636842.0, ans=0.125 2023-06-24 02:40:46,365 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.643e+02 7.030e+02 9.281e+02 1.224e+03 2.045e+03, threshold=1.856e+03, percent-clipped=4.0 2023-06-24 02:41:12,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1636902.0, ans=0.125 2023-06-24 02:42:23,115 INFO [train.py:996] (0/4) Epoch 9, batch 28900, loss[loss=0.3655, simple_loss=0.4119, pruned_loss=0.1595, over 21448.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3216, pruned_loss=0.08913, over 4281620.10 frames. ], batch size: 507, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:42:25,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1637142.0, ans=0.125 2023-06-24 02:42:51,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1637202.0, ans=0.09899494936611666 2023-06-24 02:42:52,107 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-24 02:43:07,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1637262.0, ans=0.125 2023-06-24 02:43:35,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1637322.0, ans=0.0 2023-06-24 02:44:07,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1637442.0, ans=0.0 2023-06-24 02:44:08,031 INFO [train.py:996] (0/4) Epoch 9, batch 28950, loss[loss=0.2287, simple_loss=0.3249, pruned_loss=0.06627, over 21767.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3224, pruned_loss=0.088, over 4277877.29 frames. ], batch size: 332, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:44:10,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1637442.0, ans=0.0 2023-06-24 02:44:11,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.380e+02 7.524e+02 1.128e+03 1.793e+03 3.083e+03, threshold=2.257e+03, percent-clipped=23.0 2023-06-24 02:44:26,171 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.04 vs. limit=10.0 2023-06-24 02:44:57,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1637562.0, ans=0.0 2023-06-24 02:45:47,870 INFO [train.py:996] (0/4) Epoch 9, batch 29000, loss[loss=0.2835, simple_loss=0.3475, pruned_loss=0.1097, over 21253.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3235, pruned_loss=0.08646, over 4275669.14 frames. 
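Each ScheduledFloat line above reports the current value (`ans`) of a hyperparameter, dropout probability, skip rate, balancer prob, and so on, that is looked up from a schedule keyed on batch_count. A minimal sketch of such a batch-count-indexed schedule follows; the piecewise-linear form and the breakpoints are illustrative assumptions, not the schedules actually attached to the parameters named in the log.

class ScheduledValue:
    """A float that follows a piecewise-linear schedule over batch_count.

    Sketch of the idea behind `ScheduledFloat: name=..., batch_count=...,
    ans=...`: the value is re-evaluated from the schedule whenever it is used.
    Breakpoints below are made up for illustration.
    """

    def __init__(self, *points):
        # points: (batch_count, value) pairs
        self.points = sorted(points)

    def value(self, batch_count):
        b0, v0 = self.points[0]
        if batch_count <= b0:
            return v0
        for b1, v1 in self.points[1:]:
            if batch_count <= b1:
                t = (batch_count - b0) / (b1 - b0)
                return v0 + t * (v1 - v0)
            b0, v0 = b1, v1
        return v0  # past the last breakpoint: hold the final value

dropout_schedule = ScheduledValue((0.0, 0.5), (20000.0, 0.1))
print(dropout_schedule.value(1630962.0))  # -> 0.1, the `ans` seen at large batch_count
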
], batch size: 143, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:46:21,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1637802.0, ans=0.125 2023-06-24 02:46:44,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1637862.0, ans=0.1 2023-06-24 02:46:58,711 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-24 02:47:23,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1637982.0, ans=0.0 2023-06-24 02:47:32,703 INFO [train.py:996] (0/4) Epoch 9, batch 29050, loss[loss=0.2387, simple_loss=0.3089, pruned_loss=0.08425, over 21885.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3227, pruned_loss=0.0874, over 4284276.14 frames. ], batch size: 414, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:47:40,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 6.530e+02 1.098e+03 1.738e+03 3.592e+03, threshold=2.195e+03, percent-clipped=7.0 2023-06-24 02:48:17,856 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-24 02:48:36,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1638222.0, ans=0.0 2023-06-24 02:48:40,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1638222.0, ans=0.125 2023-06-24 02:48:45,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-24 02:48:53,301 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-06-24 02:49:06,394 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-24 02:49:11,751 INFO [train.py:996] (0/4) Epoch 9, batch 29100, loss[loss=0.1892, simple_loss=0.2495, pruned_loss=0.06441, over 21189.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3135, pruned_loss=0.08493, over 4284306.24 frames. ], batch size: 159, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:49:12,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1638342.0, ans=0.2 2023-06-24 02:49:13,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1638342.0, ans=0.0 2023-06-24 02:49:13,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1638342.0, ans=0.0 2023-06-24 02:49:35,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1638402.0, ans=0.2 2023-06-24 02:49:35,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1638402.0, ans=0.07 2023-06-24 02:49:42,826 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.22 vs. 
limit=22.5 2023-06-24 02:49:56,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1638462.0, ans=0.0 2023-06-24 02:50:17,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1638522.0, ans=0.1 2023-06-24 02:50:54,252 INFO [train.py:996] (0/4) Epoch 9, batch 29150, loss[loss=0.2746, simple_loss=0.3534, pruned_loss=0.09791, over 21698.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3113, pruned_loss=0.08344, over 4287419.60 frames. ], batch size: 332, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:50:57,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.333e+02 5.735e+02 8.237e+02 1.411e+03 3.649e+03, threshold=1.647e+03, percent-clipped=7.0 2023-06-24 02:51:03,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1638642.0, ans=0.0 2023-06-24 02:51:03,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1638642.0, ans=0.125 2023-06-24 02:51:55,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1638822.0, ans=0.0 2023-06-24 02:52:32,400 INFO [train.py:996] (0/4) Epoch 9, batch 29200, loss[loss=0.2138, simple_loss=0.2833, pruned_loss=0.07216, over 21571.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3076, pruned_loss=0.08318, over 4283258.02 frames. ], batch size: 414, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:52:58,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1639002.0, ans=0.2 2023-06-24 02:53:13,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1639062.0, ans=0.125 2023-06-24 02:53:49,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-24 02:54:06,545 INFO [train.py:996] (0/4) Epoch 9, batch 29250, loss[loss=0.2316, simple_loss=0.2999, pruned_loss=0.08166, over 21563.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3078, pruned_loss=0.08199, over 4284910.58 frames. 
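The Whitening lines above compare a measured statistic of a module's activations (`metric`) against a scheduled `limit`; the whitening penalty only becomes active when the metric exceeds the limit. The function below computes one plausible such statistic, the eigenvalue spread of the channel covariance, purely as an illustration: the actual metric used by the model code is assumed, not reproduced from it.

import torch

def whitening_metric(x: torch.Tensor) -> float:
    """Anisotropy of the channel covariance, for illustration only.

    Returns num_channels * sum(eig^2) / sum(eig)^2, which is 1.0 for
    perfectly white features and grows toward num_channels as the features
    collapse onto fewer directions.  Assumed stand-in for the logged metric.
    """
    x = x.reshape(-1, x.shape[-1])              # (frames, num_channels)
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]              # channel covariance
    eigs = torch.linalg.eigvalsh(cov).clamp(min=0)
    return float(x.shape[-1] * (eigs ** 2).sum() / (eigs.sum() ** 2 + 1e-20))

feats = torch.randn(10000, 256)                 # roughly white -> metric close to 1
print(whitening_metric(feats))
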
], batch size: 263, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:54:09,746 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.813e+02 6.323e+02 1.080e+03 1.364e+03 2.361e+03, threshold=2.161e+03, percent-clipped=10.0 2023-06-24 02:54:14,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1639242.0, ans=0.125 2023-06-24 02:54:36,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1639302.0, ans=0.125 2023-06-24 02:54:38,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1639302.0, ans=0.125 2023-06-24 02:55:02,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1639362.0, ans=0.04949747468305833 2023-06-24 02:55:22,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1639422.0, ans=0.125 2023-06-24 02:55:45,682 INFO [train.py:996] (0/4) Epoch 9, batch 29300, loss[loss=0.2338, simple_loss=0.3031, pruned_loss=0.08222, over 21757.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3085, pruned_loss=0.08046, over 4269482.69 frames. ], batch size: 351, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:55:46,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1639542.0, ans=0.07 2023-06-24 02:56:01,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1639542.0, ans=0.125 2023-06-24 02:56:03,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1639542.0, ans=0.2 2023-06-24 02:56:15,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1639602.0, ans=0.0 2023-06-24 02:57:08,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1639782.0, ans=0.1 2023-06-24 02:57:20,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1639842.0, ans=0.125 2023-06-24 02:57:21,361 INFO [train.py:996] (0/4) Epoch 9, batch 29350, loss[loss=0.2461, simple_loss=0.3441, pruned_loss=0.07405, over 20912.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3047, pruned_loss=0.0795, over 4268115.13 frames. ], batch size: 609, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:57:29,484 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.886e+02 5.877e+02 8.410e+02 1.271e+03 3.253e+03, threshold=1.682e+03, percent-clipped=5.0 2023-06-24 02:57:59,536 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.21 vs. 
limit=15.0 2023-06-24 02:58:04,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1639962.0, ans=0.0 2023-06-24 02:58:37,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1640022.0, ans=0.0 2023-06-24 02:59:01,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1640142.0, ans=0.1 2023-06-24 02:59:02,525 INFO [train.py:996] (0/4) Epoch 9, batch 29400, loss[loss=0.1825, simple_loss=0.252, pruned_loss=0.05648, over 21601.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3035, pruned_loss=0.07714, over 4259426.16 frames. ], batch size: 195, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:59:09,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1640142.0, ans=0.125 2023-06-24 02:59:16,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=12.0 2023-06-24 02:59:26,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1640202.0, ans=0.0 2023-06-24 02:59:40,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1640262.0, ans=0.05 2023-06-24 02:59:49,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1640262.0, ans=0.0 2023-06-24 03:00:30,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1640382.0, ans=0.125 2023-06-24 03:00:38,203 INFO [train.py:996] (0/4) Epoch 9, batch 29450, loss[loss=0.1894, simple_loss=0.293, pruned_loss=0.04291, over 20897.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3036, pruned_loss=0.07621, over 4260521.11 frames. ], batch size: 609, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:00:43,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.242e+02 8.018e+02 1.543e+03 2.395e+03 4.126e+03, threshold=3.085e+03, percent-clipped=41.0 2023-06-24 03:00:58,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1640502.0, ans=0.5 2023-06-24 03:02:01,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.00 vs. limit=15.0 2023-06-24 03:02:13,465 INFO [train.py:996] (0/4) Epoch 9, batch 29500, loss[loss=0.2208, simple_loss=0.2892, pruned_loss=0.07618, over 21809.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3083, pruned_loss=0.07986, over 4264423.51 frames. ], batch size: 112, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:02:33,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-24 03:02:39,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1640802.0, ans=0.125 2023-06-24 03:03:37,246 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.72 vs. 
limit=6.0 2023-06-24 03:03:52,450 INFO [train.py:996] (0/4) Epoch 9, batch 29550, loss[loss=0.2119, simple_loss=0.2636, pruned_loss=0.08009, over 20312.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3067, pruned_loss=0.08108, over 4275410.71 frames. ], batch size: 703, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:03:57,067 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.109e+02 5.276e+02 6.489e+02 8.099e+02 1.842e+03, threshold=1.298e+03, percent-clipped=0.0 2023-06-24 03:05:16,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1641282.0, ans=0.125 2023-06-24 03:05:33,547 INFO [train.py:996] (0/4) Epoch 9, batch 29600, loss[loss=0.2506, simple_loss=0.3353, pruned_loss=0.08289, over 21715.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3128, pruned_loss=0.083, over 4278972.06 frames. ], batch size: 247, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 03:05:34,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1641342.0, ans=0.1 2023-06-24 03:05:40,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1641342.0, ans=0.125 2023-06-24 03:05:53,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1641402.0, ans=0.125 2023-06-24 03:06:43,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1641522.0, ans=0.125 2023-06-24 03:07:16,572 INFO [train.py:996] (0/4) Epoch 9, batch 29650, loss[loss=0.2914, simple_loss=0.3464, pruned_loss=0.1182, over 21658.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3129, pruned_loss=0.0807, over 4270145.41 frames. ], batch size: 473, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 03:07:21,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.535e+02 6.438e+02 9.787e+02 1.351e+03 2.800e+03, threshold=1.957e+03, percent-clipped=29.0 2023-06-24 03:07:35,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1641702.0, ans=0.0 2023-06-24 03:07:41,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1641702.0, ans=0.1 2023-06-24 03:07:59,320 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-06-24 03:08:56,656 INFO [train.py:996] (0/4) Epoch 9, batch 29700, loss[loss=0.2348, simple_loss=0.3229, pruned_loss=0.07335, over 21415.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3119, pruned_loss=0.08059, over 4274087.78 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:09:09,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-24 03:09:29,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1642002.0, ans=0.125 2023-06-24 03:09:46,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. 
limit=15.0 2023-06-24 03:09:47,950 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-24 03:10:24,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1642182.0, ans=0.1 2023-06-24 03:10:24,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1642182.0, ans=0.04949747468305833 2023-06-24 03:10:34,581 INFO [train.py:996] (0/4) Epoch 9, batch 29750, loss[loss=0.2489, simple_loss=0.3336, pruned_loss=0.08212, over 21666.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3166, pruned_loss=0.08028, over 4274590.11 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:10:35,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1642242.0, ans=0.0 2023-06-24 03:10:40,757 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.742e+02 5.529e+02 6.914e+02 9.553e+02 2.350e+03, threshold=1.383e+03, percent-clipped=6.0 2023-06-24 03:10:51,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1642302.0, ans=0.1 2023-06-24 03:11:10,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1642362.0, ans=0.1 2023-06-24 03:11:23,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-24 03:11:29,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1642362.0, ans=0.125 2023-06-24 03:12:13,544 INFO [train.py:996] (0/4) Epoch 9, batch 29800, loss[loss=0.2615, simple_loss=0.3623, pruned_loss=0.08039, over 19840.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3181, pruned_loss=0.08053, over 4274167.18 frames. ], batch size: 703, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:12:14,485 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-24 03:12:30,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=22.5 2023-06-24 03:12:43,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1642602.0, ans=10.0 2023-06-24 03:12:44,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.83 vs. limit=10.0 2023-06-24 03:13:34,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1642782.0, ans=0.0 2023-06-24 03:13:51,071 INFO [train.py:996] (0/4) Epoch 9, batch 29850, loss[loss=0.1995, simple_loss=0.2803, pruned_loss=0.0593, over 21804.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3145, pruned_loss=0.07824, over 4273013.41 frames. 
], batch size: 298, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:13:57,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.819e+02 7.548e+02 1.159e+03 1.635e+03 3.345e+03, threshold=2.317e+03, percent-clipped=36.0 2023-06-24 03:14:01,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1642842.0, ans=0.0 2023-06-24 03:14:59,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1643022.0, ans=0.1 2023-06-24 03:15:24,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1643082.0, ans=0.0 2023-06-24 03:15:29,150 INFO [train.py:996] (0/4) Epoch 9, batch 29900, loss[loss=0.3093, simple_loss=0.3569, pruned_loss=0.1309, over 21628.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3122, pruned_loss=0.07937, over 4279383.69 frames. ], batch size: 471, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:15:32,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1643142.0, ans=0.0 2023-06-24 03:15:40,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1643142.0, ans=0.0 2023-06-24 03:15:51,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1643202.0, ans=0.125 2023-06-24 03:16:33,337 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:16:38,627 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-24 03:17:08,343 INFO [train.py:996] (0/4) Epoch 9, batch 29950, loss[loss=0.2475, simple_loss=0.3156, pruned_loss=0.08969, over 21771.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3145, pruned_loss=0.08265, over 4283192.10 frames. ], batch size: 332, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:17:19,369 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.107e+02 5.748e+02 7.806e+02 1.232e+03 2.482e+03, threshold=1.561e+03, percent-clipped=2.0 2023-06-24 03:17:50,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1643502.0, ans=0.125 2023-06-24 03:18:04,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1643562.0, ans=0.0 2023-06-24 03:18:35,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1643682.0, ans=0.125 2023-06-24 03:18:54,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=14.28 vs. limit=15.0 2023-06-24 03:19:00,256 INFO [train.py:996] (0/4) Epoch 9, batch 30000, loss[loss=0.1417, simple_loss=0.2053, pruned_loss=0.03901, over 16825.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3155, pruned_loss=0.08267, over 4277936.99 frames. ], batch size: 61, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 03:19:00,257 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 03:19:17,058 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2502, simple_loss=0.3471, pruned_loss=0.07663, over 1796401.00 frames. 
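The "over N frames" annotations, both on the running tot_loss figures and on the validation line just above, suggest a frame-weighted average: each batch contributes in proportion to its frame count rather than equally. A minimal sketch of that aggregation follows, with made-up numbers; the weighting scheme is inferred from the annotation, and how often the real running totals are reset is not shown in this part of the log.

def frame_weighted_average(batch_stats):
    """Aggregate per-batch losses into a single 'loss over N frames' figure.

    Sketch of the presumed bookkeeping: every batch is weighted by its
    number of frames.  Inputs here are illustrative, not taken from the log.
    """
    total_loss = 0.0
    total_frames = 0.0
    for loss, num_frames in batch_stats:
        total_loss += loss * num_frames
        total_frames += num_frames
    return total_loss / total_frames, total_frames

avg, frames = frame_weighted_average([(0.251, 21767.0), (0.243, 17003.0)])
print(f"loss={avg:.4f}, over {frames:.2f} frames")
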
2023-06-24 03:19:17,059 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 03:19:42,985 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.13 vs. limit=15.0 2023-06-24 03:20:39,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.65 vs. limit=6.0 2023-06-24 03:20:48,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-24 03:21:04,968 INFO [train.py:996] (0/4) Epoch 9, batch 30050, loss[loss=0.1834, simple_loss=0.2619, pruned_loss=0.05242, over 21840.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3176, pruned_loss=0.07909, over 4273602.15 frames. ], batch size: 118, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 03:21:06,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1644042.0, ans=0.1 2023-06-24 03:21:11,500 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.143e+02 7.535e+02 1.024e+03 1.337e+03 2.624e+03, threshold=2.049e+03, percent-clipped=15.0 2023-06-24 03:22:07,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1644222.0, ans=0.2 2023-06-24 03:22:44,515 INFO [train.py:996] (0/4) Epoch 9, batch 30100, loss[loss=0.226, simple_loss=0.2859, pruned_loss=0.08304, over 21717.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3185, pruned_loss=0.07974, over 4278879.18 frames. ], batch size: 299, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:23:38,286 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-24 03:23:51,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1644522.0, ans=0.125 2023-06-24 03:24:29,663 INFO [train.py:996] (0/4) Epoch 9, batch 30150, loss[loss=0.2552, simple_loss=0.3195, pruned_loss=0.09545, over 21349.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3157, pruned_loss=0.08145, over 4279607.90 frames. ], batch size: 176, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:24:38,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.432e+02 6.683e+02 1.059e+03 1.463e+03 4.541e+03, threshold=2.119e+03, percent-clipped=12.0 2023-06-24 03:24:55,675 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:24:55,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1644702.0, ans=0.125 2023-06-24 03:25:50,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1644882.0, ans=0.125 2023-06-24 03:26:12,276 INFO [train.py:996] (0/4) Epoch 9, batch 30200, loss[loss=0.2275, simple_loss=0.3283, pruned_loss=0.06339, over 21322.00 frames. ], tot_loss[loss=0.239, simple_loss=0.317, pruned_loss=0.08051, over 4271378.67 frames. 
], batch size: 549, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:26:12,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1644942.0, ans=0.0 2023-06-24 03:26:44,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-24 03:26:56,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1645002.0, ans=0.125 2023-06-24 03:27:00,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1645062.0, ans=0.1 2023-06-24 03:27:03,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-24 03:27:11,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1645062.0, ans=0.1 2023-06-24 03:27:57,818 INFO [train.py:996] (0/4) Epoch 9, batch 30250, loss[loss=0.2557, simple_loss=0.3384, pruned_loss=0.08649, over 21787.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3263, pruned_loss=0.08309, over 4273921.64 frames. ], batch size: 124, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:28:05,422 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.017e+02 5.727e+02 7.436e+02 1.048e+03 2.592e+03, threshold=1.487e+03, percent-clipped=2.0 2023-06-24 03:29:13,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1645422.0, ans=0.0 2023-06-24 03:29:36,905 INFO [train.py:996] (0/4) Epoch 9, batch 30300, loss[loss=0.1853, simple_loss=0.251, pruned_loss=0.05979, over 21208.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3242, pruned_loss=0.08291, over 4274710.96 frames. ], batch size: 549, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:30:22,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-24 03:30:42,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-06-24 03:31:06,041 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.73 vs. limit=15.0 2023-06-24 03:31:28,069 INFO [train.py:996] (0/4) Epoch 9, batch 30350, loss[loss=0.2589, simple_loss=0.374, pruned_loss=0.07186, over 20712.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3242, pruned_loss=0.08461, over 4270536.84 frames. ], batch size: 607, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:31:36,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.062e+02 6.729e+02 9.654e+02 1.457e+03 3.930e+03, threshold=1.931e+03, percent-clipped=23.0 2023-06-24 03:31:49,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.03 vs. 
limit=6.0 2023-06-24 03:32:11,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1646022.0, ans=0.125 2023-06-24 03:32:33,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1646082.0, ans=0.125 2023-06-24 03:32:47,132 INFO [train.py:996] (0/4) Epoch 9, batch 30400, loss[loss=0.2085, simple_loss=0.2651, pruned_loss=0.07596, over 20189.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3202, pruned_loss=0.08321, over 4251728.59 frames. ], batch size: 703, lr: 3.19e-03, grad_scale: 32.0 2023-06-24 03:32:55,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1646142.0, ans=0.0 2023-06-24 03:33:16,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1646202.0, ans=0.025 2023-06-24 03:33:19,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1646262.0, ans=0.2 2023-06-24 03:33:51,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1646322.0, ans=0.0 2023-06-24 03:34:08,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1646382.0, ans=0.2 2023-06-24 03:34:12,622 INFO [train.py:996] (0/4) Epoch 9, batch 30450, loss[loss=0.2947, simple_loss=0.3914, pruned_loss=0.09904, over 19854.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.32, pruned_loss=0.08183, over 4194192.06 frames. ], batch size: 702, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:34:16,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1646442.0, ans=0.125 2023-06-24 03:34:21,431 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 7.756e+02 1.127e+03 2.078e+03 9.482e+03, threshold=2.254e+03, percent-clipped=27.0 2023-06-24 03:35:22,636 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/epoch-9.pt 2023-06-24 03:36:54,769 INFO [train.py:996] (0/4) Epoch 10, batch 0, loss[loss=0.1953, simple_loss=0.2651, pruned_loss=0.06271, over 21613.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2651, pruned_loss=0.06271, over 21613.00 frames. ], batch size: 264, lr: 3.02e-03, grad_scale: 32.0 2023-06-24 03:36:54,771 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 03:37:08,078 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.2014, 4.4602, 2.5430, 2.0186], device='cuda:0') 2023-06-24 03:37:10,565 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2396, simple_loss=0.3488, pruned_loss=0.06521, over 1796401.00 frames. 2023-06-24 03:37:10,565 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 03:37:51,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1646772.0, ans=0.125 2023-06-24 03:37:53,672 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.44 vs. 
limit=15.0 2023-06-24 03:38:19,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1646892.0, ans=0.125 2023-06-24 03:38:49,138 INFO [train.py:996] (0/4) Epoch 10, batch 50, loss[loss=0.3839, simple_loss=0.4329, pruned_loss=0.1675, over 21349.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3245, pruned_loss=0.08895, over 953168.16 frames. ], batch size: 507, lr: 3.02e-03, grad_scale: 16.0 2023-06-24 03:39:13,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1647012.0, ans=0.125 2023-06-24 03:39:18,464 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.516e+02 8.479e+02 1.547e+03 2.623e+03 5.891e+03, threshold=3.095e+03, percent-clipped=28.0 2023-06-24 03:40:20,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1647252.0, ans=0.2 2023-06-24 03:40:29,214 INFO [train.py:996] (0/4) Epoch 10, batch 100, loss[loss=0.2793, simple_loss=0.383, pruned_loss=0.08779, over 21724.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3415, pruned_loss=0.08786, over 1691345.75 frames. ], batch size: 389, lr: 3.02e-03, grad_scale: 16.0 2023-06-24 03:41:30,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1647492.0, ans=0.0 2023-06-24 03:41:53,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1647552.0, ans=0.125 2023-06-24 03:42:05,184 INFO [train.py:996] (0/4) Epoch 10, batch 150, loss[loss=0.2403, simple_loss=0.341, pruned_loss=0.06975, over 21803.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.341, pruned_loss=0.08613, over 2261144.42 frames. ], batch size: 371, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:42:39,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.031e+02 6.057e+02 8.805e+02 1.461e+03 2.839e+03, threshold=1.761e+03, percent-clipped=0.0 2023-06-24 03:43:01,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1647732.0, ans=0.125 2023-06-24 03:43:20,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1647792.0, ans=0.1 2023-06-24 03:43:46,538 INFO [train.py:996] (0/4) Epoch 10, batch 200, loss[loss=0.222, simple_loss=0.2971, pruned_loss=0.07347, over 21902.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3358, pruned_loss=0.08371, over 2701496.99 frames. ], batch size: 98, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:44:27,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1648032.0, ans=0.0 2023-06-24 03:45:18,472 INFO [train.py:996] (0/4) Epoch 10, batch 250, loss[loss=0.2382, simple_loss=0.309, pruned_loss=0.08371, over 21850.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3281, pruned_loss=0.08356, over 3057664.89 frames. 
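The attn_weights_entropy diagnostic printed at the start of epoch 10 above (a tensor of four values, presumably one per attention head of that layer) measures how spread out each head's attention distribution is. A sketch of how such a per-head entropy can be computed follows; the (heads, queries, keys) layout and the averaging over queries are assumptions rather than details read from zipformer.py.

import torch

def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Average entropy of attention weights, one value per head.

    Assumes attn has shape (num_heads, num_queries, num_keys) and sums to 1
    over the last dimension; entropy is computed per query and averaged.
    """
    eps = 1e-20
    ent = -(attn * (attn + eps).log()).sum(dim=-1)   # entropy per (head, query)
    return ent.mean(dim=-1)                           # average over queries

weights = torch.softmax(torch.randn(4, 10, 50), dim=-1)  # 4 illustrative heads
print(attn_weights_entropy(weights))
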
], batch size: 332, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:45:48,956 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.127e+02 6.491e+02 8.586e+02 1.362e+03 2.608e+03, threshold=1.717e+03, percent-clipped=13.0 2023-06-24 03:46:29,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1648392.0, ans=0.0 2023-06-24 03:46:36,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1648452.0, ans=0.0 2023-06-24 03:46:46,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=22.5 2023-06-24 03:46:58,121 INFO [train.py:996] (0/4) Epoch 10, batch 300, loss[loss=0.2181, simple_loss=0.2843, pruned_loss=0.07592, over 21274.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3234, pruned_loss=0.08354, over 3316763.97 frames. ], batch size: 143, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:47:33,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1648572.0, ans=0.125 2023-06-24 03:48:18,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1648752.0, ans=0.0 2023-06-24 03:48:26,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1648752.0, ans=0.125 2023-06-24 03:48:29,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1648812.0, ans=0.125 2023-06-24 03:48:34,871 INFO [train.py:996] (0/4) Epoch 10, batch 350, loss[loss=0.2326, simple_loss=0.3057, pruned_loss=0.07974, over 21931.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3161, pruned_loss=0.08301, over 3530871.87 frames. ], batch size: 351, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:49:06,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.104e+02 6.936e+02 9.570e+02 1.355e+03 2.301e+03, threshold=1.914e+03, percent-clipped=7.0 2023-06-24 03:50:03,510 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-06-24 03:50:04,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1649052.0, ans=0.125 2023-06-24 03:50:10,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1649052.0, ans=0.2 2023-06-24 03:50:11,411 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.22 vs. limit=22.5 2023-06-24 03:50:14,774 INFO [train.py:996] (0/4) Epoch 10, batch 400, loss[loss=0.2296, simple_loss=0.352, pruned_loss=0.0536, over 19917.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3098, pruned_loss=0.08054, over 3683298.04 frames. ], batch size: 703, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:51:16,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.61 vs. limit=15.0 2023-06-24 03:51:21,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.71 vs. 
limit=15.0 2023-06-24 03:51:29,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1649292.0, ans=0.125 2023-06-24 03:51:36,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-24 03:51:57,934 INFO [train.py:996] (0/4) Epoch 10, batch 450, loss[loss=0.2747, simple_loss=0.3866, pruned_loss=0.0814, over 21746.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.31, pruned_loss=0.07945, over 3810859.59 frames. ], batch size: 414, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:51:58,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1649412.0, ans=0.125 2023-06-24 03:52:00,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1649412.0, ans=0.5 2023-06-24 03:52:22,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.065e+02 6.467e+02 1.066e+03 1.544e+03 3.388e+03, threshold=2.132e+03, percent-clipped=13.0 2023-06-24 03:52:52,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1649592.0, ans=0.1 2023-06-24 03:53:05,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1649592.0, ans=0.125 2023-06-24 03:53:17,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1649652.0, ans=0.2 2023-06-24 03:53:29,656 INFO [train.py:996] (0/4) Epoch 10, batch 500, loss[loss=0.2359, simple_loss=0.3036, pruned_loss=0.0841, over 19930.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.31, pruned_loss=0.07806, over 3918695.42 frames. ], batch size: 704, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:55:07,338 INFO [train.py:996] (0/4) Epoch 10, batch 550, loss[loss=0.1895, simple_loss=0.2699, pruned_loss=0.05454, over 21685.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3115, pruned_loss=0.07792, over 3996460.04 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:55:17,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1650012.0, ans=0.1 2023-06-24 03:55:32,659 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.909e+02 8.917e+02 1.249e+03 2.003e+03 3.580e+03, threshold=2.497e+03, percent-clipped=21.0 2023-06-24 03:55:36,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1650072.0, ans=0.125 2023-06-24 03:56:20,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1650252.0, ans=0.125 2023-06-24 03:56:26,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1650252.0, ans=0.125 2023-06-24 03:56:40,152 INFO [train.py:996] (0/4) Epoch 10, batch 600, loss[loss=0.251, simple_loss=0.3367, pruned_loss=0.08265, over 21793.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3145, pruned_loss=0.07884, over 4059694.86 frames. 
], batch size: 371, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:56:46,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1650312.0, ans=0.0 2023-06-24 03:56:49,801 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:57:42,420 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:57:44,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-24 03:57:50,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1650492.0, ans=0.0 2023-06-24 03:58:01,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1650552.0, ans=0.125 2023-06-24 03:58:13,514 INFO [train.py:996] (0/4) Epoch 10, batch 650, loss[loss=0.2134, simple_loss=0.2897, pruned_loss=0.06861, over 21921.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.315, pruned_loss=0.0791, over 4114283.87 frames. ], batch size: 113, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:58:13,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1650612.0, ans=10.0 2023-06-24 03:58:20,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1650612.0, ans=0.125 2023-06-24 03:58:44,454 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.130e+02 7.027e+02 1.084e+03 1.748e+03 3.374e+03, threshold=2.167e+03, percent-clipped=5.0 2023-06-24 03:58:56,727 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-24 03:59:02,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1650732.0, ans=0.1 2023-06-24 03:59:33,327 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:59:34,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1650852.0, ans=0.125 2023-06-24 03:59:37,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1650852.0, ans=0.125 2023-06-24 03:59:45,397 INFO [train.py:996] (0/4) Epoch 10, batch 700, loss[loss=0.2062, simple_loss=0.2769, pruned_loss=0.06774, over 21636.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3176, pruned_loss=0.08063, over 4159030.71 frames. 
], batch size: 230, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 04:00:22,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1650972.0, ans=0.125 2023-06-24 04:00:37,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1651032.0, ans=0.0 2023-06-24 04:00:48,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1651032.0, ans=0.04949747468305833 2023-06-24 04:00:52,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1651092.0, ans=0.0 2023-06-24 04:01:27,487 INFO [train.py:996] (0/4) Epoch 10, batch 750, loss[loss=0.2296, simple_loss=0.2896, pruned_loss=0.08477, over 15345.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.314, pruned_loss=0.08061, over 4188324.62 frames. ], batch size: 63, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 04:01:53,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.183e+02 6.075e+02 9.989e+02 1.388e+03 3.247e+03, threshold=1.998e+03, percent-clipped=7.0 2023-06-24 04:02:08,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1651332.0, ans=0.1 2023-06-24 04:02:22,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1651392.0, ans=0.09899494936611666 2023-06-24 04:02:38,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1651452.0, ans=0.0 2023-06-24 04:02:40,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1651452.0, ans=0.1 2023-06-24 04:03:01,151 INFO [train.py:996] (0/4) Epoch 10, batch 800, loss[loss=0.2489, simple_loss=0.3141, pruned_loss=0.09182, over 21345.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3132, pruned_loss=0.08107, over 4203058.39 frames. ], batch size: 159, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:03:03,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1651512.0, ans=0.0 2023-06-24 04:04:36,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1651752.0, ans=0.125 2023-06-24 04:04:39,121 INFO [train.py:996] (0/4) Epoch 10, batch 850, loss[loss=0.279, simple_loss=0.3313, pruned_loss=0.1134, over 21761.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3105, pruned_loss=0.08079, over 4224042.95 frames. ], batch size: 508, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:04:44,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1651812.0, ans=0.2 2023-06-24 04:04:44,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.45 vs. 
limit=22.5 2023-06-24 04:05:10,510 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 6.435e+02 1.007e+03 1.415e+03 2.798e+03, threshold=2.014e+03, percent-clipped=8.0 2023-06-24 04:05:27,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1651932.0, ans=0.0 2023-06-24 04:06:13,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1652052.0, ans=0.125 2023-06-24 04:06:16,307 INFO [train.py:996] (0/4) Epoch 10, batch 900, loss[loss=0.2961, simple_loss=0.3493, pruned_loss=0.1214, over 21778.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3086, pruned_loss=0.08077, over 4239233.60 frames. ], batch size: 441, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:06:21,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.45 vs. limit=6.0 2023-06-24 04:06:40,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1652172.0, ans=0.125 2023-06-24 04:06:45,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1652172.0, ans=0.125 2023-06-24 04:07:27,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1652292.0, ans=0.125 2023-06-24 04:07:36,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1652292.0, ans=0.125 2023-06-24 04:07:39,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=12.0 2023-06-24 04:07:50,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1652352.0, ans=0.125 2023-06-24 04:08:04,390 INFO [train.py:996] (0/4) Epoch 10, batch 950, loss[loss=0.1848, simple_loss=0.2763, pruned_loss=0.04666, over 21624.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3066, pruned_loss=0.08096, over 4250372.15 frames. ], batch size: 230, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:08:17,294 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.91 vs. limit=15.0 2023-06-24 04:08:27,076 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.827e+02 5.619e+02 7.658e+02 1.220e+03 3.060e+03, threshold=1.532e+03, percent-clipped=1.0 2023-06-24 04:08:48,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1652532.0, ans=0.125 2023-06-24 04:09:39,806 INFO [train.py:996] (0/4) Epoch 10, batch 1000, loss[loss=0.2625, simple_loss=0.3398, pruned_loss=0.09259, over 21539.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3074, pruned_loss=0.08107, over 4262561.90 frames. ], batch size: 507, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:09:42,806 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.64 vs. 
limit=15.0 2023-06-24 04:10:21,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1652832.0, ans=0.0 2023-06-24 04:10:49,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=15.0 2023-06-24 04:11:07,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1652952.0, ans=0.0 2023-06-24 04:11:19,956 INFO [train.py:996] (0/4) Epoch 10, batch 1050, loss[loss=0.1745, simple_loss=0.2455, pruned_loss=0.05178, over 16043.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3089, pruned_loss=0.08127, over 4261661.46 frames. ], batch size: 60, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:11:29,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1653012.0, ans=0.1 2023-06-24 04:11:46,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.344e+02 8.659e+02 1.096e+03 1.679e+03 3.356e+03, threshold=2.191e+03, percent-clipped=32.0 2023-06-24 04:12:49,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1653252.0, ans=0.0 2023-06-24 04:12:53,900 INFO [train.py:996] (0/4) Epoch 10, batch 1100, loss[loss=0.194, simple_loss=0.2626, pruned_loss=0.06272, over 21260.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3067, pruned_loss=0.07907, over 4269704.04 frames. ], batch size: 608, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:13:18,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1653372.0, ans=0.2 2023-06-24 04:13:21,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1653372.0, ans=0.0 2023-06-24 04:13:33,701 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-24 04:14:32,921 INFO [train.py:996] (0/4) Epoch 10, batch 1150, loss[loss=0.2286, simple_loss=0.3149, pruned_loss=0.07115, over 21842.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3074, pruned_loss=0.07926, over 4277854.54 frames. ], batch size: 332, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:15:00,228 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.939e+02 5.998e+02 8.488e+02 1.315e+03 2.677e+03, threshold=1.698e+03, percent-clipped=3.0 2023-06-24 04:15:50,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1653792.0, ans=0.2 2023-06-24 04:16:02,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0 2023-06-24 04:16:12,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1653852.0, ans=0.125 2023-06-24 04:16:16,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1653912.0, ans=0.125 2023-06-24 04:16:17,796 INFO [train.py:996] (0/4) Epoch 10, batch 1200, loss[loss=0.2854, simple_loss=0.3448, pruned_loss=0.113, over 21587.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3098, pruned_loss=0.07997, over 4286191.80 frames. 
], batch size: 471, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:16:20,301 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=15.0 2023-06-24 04:17:35,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1654152.0, ans=0.125 2023-06-24 04:17:55,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=12.0 2023-06-24 04:17:56,454 INFO [train.py:996] (0/4) Epoch 10, batch 1250, loss[loss=0.2331, simple_loss=0.3348, pruned_loss=0.06574, over 21686.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3123, pruned_loss=0.08009, over 4290651.51 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:18:19,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.265e+02 6.637e+02 9.533e+02 1.248e+03 2.697e+03, threshold=1.907e+03, percent-clipped=13.0 2023-06-24 04:18:28,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1654272.0, ans=0.1 2023-06-24 04:19:03,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=15.0 2023-06-24 04:19:35,646 INFO [train.py:996] (0/4) Epoch 10, batch 1300, loss[loss=0.3014, simple_loss=0.3615, pruned_loss=0.1207, over 21380.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.314, pruned_loss=0.0806, over 4294325.47 frames. ], batch size: 509, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:21:13,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1654812.0, ans=0.0 2023-06-24 04:21:14,473 INFO [train.py:996] (0/4) Epoch 10, batch 1350, loss[loss=0.2588, simple_loss=0.3206, pruned_loss=0.09851, over 21864.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3129, pruned_loss=0.08093, over 4288766.87 frames. 
], batch size: 371, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:21:43,006 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.930e+02 9.209e+02 1.385e+03 4.036e+03, threshold=1.842e+03, percent-clipped=12.0 2023-06-24 04:21:44,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1654872.0, ans=0.0 2023-06-24 04:21:48,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1654872.0, ans=0.0 2023-06-24 04:22:01,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1654932.0, ans=0.02 2023-06-24 04:22:15,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1654992.0, ans=0.125 2023-06-24 04:22:26,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1655052.0, ans=0.07 2023-06-24 04:22:35,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1655052.0, ans=0.125 2023-06-24 04:22:37,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1655052.0, ans=0.1 2023-06-24 04:22:49,022 INFO [train.py:996] (0/4) Epoch 10, batch 1400, loss[loss=0.2147, simple_loss=0.2969, pruned_loss=0.0662, over 21847.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.312, pruned_loss=0.08106, over 4291794.84 frames. ], batch size: 372, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:23:18,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1655172.0, ans=0.1 2023-06-24 04:23:31,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1655232.0, ans=0.2 2023-06-24 04:24:03,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1655292.0, ans=0.2 2023-06-24 04:24:28,487 INFO [train.py:996] (0/4) Epoch 10, batch 1450, loss[loss=0.2436, simple_loss=0.3143, pruned_loss=0.08642, over 21781.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3142, pruned_loss=0.08213, over 4292872.99 frames. ], batch size: 332, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:24:28,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1655412.0, ans=0.125 2023-06-24 04:24:56,577 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.216e+02 6.383e+02 1.021e+03 1.504e+03 2.934e+03, threshold=2.041e+03, percent-clipped=11.0 2023-06-24 04:25:54,450 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-06-24 04:25:59,927 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:26:07,465 INFO [train.py:996] (0/4) Epoch 10, batch 1500, loss[loss=0.2683, simple_loss=0.3579, pruned_loss=0.08942, over 21685.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3149, pruned_loss=0.08307, over 4297568.56 frames. 
], batch size: 441, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:26:10,285 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.40 vs. limit=15.0 2023-06-24 04:26:14,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1655712.0, ans=0.2 2023-06-24 04:26:43,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1655832.0, ans=0.1 2023-06-24 04:27:12,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1655892.0, ans=0.0 2023-06-24 04:27:32,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1655952.0, ans=0.0 2023-06-24 04:27:44,392 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-276000.pt 2023-06-24 04:27:50,601 INFO [train.py:996] (0/4) Epoch 10, batch 1550, loss[loss=0.2621, simple_loss=0.346, pruned_loss=0.0891, over 21794.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3119, pruned_loss=0.08217, over 4299818.14 frames. ], batch size: 316, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:28:24,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 5.706e+02 8.802e+02 1.256e+03 2.211e+03, threshold=1.760e+03, percent-clipped=1.0 2023-06-24 04:28:44,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1656132.0, ans=0.125 2023-06-24 04:28:56,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.03 vs. limit=12.0 2023-06-24 04:29:24,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.68 vs. limit=15.0 2023-06-24 04:29:35,792 INFO [train.py:996] (0/4) Epoch 10, batch 1600, loss[loss=0.2434, simple_loss=0.3165, pruned_loss=0.08517, over 21793.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3114, pruned_loss=0.08234, over 4299446.07 frames. ], batch size: 351, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:29:52,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1656372.0, ans=0.0 2023-06-24 04:30:13,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2023-06-24 04:30:44,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1656492.0, ans=0.0 2023-06-24 04:30:47,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1656492.0, ans=0.2 2023-06-24 04:30:51,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1656492.0, ans=0.015 2023-06-24 04:30:53,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1656552.0, ans=0.0 2023-06-24 04:31:15,795 INFO [train.py:996] (0/4) Epoch 10, batch 1650, loss[loss=0.2732, simple_loss=0.3465, pruned_loss=0.09995, over 21764.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3087, pruned_loss=0.08076, over 4289495.61 frames. 
], batch size: 332, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:31:26,427 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.73 vs. limit=10.0 2023-06-24 04:31:44,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.141e+02 9.207e+02 1.280e+03 2.509e+03, threshold=1.841e+03, percent-clipped=8.0 2023-06-24 04:32:04,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1656732.0, ans=0.125 2023-06-24 04:32:29,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1656792.0, ans=0.0 2023-06-24 04:32:51,714 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-24 04:32:57,173 INFO [train.py:996] (0/4) Epoch 10, batch 1700, loss[loss=0.2957, simple_loss=0.3609, pruned_loss=0.1153, over 21821.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3108, pruned_loss=0.08174, over 4282460.36 frames. ], batch size: 124, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:33:09,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1656912.0, ans=0.125 2023-06-24 04:33:20,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1656972.0, ans=0.0 2023-06-24 04:33:52,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-24 04:33:57,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1657032.0, ans=0.1 2023-06-24 04:34:00,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1657092.0, ans=0.1 2023-06-24 04:34:45,130 INFO [train.py:996] (0/4) Epoch 10, batch 1750, loss[loss=0.1812, simple_loss=0.2415, pruned_loss=0.06047, over 21151.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3115, pruned_loss=0.0803, over 4276657.97 frames. ], batch size: 143, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:35:00,919 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-24 04:35:05,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1657272.0, ans=0.0 2023-06-24 04:35:21,552 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.068e+02 6.335e+02 9.144e+02 1.525e+03 4.256e+03, threshold=1.829e+03, percent-clipped=17.0 2023-06-24 04:35:40,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5 2023-06-24 04:36:07,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1657452.0, ans=0.0 2023-06-24 04:36:32,814 INFO [train.py:996] (0/4) Epoch 10, batch 1800, loss[loss=0.2625, simple_loss=0.3573, pruned_loss=0.08386, over 19807.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.311, pruned_loss=0.07828, over 4275330.14 frames. 
], batch size: 703, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:36:42,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1657512.0, ans=0.0 2023-06-24 04:36:44,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1657512.0, ans=0.125 2023-06-24 04:36:53,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1657572.0, ans=0.125 2023-06-24 04:36:53,664 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.594e-03 2023-06-24 04:36:59,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1657572.0, ans=0.0 2023-06-24 04:37:19,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1657632.0, ans=0.0 2023-06-24 04:38:10,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1657752.0, ans=0.2 2023-06-24 04:38:12,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.57 vs. limit=5.0 2023-06-24 04:38:13,217 INFO [train.py:996] (0/4) Epoch 10, batch 1850, loss[loss=0.2441, simple_loss=0.3459, pruned_loss=0.07116, over 21622.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3134, pruned_loss=0.07713, over 4271253.40 frames. ], batch size: 389, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:38:28,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-24 04:38:43,375 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.154e+02 6.426e+02 1.042e+03 1.664e+03 4.444e+03, threshold=2.085e+03, percent-clipped=25.0 2023-06-24 04:38:46,067 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=15.0 2023-06-24 04:39:13,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1657992.0, ans=0.125 2023-06-24 04:39:20,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1657992.0, ans=0.125 2023-06-24 04:39:30,552 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:39:34,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-24 04:39:49,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1658052.0, ans=0.125 2023-06-24 04:39:49,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1658052.0, ans=0.025 2023-06-24 04:39:52,061 INFO [train.py:996] (0/4) Epoch 10, batch 1900, loss[loss=0.2161, simple_loss=0.308, pruned_loss=0.06209, over 21738.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3124, pruned_loss=0.07674, over 4270348.26 frames. 
], batch size: 298, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:39:55,813 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:40:43,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1658232.0, ans=0.125 2023-06-24 04:40:45,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. limit=10.0 2023-06-24 04:41:23,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1658352.0, ans=0.2 2023-06-24 04:41:31,886 INFO [train.py:996] (0/4) Epoch 10, batch 1950, loss[loss=0.185, simple_loss=0.2587, pruned_loss=0.05566, over 21283.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3105, pruned_loss=0.07736, over 4270035.53 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:42:02,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.830e+02 7.074e+02 9.115e+02 1.415e+03 2.823e+03, threshold=1.823e+03, percent-clipped=5.0 2023-06-24 04:42:32,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1658592.0, ans=0.04949747468305833 2023-06-24 04:42:45,134 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-24 04:43:12,640 INFO [train.py:996] (0/4) Epoch 10, batch 2000, loss[loss=0.1988, simple_loss=0.3078, pruned_loss=0.04491, over 20811.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3058, pruned_loss=0.07577, over 4268824.48 frames. ], batch size: 607, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:43:21,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1658712.0, ans=0.0 2023-06-24 04:43:49,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-24 04:44:11,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1658892.0, ans=0.0 2023-06-24 04:44:38,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1658952.0, ans=0.2 2023-06-24 04:44:57,329 INFO [train.py:996] (0/4) Epoch 10, batch 2050, loss[loss=0.1973, simple_loss=0.307, pruned_loss=0.0438, over 21141.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3066, pruned_loss=0.07496, over 4270653.29 frames. 
], batch size: 548, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:45:00,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1659012.0, ans=0.125 2023-06-24 04:45:04,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1659012.0, ans=0.125 2023-06-24 04:45:11,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1659012.0, ans=0.125 2023-06-24 04:45:15,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1659072.0, ans=0.125 2023-06-24 04:45:27,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1659072.0, ans=0.125 2023-06-24 04:45:28,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.378e+02 7.380e+02 1.174e+03 1.683e+03 3.998e+03, threshold=2.349e+03, percent-clipped=22.0 2023-06-24 04:45:43,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1659132.0, ans=0.0 2023-06-24 04:46:10,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1659192.0, ans=0.125 2023-06-24 04:46:37,798 INFO [train.py:996] (0/4) Epoch 10, batch 2100, loss[loss=0.2602, simple_loss=0.3393, pruned_loss=0.09056, over 21572.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3114, pruned_loss=0.07728, over 4271703.42 frames. ], batch size: 230, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:46:49,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1659312.0, ans=0.125 2023-06-24 04:46:55,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1659372.0, ans=0.125 2023-06-24 04:47:05,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1659372.0, ans=0.0 2023-06-24 04:47:32,699 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-24 04:47:34,452 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-24 04:48:15,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1659552.0, ans=0.125 2023-06-24 04:48:17,973 INFO [train.py:996] (0/4) Epoch 10, batch 2150, loss[loss=0.2233, simple_loss=0.3009, pruned_loss=0.07289, over 21845.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3108, pruned_loss=0.07791, over 4276178.54 frames. 
], batch size: 372, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:48:28,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1659612.0, ans=0.1 2023-06-24 04:48:46,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1659672.0, ans=0.04949747468305833 2023-06-24 04:48:48,471 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.248e+02 6.485e+02 1.170e+03 1.690e+03 3.411e+03, threshold=2.340e+03, percent-clipped=8.0 2023-06-24 04:48:53,677 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-24 04:49:11,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1659732.0, ans=0.125 2023-06-24 04:49:12,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1659732.0, ans=0.125 2023-06-24 04:49:49,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1659852.0, ans=0.0 2023-06-24 04:49:58,097 INFO [train.py:996] (0/4) Epoch 10, batch 2200, loss[loss=0.2235, simple_loss=0.2929, pruned_loss=0.07708, over 21388.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3115, pruned_loss=0.07888, over 4280713.49 frames. ], batch size: 471, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:50:20,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1659972.0, ans=0.125 2023-06-24 04:50:23,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1659972.0, ans=0.125 2023-06-24 04:50:47,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1660032.0, ans=0.125 2023-06-24 04:51:03,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1660092.0, ans=0.125 2023-06-24 04:51:08,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1660092.0, ans=0.125 2023-06-24 04:51:25,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.46 vs. limit=10.0 2023-06-24 04:51:37,262 INFO [train.py:996] (0/4) Epoch 10, batch 2250, loss[loss=0.2059, simple_loss=0.267, pruned_loss=0.07242, over 21146.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.309, pruned_loss=0.07702, over 4286312.82 frames. ], batch size: 143, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:51:42,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1660212.0, ans=0.0 2023-06-24 04:51:57,181 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.64 vs. 
limit=15.0 2023-06-24 04:52:08,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.133e+02 6.896e+02 1.012e+03 1.519e+03 4.116e+03, threshold=2.025e+03, percent-clipped=5.0 2023-06-24 04:52:31,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1660332.0, ans=0.125 2023-06-24 04:53:15,547 INFO [train.py:996] (0/4) Epoch 10, batch 2300, loss[loss=0.2285, simple_loss=0.3126, pruned_loss=0.0722, over 21220.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3056, pruned_loss=0.07629, over 4290085.42 frames. ], batch size: 176, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:53:34,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1660572.0, ans=0.125 2023-06-24 04:54:09,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1660632.0, ans=0.0 2023-06-24 04:54:36,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1660692.0, ans=0.0 2023-06-24 04:54:55,358 INFO [train.py:996] (0/4) Epoch 10, batch 2350, loss[loss=0.229, simple_loss=0.2792, pruned_loss=0.08942, over 20110.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3027, pruned_loss=0.07701, over 4283751.65 frames. ], batch size: 703, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:55:13,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1660812.0, ans=0.0 2023-06-24 04:55:32,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.196e+02 7.285e+02 1.033e+03 1.548e+03 3.497e+03, threshold=2.065e+03, percent-clipped=14.0 2023-06-24 04:55:37,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-24 04:55:42,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1660932.0, ans=0.125 2023-06-24 04:55:58,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1660992.0, ans=0.2 2023-06-24 04:56:34,567 INFO [train.py:996] (0/4) Epoch 10, batch 2400, loss[loss=0.2529, simple_loss=0.3208, pruned_loss=0.09256, over 21786.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3049, pruned_loss=0.07896, over 4273793.18 frames. ], batch size: 247, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:58:11,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1661352.0, ans=0.125 2023-06-24 04:58:13,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1661412.0, ans=0.125 2023-06-24 04:58:19,001 INFO [train.py:996] (0/4) Epoch 10, batch 2450, loss[loss=0.2364, simple_loss=0.328, pruned_loss=0.0724, over 21354.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3104, pruned_loss=0.08153, over 4272242.93 frames. 
], batch size: 548, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:58:45,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1661472.0, ans=0.0 2023-06-24 04:58:50,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.358e+02 7.393e+02 1.203e+03 1.868e+03 3.512e+03, threshold=2.405e+03, percent-clipped=21.0 2023-06-24 04:58:59,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1661532.0, ans=0.125 2023-06-24 04:59:56,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1661712.0, ans=0.0 2023-06-24 04:59:58,065 INFO [train.py:996] (0/4) Epoch 10, batch 2500, loss[loss=0.2473, simple_loss=0.3378, pruned_loss=0.07838, over 19843.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3082, pruned_loss=0.08077, over 4256605.46 frames. ], batch size: 702, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:59:58,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1661712.0, ans=0.1 2023-06-24 05:00:01,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1661712.0, ans=0.015 2023-06-24 05:00:11,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1661712.0, ans=0.125 2023-06-24 05:00:28,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1661772.0, ans=0.0 2023-06-24 05:00:36,411 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-24 05:01:08,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1661892.0, ans=22.5 2023-06-24 05:01:39,225 INFO [train.py:996] (0/4) Epoch 10, batch 2550, loss[loss=0.2473, simple_loss=0.3255, pruned_loss=0.08457, over 21667.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3085, pruned_loss=0.08031, over 4256369.78 frames. ], batch size: 298, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:02:11,854 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.129e+02 7.144e+02 9.794e+02 1.361e+03 2.807e+03, threshold=1.959e+03, percent-clipped=4.0 2023-06-24 05:02:20,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1662132.0, ans=0.125 2023-06-24 05:02:32,586 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.45 vs. limit=10.0 2023-06-24 05:02:40,566 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-24 05:03:05,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1662252.0, ans=0.0 2023-06-24 05:03:17,553 INFO [train.py:996] (0/4) Epoch 10, batch 2600, loss[loss=0.2817, simple_loss=0.355, pruned_loss=0.1042, over 17285.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3083, pruned_loss=0.08128, over 4256567.44 frames. 
], batch size: 60, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:04:18,565 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.94 vs. limit=10.0 2023-06-24 05:04:46,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=12.0 2023-06-24 05:04:48,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.59 vs. limit=15.0 2023-06-24 05:04:58,962 INFO [train.py:996] (0/4) Epoch 10, batch 2650, loss[loss=0.2599, simple_loss=0.3279, pruned_loss=0.09596, over 21360.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3101, pruned_loss=0.08303, over 4266027.54 frames. ], batch size: 549, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:05:12,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1662612.0, ans=0.0 2023-06-24 05:05:32,844 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.149e+02 7.187e+02 9.516e+02 1.311e+03 3.015e+03, threshold=1.903e+03, percent-clipped=11.0 2023-06-24 05:06:10,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1662792.0, ans=0.0 2023-06-24 05:06:25,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1662852.0, ans=0.2 2023-06-24 05:06:27,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5 2023-06-24 05:06:34,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1662852.0, ans=0.125 2023-06-24 05:06:40,822 INFO [train.py:996] (0/4) Epoch 10, batch 2700, loss[loss=0.1857, simple_loss=0.2529, pruned_loss=0.05929, over 21422.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3105, pruned_loss=0.08312, over 4274529.82 frames. ], batch size: 194, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:07:43,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1663092.0, ans=0.2 2023-06-24 05:07:56,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1663092.0, ans=0.125 2023-06-24 05:08:21,441 INFO [train.py:996] (0/4) Epoch 10, batch 2750, loss[loss=0.2499, simple_loss=0.3288, pruned_loss=0.08549, over 21834.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3104, pruned_loss=0.08328, over 4273739.28 frames. ], batch size: 118, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:08:33,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. 
limit=10.0 2023-06-24 05:08:38,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1663212.0, ans=0.2 2023-06-24 05:08:52,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1663272.0, ans=0.0 2023-06-24 05:08:54,978 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 8.097e+02 1.070e+03 1.539e+03 2.944e+03, threshold=2.139e+03, percent-clipped=11.0 2023-06-24 05:10:07,022 INFO [train.py:996] (0/4) Epoch 10, batch 2800, loss[loss=0.2125, simple_loss=0.33, pruned_loss=0.04757, over 19700.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3132, pruned_loss=0.08326, over 4278234.68 frames. ], batch size: 702, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 05:11:41,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1663752.0, ans=0.1 2023-06-24 05:11:47,589 INFO [train.py:996] (0/4) Epoch 10, batch 2850, loss[loss=0.2574, simple_loss=0.3126, pruned_loss=0.1011, over 21553.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3151, pruned_loss=0.08468, over 4280736.39 frames. ], batch size: 548, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:12:00,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=22.5 2023-06-24 05:12:04,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1663872.0, ans=0.125 2023-06-24 05:12:20,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1663872.0, ans=0.1 2023-06-24 05:12:27,738 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.081e+02 7.888e+02 1.288e+03 1.995e+03 6.558e+03, threshold=2.577e+03, percent-clipped=20.0 2023-06-24 05:12:49,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1663992.0, ans=0.0 2023-06-24 05:13:19,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1664052.0, ans=0.125 2023-06-24 05:13:23,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1664052.0, ans=0.1 2023-06-24 05:13:27,338 INFO [train.py:996] (0/4) Epoch 10, batch 2900, loss[loss=0.2288, simple_loss=0.2982, pruned_loss=0.07973, over 20096.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3137, pruned_loss=0.08432, over 4284778.43 frames. ], batch size: 702, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:14:47,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1664352.0, ans=0.07 2023-06-24 05:15:04,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1664412.0, ans=0.0 2023-06-24 05:15:05,628 INFO [train.py:996] (0/4) Epoch 10, batch 2950, loss[loss=0.263, simple_loss=0.3336, pruned_loss=0.0962, over 21846.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3142, pruned_loss=0.0836, over 4282238.68 frames. 
], batch size: 371, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:15:14,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1664412.0, ans=0.2 2023-06-24 05:15:45,390 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.895e+02 6.745e+02 8.632e+02 1.337e+03 3.191e+03, threshold=1.726e+03, percent-clipped=2.0 2023-06-24 05:15:57,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1664532.0, ans=0.125 2023-06-24 05:16:22,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1664652.0, ans=0.0 2023-06-24 05:16:44,590 INFO [train.py:996] (0/4) Epoch 10, batch 3000, loss[loss=0.2214, simple_loss=0.3252, pruned_loss=0.05885, over 19761.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3166, pruned_loss=0.08327, over 4284770.20 frames. ], batch size: 704, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:16:44,591 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 05:17:00,553 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2505, simple_loss=0.3452, pruned_loss=0.07794, over 1796401.00 frames. 2023-06-24 05:17:00,554 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 05:17:07,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1664712.0, ans=0.125 2023-06-24 05:17:35,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1664772.0, ans=0.2 2023-06-24 05:18:02,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1664892.0, ans=0.07 2023-06-24 05:18:15,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1664892.0, ans=0.125 2023-06-24 05:18:18,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1664892.0, ans=0.125 2023-06-24 05:18:25,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1664952.0, ans=0.2 2023-06-24 05:18:34,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1664952.0, ans=10.0 2023-06-24 05:18:45,523 INFO [train.py:996] (0/4) Epoch 10, batch 3050, loss[loss=0.2158, simple_loss=0.2912, pruned_loss=0.07024, over 21882.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3181, pruned_loss=0.08207, over 4281856.07 frames. ], batch size: 316, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:18:47,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1665012.0, ans=0.5 2023-06-24 05:18:54,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1665012.0, ans=0.02 2023-06-24 05:18:56,393 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.65 vs. limit=22.5 2023-06-24 05:19:06,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.71 vs. 
limit=22.5 2023-06-24 05:19:09,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1665072.0, ans=0.125 2023-06-24 05:19:21,827 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.132e+02 6.265e+02 9.515e+02 1.393e+03 2.651e+03, threshold=1.903e+03, percent-clipped=13.0 2023-06-24 05:19:49,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1665192.0, ans=0.125 2023-06-24 05:19:57,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1665192.0, ans=0.0 2023-06-24 05:20:18,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1665252.0, ans=0.2 2023-06-24 05:20:25,956 INFO [train.py:996] (0/4) Epoch 10, batch 3100, loss[loss=0.2586, simple_loss=0.3321, pruned_loss=0.09253, over 21760.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3184, pruned_loss=0.08141, over 4282545.88 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:20:52,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1665372.0, ans=0.125 2023-06-24 05:20:58,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1665372.0, ans=0.1 2023-06-24 05:22:10,439 INFO [train.py:996] (0/4) Epoch 10, batch 3150, loss[loss=0.1815, simple_loss=0.2717, pruned_loss=0.04567, over 21656.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3182, pruned_loss=0.08102, over 4278683.13 frames. ], batch size: 263, lr: 3.00e-03, grad_scale: 8.0 2023-06-24 05:22:16,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1665612.0, ans=0.05 2023-06-24 05:22:16,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1665612.0, ans=0.1 2023-06-24 05:22:33,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1665672.0, ans=0.0 2023-06-24 05:22:36,375 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.72 vs. limit=15.0 2023-06-24 05:22:55,171 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.369e+02 7.727e+02 1.146e+03 1.592e+03 4.239e+03, threshold=2.292e+03, percent-clipped=10.0 2023-06-24 05:22:56,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=22.5 2023-06-24 05:23:46,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1665852.0, ans=0.0 2023-06-24 05:23:53,261 INFO [train.py:996] (0/4) Epoch 10, batch 3200, loss[loss=0.1841, simple_loss=0.2643, pruned_loss=0.05201, over 21299.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3201, pruned_loss=0.08102, over 4285412.11 frames. 
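
The optim.py:471 lines above summarize the recent distribution of global gradient norms as five quartile values together with a clipping threshold and the percentage of recently clipped batches. One way to obtain numbers of that shape, shown purely as a sketch (GradNormClipper is a made-up class and the median-based rule is an assumption, not necessarily the optimizer's actual rule): keep a window of recent gradient norms and clip to clipping_scale times their median.

import collections
import torch

class GradNormClipper:
    # Track recent global gradient norms; clip to clipping_scale * median of
    # the window, and expose quartiles / clipped-fraction for logging.
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = collections.deque(maxlen=window)
        self.clipped = 0
        self.steps = 0

    def step(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        if not params:
            return 0.0
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms.append(norm)
        threshold = self.clipping_scale * float(
            torch.median(torch.tensor(list(self.norms))))
        self.steps += 1
        if norm > threshold:
            self.clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)
        return threshold

    def quartiles(self):
        # Min, 25%, median, 75%, max of the recent norms, matching the five
        # numbers printed per "grad-norm quartiles" line.
        t = torch.tensor(list(self.norms))
        return [float(torch.quantile(t, q)) for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
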
], batch size: 176, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:24:01,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1665912.0, ans=0.0 2023-06-24 05:24:29,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1665972.0, ans=0.0 2023-06-24 05:24:38,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1666032.0, ans=0.2 2023-06-24 05:24:56,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1666092.0, ans=0.015 2023-06-24 05:25:29,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1666152.0, ans=0.125 2023-06-24 05:25:34,000 INFO [train.py:996] (0/4) Epoch 10, batch 3250, loss[loss=0.2544, simple_loss=0.2971, pruned_loss=0.1059, over 21419.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3209, pruned_loss=0.08216, over 4281824.69 frames. ], batch size: 510, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:25:37,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1666212.0, ans=0.125 2023-06-24 05:25:57,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1666272.0, ans=0.0 2023-06-24 05:26:16,584 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.011e+02 5.714e+02 8.207e+02 1.474e+03 3.383e+03, threshold=1.641e+03, percent-clipped=8.0 2023-06-24 05:26:19,324 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-24 05:26:56,799 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.37 vs. limit=6.0 2023-06-24 05:27:07,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1666452.0, ans=0.0 2023-06-24 05:27:07,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1666452.0, ans=0.125 2023-06-24 05:27:15,049 INFO [train.py:996] (0/4) Epoch 10, batch 3300, loss[loss=0.28, simple_loss=0.3748, pruned_loss=0.09258, over 21609.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3165, pruned_loss=0.08155, over 4285101.86 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:27:40,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1666572.0, ans=0.2 2023-06-24 05:27:41,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1666572.0, ans=0.2 2023-06-24 05:28:30,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1666692.0, ans=0.125 2023-06-24 05:28:55,180 INFO [train.py:996] (0/4) Epoch 10, batch 3350, loss[loss=0.242, simple_loss=0.3213, pruned_loss=0.08133, over 21485.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3175, pruned_loss=0.08169, over 4280134.97 frames. 
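
The scaling.py:182 lines above show per-module hyper-parameters (dropout probabilities, skip rates, balancer limits and so on) being re-evaluated as a function of the global batch_count. The sketch below is a generic piecewise-linear schedule of that kind; the class name ScheduledValue and the anchor points are illustrative, not the recipe's actual values.

class ScheduledValue:
    # Piecewise-linear schedule over the global batch count, e.g.
    # ScheduledValue((0, 0.3), (20000, 0.1)) decays from 0.3 to 0.1 over the
    # first 20k batches and stays constant afterwards.
    def __init__(self, *points):
        self.points = sorted(points)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                frac = (batch_count - x0) / (x1 - x0)
                return y0 + frac * (y1 - y0)
        return pts[-1][1]

dropout_p = ScheduledValue((0.0, 0.3), (20000.0, 0.1))
print(dropout_p(1665912.0))  # far past the last anchor point -> 0.1
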
], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:29:20,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1666872.0, ans=0.2 2023-06-24 05:29:42,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.528e+02 7.146e+02 1.056e+03 1.768e+03 3.632e+03, threshold=2.111e+03, percent-clipped=30.0 2023-06-24 05:29:44,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0 2023-06-24 05:29:58,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1666932.0, ans=0.1 2023-06-24 05:30:12,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1666992.0, ans=0.0 2023-06-24 05:30:38,703 INFO [train.py:996] (0/4) Epoch 10, batch 3400, loss[loss=0.206, simple_loss=0.275, pruned_loss=0.06855, over 21757.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3193, pruned_loss=0.08235, over 4281902.83 frames. ], batch size: 112, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:31:36,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-24 05:31:36,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-24 05:31:41,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1667292.0, ans=0.125 2023-06-24 05:31:47,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1667292.0, ans=0.0 2023-06-24 05:31:50,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-24 05:31:57,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.72 vs. limit=12.0 2023-06-24 05:32:18,966 INFO [train.py:996] (0/4) Epoch 10, batch 3450, loss[loss=0.2447, simple_loss=0.3034, pruned_loss=0.09301, over 21484.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3156, pruned_loss=0.08241, over 4274403.53 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:32:34,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1667412.0, ans=0.125 2023-06-24 05:33:00,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 9.142e+02 1.242e+03 1.836e+03 3.790e+03, threshold=2.483e+03, percent-clipped=19.0 2023-06-24 05:33:00,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1667532.0, ans=0.0 2023-06-24 05:33:20,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.39 vs. 
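
The scaling.py:962 "Whitening" lines above print, for a named activation, a metric of how far its feature covariance is from being white (proportional to the identity), compared against a scheduled limit. The sketch below computes one common statistic of that kind, the mean squared eigenvalue of the covariance divided by the squared mean eigenvalue, which equals 1.0 for perfectly whitened features; treating this as the logged metric is an assumption, and whitening_metric is a made-up helper.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (num_frames, num_channels). Split channels into groups, estimate
    # each group's covariance, and measure how uneven its eigenvalues are;
    # 1.0 means perfectly white, larger means strongly correlated dimensions.
    n, c = x.shape
    x = x.reshape(n, num_groups, c // num_groups)
    metrics = []
    for g in range(num_groups):
        xg = x[:, g, :]
        xg = xg - xg.mean(dim=0, keepdim=True)
        cov = (xg.T @ xg) / n
        eigs = torch.linalg.eigvalsh(cov)
        metrics.append(float(eigs.pow(2).mean() / eigs.mean().pow(2)))
    return max(metrics)

x = torch.randn(1000, 256)   # nearly white input
print(whitening_metric(x))   # close to 1.0, well under a limit such as 22.5
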
limit=15.0 2023-06-24 05:33:21,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1667592.0, ans=0.125 2023-06-24 05:33:27,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1667592.0, ans=0.2 2023-06-24 05:34:02,175 INFO [train.py:996] (0/4) Epoch 10, batch 3500, loss[loss=0.2651, simple_loss=0.3399, pruned_loss=0.09518, over 21421.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3231, pruned_loss=0.08571, over 4273817.99 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:34:12,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1667712.0, ans=0.0 2023-06-24 05:34:25,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1667772.0, ans=0.0 2023-06-24 05:34:35,150 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=12.0 2023-06-24 05:34:50,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1667832.0, ans=0.0 2023-06-24 05:34:52,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1667832.0, ans=0.125 2023-06-24 05:35:14,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1667892.0, ans=0.2 2023-06-24 05:35:22,474 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:35:41,138 INFO [train.py:996] (0/4) Epoch 10, batch 3550, loss[loss=0.2781, simple_loss=0.3144, pruned_loss=0.1209, over 21312.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3257, pruned_loss=0.08824, over 4279066.49 frames. ], batch size: 507, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:35:55,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1668012.0, ans=0.2 2023-06-24 05:35:58,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1668012.0, ans=0.125 2023-06-24 05:36:22,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.441e+02 8.129e+02 1.130e+03 1.802e+03 3.924e+03, threshold=2.259e+03, percent-clipped=11.0 2023-06-24 05:37:11,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1668252.0, ans=15.0 2023-06-24 05:37:20,206 INFO [train.py:996] (0/4) Epoch 10, batch 3600, loss[loss=0.2834, simple_loss=0.3527, pruned_loss=0.1071, over 21804.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3192, pruned_loss=0.08662, over 4283250.57 frames. ], batch size: 124, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 05:37:48,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1668372.0, ans=0.125 2023-06-24 05:38:08,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. 
limit=6.0 2023-06-24 05:38:20,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=22.5 2023-06-24 05:38:36,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1668492.0, ans=0.125 2023-06-24 05:38:49,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1668552.0, ans=0.2 2023-06-24 05:39:01,049 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-24 05:39:01,634 INFO [train.py:996] (0/4) Epoch 10, batch 3650, loss[loss=0.271, simple_loss=0.3406, pruned_loss=0.1007, over 21293.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.321, pruned_loss=0.08776, over 4286278.55 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:39:19,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1668612.0, ans=0.125 2023-06-24 05:39:22,256 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:39:38,463 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:39:43,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.119e+02 6.332e+02 8.468e+02 1.461e+03 3.139e+03, threshold=1.694e+03, percent-clipped=4.0 2023-06-24 05:40:05,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1668792.0, ans=0.125 2023-06-24 05:40:39,710 INFO [train.py:996] (0/4) Epoch 10, batch 3700, loss[loss=0.2345, simple_loss=0.314, pruned_loss=0.07747, over 21272.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3208, pruned_loss=0.08728, over 4287474.51 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:40:41,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1668912.0, ans=0.0 2023-06-24 05:40:58,198 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-24 05:40:59,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1668972.0, ans=0.1 2023-06-24 05:41:37,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1669092.0, ans=0.125 2023-06-24 05:42:06,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1669152.0, ans=0.0 2023-06-24 05:42:20,651 INFO [train.py:996] (0/4) Epoch 10, batch 3750, loss[loss=0.259, simple_loss=0.3248, pruned_loss=0.09659, over 21834.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3196, pruned_loss=0.08684, over 4290248.62 frames. 
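
The grad_scale value printed with every train.py:996 line (32.0 at batch 3600, back to 16.0 at batch 3650 above) reflects the dynamic loss scaling used for fp16 training: the scale is reduced when an overflow is detected and grown again after a run of clean steps. A minimal, generic sketch with torch.cuda.amp follows; fp16_train_step and compute_loss are hypothetical names and the GradScaler settings are illustrative.

import torch

def fp16_train_step(model, optimizer, scaler, batch, compute_loss):
    # One mixed-precision step; scaler.get_scale() returns the current loss
    # scale, the kind of value shown as grad_scale in the lines above.
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()  # shrinks the scale after an overflow, grows it later
    return loss.detach(), scaler.get_scale()

scaler = torch.cuda.amp.GradScaler(init_scale=32.0, growth_interval=2000)
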
], batch size: 391, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:43:00,074 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.152e+02 7.119e+02 9.667e+02 1.381e+03 3.413e+03, threshold=1.933e+03, percent-clipped=11.0 2023-06-24 05:43:03,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1669332.0, ans=0.125 2023-06-24 05:44:00,639 INFO [train.py:996] (0/4) Epoch 10, batch 3800, loss[loss=0.2642, simple_loss=0.3518, pruned_loss=0.08829, over 21459.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.315, pruned_loss=0.08392, over 4288191.43 frames. ], batch size: 131, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:44:19,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1669572.0, ans=0.0 2023-06-24 05:44:46,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2023-06-24 05:44:50,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1669632.0, ans=0.025 2023-06-24 05:45:06,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.18 vs. limit=15.0 2023-06-24 05:45:18,995 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=12.0 2023-06-24 05:45:24,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1669752.0, ans=0.125 2023-06-24 05:45:24,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1669752.0, ans=0.125 2023-06-24 05:45:34,341 INFO [train.py:996] (0/4) Epoch 10, batch 3850, loss[loss=0.2209, simple_loss=0.295, pruned_loss=0.07338, over 15156.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3121, pruned_loss=0.08405, over 4284584.93 frames. ], batch size: 60, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:46:06,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.754e+02 6.682e+02 9.971e+02 1.611e+03 3.519e+03, threshold=1.994e+03, percent-clipped=16.0 2023-06-24 05:46:10,866 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0 2023-06-24 05:46:34,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1669992.0, ans=0.0 2023-06-24 05:47:01,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1670052.0, ans=0.0 2023-06-24 05:47:06,834 INFO [train.py:996] (0/4) Epoch 10, batch 3900, loss[loss=0.1937, simple_loss=0.2477, pruned_loss=0.0699, over 20704.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3076, pruned_loss=0.0834, over 4266273.02 frames. ], batch size: 607, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:47:18,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.09 vs. 
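
The learning rate printed above decays very slowly, dropping from 3.00e-03 to 2.99e-03 around batch 3800, and depends on both the global batch index and the epoch. The sketch below shows a schedule of that general shape, written as a product of a batch-dependent and an epoch-dependent factor; scheduled_lr and the constants lr_batches / lr_epochs are illustrative assumptions and are not claimed to reproduce the exact values in this log.

def scheduled_lr(base_lr: float, batch: int, epoch: float,
                 lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    # Both factors start near 1.0 and shrink smoothly as training progresses,
    # so the printed lr changes only slightly between logging steps.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

print(scheduled_lr(0.05, batch=10_000, epoch=2.0))
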
limit=15.0 2023-06-24 05:47:49,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1670232.0, ans=0.125 2023-06-24 05:48:29,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1670352.0, ans=0.0 2023-06-24 05:48:51,334 INFO [train.py:996] (0/4) Epoch 10, batch 3950, loss[loss=0.229, simple_loss=0.3205, pruned_loss=0.06877, over 21728.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3102, pruned_loss=0.08223, over 4270425.72 frames. ], batch size: 351, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:49:00,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1670412.0, ans=0.0 2023-06-24 05:49:28,747 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.660e+02 7.100e+02 1.207e+03 1.862e+03 3.460e+03, threshold=2.413e+03, percent-clipped=21.0 2023-06-24 05:49:38,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1670532.0, ans=0.1 2023-06-24 05:49:41,110 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=12.0 2023-06-24 05:49:48,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=12.0 2023-06-24 05:49:57,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1670592.0, ans=0.2 2023-06-24 05:50:05,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1670592.0, ans=0.0 2023-06-24 05:50:29,181 INFO [train.py:996] (0/4) Epoch 10, batch 4000, loss[loss=0.212, simple_loss=0.2821, pruned_loss=0.07091, over 21712.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3047, pruned_loss=0.07928, over 4271129.46 frames. ], batch size: 316, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 05:51:13,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1670832.0, ans=0.125 2023-06-24 05:51:21,495 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:52:09,823 INFO [train.py:996] (0/4) Epoch 10, batch 4050, loss[loss=0.2146, simple_loss=0.2771, pruned_loss=0.07602, over 21494.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3028, pruned_loss=0.07801, over 4276816.49 frames. ], batch size: 548, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 05:52:51,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.639e+02 6.579e+02 8.856e+02 1.407e+03 2.917e+03, threshold=1.771e+03, percent-clipped=4.0 2023-06-24 05:53:40,636 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-24 05:53:49,033 INFO [train.py:996] (0/4) Epoch 10, batch 4100, loss[loss=0.2157, simple_loss=0.2939, pruned_loss=0.06873, over 21724.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.307, pruned_loss=0.0784, over 4266345.97 frames. 
], batch size: 247, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 05:53:59,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1671312.0, ans=0.035 2023-06-24 05:54:57,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1671492.0, ans=0.2 2023-06-24 05:55:28,612 INFO [train.py:996] (0/4) Epoch 10, batch 4150, loss[loss=0.1737, simple_loss=0.2593, pruned_loss=0.04401, over 21801.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.307, pruned_loss=0.07481, over 4274605.72 frames. ], batch size: 118, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 05:56:17,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.253e+02 6.615e+02 8.834e+02 1.095e+03 2.475e+03, threshold=1.767e+03, percent-clipped=7.0 2023-06-24 05:56:57,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1671852.0, ans=0.125 2023-06-24 05:57:09,382 INFO [train.py:996] (0/4) Epoch 10, batch 4200, loss[loss=0.2265, simple_loss=0.3079, pruned_loss=0.07255, over 21902.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3067, pruned_loss=0.07499, over 4277972.19 frames. ], batch size: 373, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:57:31,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-24 05:58:12,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1672032.0, ans=0.0 2023-06-24 05:58:24,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1672092.0, ans=0.2 2023-06-24 05:59:00,785 INFO [train.py:996] (0/4) Epoch 10, batch 4250, loss[loss=0.2576, simple_loss=0.3387, pruned_loss=0.08825, over 21273.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3111, pruned_loss=0.07619, over 4272720.02 frames. ], batch size: 131, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:59:40,898 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 7.026e+02 9.985e+02 1.582e+03 3.548e+03, threshold=1.997e+03, percent-clipped=19.0 2023-06-24 06:00:24,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1672452.0, ans=15.0 2023-06-24 06:00:43,131 INFO [train.py:996] (0/4) Epoch 10, batch 4300, loss[loss=0.241, simple_loss=0.3283, pruned_loss=0.07685, over 21665.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3183, pruned_loss=0.07839, over 4264947.73 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:00:45,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1672512.0, ans=0.1 2023-06-24 06:00:58,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1672512.0, ans=0.125 2023-06-24 06:02:26,305 INFO [train.py:996] (0/4) Epoch 10, batch 4350, loss[loss=0.2036, simple_loss=0.283, pruned_loss=0.06214, over 21369.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3183, pruned_loss=0.0787, over 4265141.95 frames. 
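
The batch size reported above varies widely (e.g. 118 cuts at batch 4150 vs. 373 at batch 4200) because batches are formed by total audio duration rather than by a fixed number of utterances, with similar-length cuts bucketed together. The toy sketch below shows duration-capped batching; it illustrates the idea only and is not lhotse's DynamicBucketingSampler, and batches_by_duration with its max_duration value is made up.

def batches_by_duration(cuts, max_duration: float = 600.0):
    # cuts: iterable of (cut_id, duration_seconds), assumed roughly sorted by
    # duration (the "bucketing" part). Emit batches whose total duration stays
    # under max_duration, so batches of short utterances hold many more cuts.
    batch, total = [], 0.0
    for cut_id, dur in cuts:
        if batch and total + dur > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(cut_id)
        total += dur
    if batch:
        yield batch

cuts = [(f"cut-{i}", 2.0 + (i % 5)) for i in range(50)]
for b in batches_by_duration(cuts, max_duration=60.0):
    print(len(b))
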
], batch size: 211, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:03:05,534 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.531e+02 6.687e+02 1.042e+03 1.785e+03 5.548e+03, threshold=2.083e+03, percent-clipped=20.0 2023-06-24 06:03:06,489 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.62 vs. limit=6.0 2023-06-24 06:03:57,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=22.5 2023-06-24 06:04:04,323 INFO [train.py:996] (0/4) Epoch 10, batch 4400, loss[loss=0.2043, simple_loss=0.2748, pruned_loss=0.06691, over 21462.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3141, pruned_loss=0.07898, over 4268671.46 frames. ], batch size: 212, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:04:29,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1673172.0, ans=0.0 2023-06-24 06:04:35,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1673172.0, ans=0.1 2023-06-24 06:04:55,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1673232.0, ans=0.035 2023-06-24 06:05:45,301 INFO [train.py:996] (0/4) Epoch 10, batch 4450, loss[loss=0.289, simple_loss=0.3727, pruned_loss=0.1027, over 21406.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3219, pruned_loss=0.08109, over 4275459.56 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:06:10,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1673472.0, ans=0.2 2023-06-24 06:06:30,823 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.122e+02 7.408e+02 1.039e+03 1.368e+03 2.536e+03, threshold=2.077e+03, percent-clipped=7.0 2023-06-24 06:06:38,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=1673532.0, ans=12.0 2023-06-24 06:06:53,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1673592.0, ans=0.125 2023-06-24 06:07:23,218 INFO [train.py:996] (0/4) Epoch 10, batch 4500, loss[loss=0.2326, simple_loss=0.3095, pruned_loss=0.07784, over 21342.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3215, pruned_loss=0.08212, over 4278808.63 frames. ], batch size: 176, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:07:46,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1673772.0, ans=0.0 2023-06-24 06:09:05,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1673952.0, ans=0.125 2023-06-24 06:09:08,468 INFO [train.py:996] (0/4) Epoch 10, batch 4550, loss[loss=0.2656, simple_loss=0.3414, pruned_loss=0.09493, over 21689.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3247, pruned_loss=0.08316, over 4283262.16 frames. 
], batch size: 112, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:09:31,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1674072.0, ans=0.125 2023-06-24 06:09:43,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1674072.0, ans=0.125 2023-06-24 06:09:55,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.449e+02 7.044e+02 9.475e+02 1.574e+03 2.834e+03, threshold=1.895e+03, percent-clipped=10.0 2023-06-24 06:10:17,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1674192.0, ans=0.125 2023-06-24 06:10:26,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1674192.0, ans=0.125 2023-06-24 06:10:29,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1674252.0, ans=0.125 2023-06-24 06:10:40,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1674252.0, ans=0.125 2023-06-24 06:10:49,478 INFO [train.py:996] (0/4) Epoch 10, batch 4600, loss[loss=0.2758, simple_loss=0.3431, pruned_loss=0.1043, over 21761.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3301, pruned_loss=0.08551, over 4285426.75 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:10:55,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-24 06:11:16,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1674372.0, ans=0.2 2023-06-24 06:12:24,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1674552.0, ans=0.125 2023-06-24 06:12:27,176 INFO [train.py:996] (0/4) Epoch 10, batch 4650, loss[loss=0.2254, simple_loss=0.2986, pruned_loss=0.07613, over 21908.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3247, pruned_loss=0.08416, over 4285771.39 frames. ], batch size: 351, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:12:59,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1674672.0, ans=0.2 2023-06-24 06:13:18,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.211e+02 6.703e+02 9.514e+02 1.360e+03 2.442e+03, threshold=1.903e+03, percent-clipped=9.0 2023-06-24 06:13:56,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.30 vs. limit=22.5 2023-06-24 06:14:04,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1674912.0, ans=0.0 2023-06-24 06:14:05,764 INFO [train.py:996] (0/4) Epoch 10, batch 4700, loss[loss=0.1943, simple_loss=0.2662, pruned_loss=0.06117, over 21825.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.315, pruned_loss=0.08144, over 4285955.90 frames. 
], batch size: 107, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:14:47,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1675032.0, ans=0.2 2023-06-24 06:15:11,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1675092.0, ans=0.0 2023-06-24 06:15:17,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1675092.0, ans=0.125 2023-06-24 06:15:35,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1675152.0, ans=0.0 2023-06-24 06:15:44,559 INFO [train.py:996] (0/4) Epoch 10, batch 4750, loss[loss=0.2417, simple_loss=0.3086, pruned_loss=0.0874, over 21745.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3088, pruned_loss=0.08118, over 4284093.11 frames. ], batch size: 415, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:16:00,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1675212.0, ans=0.0 2023-06-24 06:16:03,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1675272.0, ans=0.125 2023-06-24 06:16:12,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1675272.0, ans=0.015 2023-06-24 06:16:13,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.04 vs. limit=22.5 2023-06-24 06:16:35,101 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.386e+02 6.636e+02 9.717e+02 1.456e+03 3.310e+03, threshold=1.943e+03, percent-clipped=9.0 2023-06-24 06:16:43,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1675332.0, ans=0.1 2023-06-24 06:17:27,599 INFO [train.py:996] (0/4) Epoch 10, batch 4800, loss[loss=0.1878, simple_loss=0.2486, pruned_loss=0.06345, over 21208.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3072, pruned_loss=0.08013, over 4286164.90 frames. ], batch size: 548, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:17:45,973 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-24 06:18:03,848 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-24 06:18:28,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1675692.0, ans=0.2 2023-06-24 06:19:05,392 INFO [train.py:996] (0/4) Epoch 10, batch 4850, loss[loss=0.266, simple_loss=0.3281, pruned_loss=0.102, over 21866.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3052, pruned_loss=0.07951, over 4289793.20 frames. 
], batch size: 124, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:19:18,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1675812.0, ans=0.0 2023-06-24 06:19:39,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1675872.0, ans=0.05 2023-06-24 06:19:47,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1675932.0, ans=0.125 2023-06-24 06:19:53,825 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.198e+02 7.060e+02 1.085e+03 1.594e+03 2.809e+03, threshold=2.169e+03, percent-clipped=13.0 2023-06-24 06:20:09,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1675992.0, ans=0.125 2023-06-24 06:20:26,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1676052.0, ans=0.025 2023-06-24 06:20:45,023 INFO [train.py:996] (0/4) Epoch 10, batch 4900, loss[loss=0.2717, simple_loss=0.4059, pruned_loss=0.06872, over 20765.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3084, pruned_loss=0.08082, over 4290502.82 frames. ], batch size: 607, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:20:48,023 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.43 vs. limit=15.0 2023-06-24 06:21:23,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1676172.0, ans=0.125 2023-06-24 06:21:28,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1676232.0, ans=0.125 2023-06-24 06:21:46,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1676292.0, ans=0.125 2023-06-24 06:22:29,696 INFO [train.py:996] (0/4) Epoch 10, batch 4950, loss[loss=0.1912, simple_loss=0.2871, pruned_loss=0.04763, over 21732.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3131, pruned_loss=0.07859, over 4290177.68 frames. ], batch size: 351, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:22:59,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0 2023-06-24 06:23:12,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.855e+02 5.943e+02 9.817e+02 1.512e+03 3.334e+03, threshold=1.963e+03, percent-clipped=7.0 2023-06-24 06:23:55,464 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2023-06-24 06:24:03,580 INFO [train.py:996] (0/4) Epoch 10, batch 5000, loss[loss=0.2397, simple_loss=0.3004, pruned_loss=0.08947, over 20113.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3129, pruned_loss=0.07612, over 4291955.53 frames. ], batch size: 703, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:24:20,333 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. 
limit=10.0 2023-06-24 06:24:23,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=12.0 2023-06-24 06:25:22,557 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.28 vs. limit=6.0 2023-06-24 06:25:23,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1676952.0, ans=0.0 2023-06-24 06:25:41,913 INFO [train.py:996] (0/4) Epoch 10, batch 5050, loss[loss=0.2499, simple_loss=0.3197, pruned_loss=0.09009, over 21436.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3127, pruned_loss=0.07793, over 4284574.76 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:26:22,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1677132.0, ans=0.2 2023-06-24 06:26:30,084 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.758e+02 6.527e+02 8.985e+02 1.399e+03 2.450e+03, threshold=1.797e+03, percent-clipped=5.0 2023-06-24 06:26:46,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1677192.0, ans=0.125 2023-06-24 06:27:16,414 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.56 vs. limit=15.0 2023-06-24 06:27:21,206 INFO [train.py:996] (0/4) Epoch 10, batch 5100, loss[loss=0.2129, simple_loss=0.2833, pruned_loss=0.0712, over 21299.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3088, pruned_loss=0.07818, over 4288253.13 frames. ], batch size: 176, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:27:38,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1677312.0, ans=10.0 2023-06-24 06:28:00,836 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-24 06:28:11,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1677432.0, ans=0.125 2023-06-24 06:28:13,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1677432.0, ans=0.125 2023-06-24 06:28:30,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1677492.0, ans=0.125 2023-06-24 06:28:40,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=22.5 2023-06-24 06:29:01,702 INFO [train.py:996] (0/4) Epoch 10, batch 5150, loss[loss=0.2776, simple_loss=0.4004, pruned_loss=0.07736, over 20765.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.31, pruned_loss=0.07982, over 4293570.90 frames. ], batch size: 607, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:29:06,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. 
limit=15.0 2023-06-24 06:29:50,182 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.999e+02 6.243e+02 9.247e+02 1.548e+03 4.552e+03, threshold=1.849e+03, percent-clipped=17.0 2023-06-24 06:30:41,406 INFO [train.py:996] (0/4) Epoch 10, batch 5200, loss[loss=0.2108, simple_loss=0.3079, pruned_loss=0.0569, over 21612.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3098, pruned_loss=0.07966, over 4282065.58 frames. ], batch size: 230, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:31:06,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1677972.0, ans=0.1 2023-06-24 06:31:15,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1677972.0, ans=0.2 2023-06-24 06:31:16,501 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.61 vs. limit=15.0 2023-06-24 06:31:25,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1678032.0, ans=0.125 2023-06-24 06:31:36,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1678032.0, ans=0.0 2023-06-24 06:32:17,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1678152.0, ans=0.125 2023-06-24 06:32:20,312 INFO [train.py:996] (0/4) Epoch 10, batch 5250, loss[loss=0.2049, simple_loss=0.2904, pruned_loss=0.05975, over 21560.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3135, pruned_loss=0.07786, over 4280118.06 frames. ], batch size: 230, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:32:51,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1678272.0, ans=0.0 2023-06-24 06:33:00,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1678272.0, ans=0.1 2023-06-24 06:33:09,828 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.205e+02 5.630e+02 8.256e+02 1.146e+03 2.990e+03, threshold=1.651e+03, percent-clipped=4.0 2023-06-24 06:33:45,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1678452.0, ans=0.1 2023-06-24 06:34:00,837 INFO [train.py:996] (0/4) Epoch 10, batch 5300, loss[loss=0.237, simple_loss=0.3071, pruned_loss=0.08341, over 21461.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3126, pruned_loss=0.07853, over 4289437.41 frames. ], batch size: 194, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:34:26,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1678572.0, ans=0.1 2023-06-24 06:34:31,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1678572.0, ans=0.2 2023-06-24 06:34:33,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1678572.0, ans=0.125 2023-06-24 06:34:50,950 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. 
limit=15.0 2023-06-24 06:35:38,853 INFO [train.py:996] (0/4) Epoch 10, batch 5350, loss[loss=0.209, simple_loss=0.2759, pruned_loss=0.07105, over 21912.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3105, pruned_loss=0.07984, over 4294485.47 frames. ], batch size: 316, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:35:39,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1678812.0, ans=0.125 2023-06-24 06:35:40,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1678812.0, ans=0.0 2023-06-24 06:35:46,068 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:35:48,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1678812.0, ans=0.0 2023-06-24 06:36:15,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1678932.0, ans=0.0 2023-06-24 06:36:23,617 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.431e+02 6.516e+02 8.462e+02 1.218e+03 2.526e+03, threshold=1.692e+03, percent-clipped=10.0 2023-06-24 06:36:29,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1678932.0, ans=0.125 2023-06-24 06:36:37,637 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=15.0 2023-06-24 06:36:53,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1678992.0, ans=0.125 2023-06-24 06:37:12,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1679112.0, ans=0.0 2023-06-24 06:37:13,521 INFO [train.py:996] (0/4) Epoch 10, batch 5400, loss[loss=0.2058, simple_loss=0.2556, pruned_loss=0.07804, over 20753.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3094, pruned_loss=0.08125, over 4294845.62 frames. ], batch size: 608, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:37:50,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1679172.0, ans=0.2 2023-06-24 06:37:58,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1679232.0, ans=0.1 2023-06-24 06:38:44,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1679352.0, ans=0.125 2023-06-24 06:38:58,906 INFO [train.py:996] (0/4) Epoch 10, batch 5450, loss[loss=0.2571, simple_loss=0.3547, pruned_loss=0.07974, over 21790.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3107, pruned_loss=0.07974, over 4295455.55 frames. ], batch size: 298, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:39:53,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 6.727e+02 1.128e+03 1.838e+03 3.883e+03, threshold=2.256e+03, percent-clipped=29.0 2023-06-24 06:39:56,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-24 06:39:56,428 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. 
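
The scaling.py:1052 "WithLoss" lines above report a running loss-sum attached to specific attention-weight tensors; the 0.000e+00 values indicate that this auxiliary term is currently contributing nothing. The sketch below is only a guess at the kind of bookkeeping behind such messages: AuxLossTracker is a made-up helper that applies a scaled penalty to an intermediate tensor and accumulates the total for logging.

import torch

class AuxLossTracker:
    # Apply a penalty to an intermediate tensor, return it so it can be added
    # to the training loss, and keep a running sum for log messages.
    def __init__(self, name: str, penalty_scale: float = 0.0):
        self.name = name
        self.penalty_scale = penalty_scale
        self.loss_sum = 0.0

    def __call__(self, attn_weights: torch.Tensor) -> torch.Tensor:
        penalty = self.penalty_scale * attn_weights.pow(2).mean()
        self.loss_sum += float(penalty.detach())
        return penalty

tracker = AuxLossTracker("encoder.encoders.3.encoder.layers.0.self_attn_weights")
tracker(torch.softmax(torch.randn(4, 8, 10, 10), dim=-1))
print(f"WithLoss: name={tracker.name}, loss-sum={tracker.loss_sum:.3e}")
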
limit=15.0 2023-06-24 06:40:15,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1679592.0, ans=0.125 2023-06-24 06:40:41,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1679652.0, ans=0.0 2023-06-24 06:40:43,787 INFO [train.py:996] (0/4) Epoch 10, batch 5500, loss[loss=0.2036, simple_loss=0.3031, pruned_loss=0.05201, over 21683.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3159, pruned_loss=0.07678, over 4290375.84 frames. ], batch size: 298, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:41:05,899 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:41:10,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1679772.0, ans=0.0 2023-06-24 06:41:41,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1679892.0, ans=0.125 2023-06-24 06:42:20,306 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-280000.pt 2023-06-24 06:42:31,507 INFO [train.py:996] (0/4) Epoch 10, batch 5550, loss[loss=0.1774, simple_loss=0.2493, pruned_loss=0.05278, over 21080.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3147, pruned_loss=0.07416, over 4291848.03 frames. ], batch size: 143, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:42:33,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1680012.0, ans=0.95 2023-06-24 06:42:35,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1680012.0, ans=0.0 2023-06-24 06:42:39,068 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-24 06:43:16,343 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 5.799e+02 8.739e+02 1.452e+03 3.739e+03, threshold=1.748e+03, percent-clipped=10.0 2023-06-24 06:43:31,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1680132.0, ans=0.125 2023-06-24 06:44:16,152 INFO [train.py:996] (0/4) Epoch 10, batch 5600, loss[loss=0.2592, simple_loss=0.3534, pruned_loss=0.08251, over 21823.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3139, pruned_loss=0.07202, over 4284178.70 frames. ], batch size: 316, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:44:27,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1680312.0, ans=0.04949747468305833 2023-06-24 06:44:36,187 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.76 vs. 
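
The checkpoint.py:75 line above shows a batch-level checkpoint being written to the experiment directory as checkpoint-280000.pt, i.e. keyed by the global batch index, in addition to whatever per-epoch checkpoints the run keeps. A minimal sketch of that pattern with a plain torch.save payload follows; save_batch_checkpoint is a hypothetical helper, and a real checkpoint would typically store more state (scheduler, grad scaler, sampler state, and so on).

import torch
from pathlib import Path

def save_batch_checkpoint(exp_dir, model, optimizer, batch_idx_train: int):
    # Write exp_dir/checkpoint-<global-batch-index>.pt; the caller decides
    # the cadence, e.g. once every few thousand training batches.
    exp_dir = Path(exp_dir)
    exp_dir.mkdir(parents=True, exist_ok=True)
    path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        path,
    )
    return path
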
limit=15.0 2023-06-24 06:44:40,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1680372.0, ans=0.0 2023-06-24 06:44:45,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1680372.0, ans=0.2 2023-06-24 06:44:47,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1680432.0, ans=0.0 2023-06-24 06:45:06,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.61 vs. limit=15.0 2023-06-24 06:45:44,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1680552.0, ans=0.1 2023-06-24 06:45:52,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1680552.0, ans=0.125 2023-06-24 06:45:55,110 INFO [train.py:996] (0/4) Epoch 10, batch 5650, loss[loss=0.2752, simple_loss=0.3445, pruned_loss=0.103, over 21720.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.319, pruned_loss=0.07443, over 4288909.92 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:45:55,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1680612.0, ans=0.125 2023-06-24 06:46:09,427 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=12.0 2023-06-24 06:46:46,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.674e+02 6.342e+02 8.278e+02 1.256e+03 3.323e+03, threshold=1.656e+03, percent-clipped=10.0 2023-06-24 06:47:15,278 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-24 06:47:34,657 INFO [train.py:996] (0/4) Epoch 10, batch 5700, loss[loss=0.2453, simple_loss=0.3396, pruned_loss=0.07549, over 21650.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3176, pruned_loss=0.07601, over 4294834.53 frames. ], batch size: 389, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:47:35,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1680912.0, ans=6.0 2023-06-24 06:47:35,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1680912.0, ans=22.5 2023-06-24 06:48:01,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1680972.0, ans=0.125 2023-06-24 06:48:10,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1680972.0, ans=0.0 2023-06-24 06:48:26,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1681032.0, ans=0.125 2023-06-24 06:48:39,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.02 vs. 
limit=15.0 2023-06-24 06:48:42,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1681092.0, ans=0.125 2023-06-24 06:48:58,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1681152.0, ans=0.0 2023-06-24 06:48:58,821 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=12.0 2023-06-24 06:49:15,869 INFO [train.py:996] (0/4) Epoch 10, batch 5750, loss[loss=0.2289, simple_loss=0.3341, pruned_loss=0.06187, over 21186.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3127, pruned_loss=0.07212, over 4292876.40 frames. ], batch size: 548, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:50:01,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1681332.0, ans=0.125 2023-06-24 06:50:11,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1681332.0, ans=0.07 2023-06-24 06:50:12,191 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.981e+02 6.985e+02 1.085e+03 1.966e+03 4.482e+03, threshold=2.170e+03, percent-clipped=31.0 2023-06-24 06:50:55,681 INFO [train.py:996] (0/4) Epoch 10, batch 5800, loss[loss=0.209, simple_loss=0.3139, pruned_loss=0.05202, over 21749.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3119, pruned_loss=0.07081, over 4289872.70 frames. ], batch size: 298, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:51:12,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1681512.0, ans=0.125 2023-06-24 06:51:54,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1681632.0, ans=0.125 2023-06-24 06:52:21,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1681752.0, ans=0.0 2023-06-24 06:52:37,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1681752.0, ans=0.125 2023-06-24 06:52:40,281 INFO [train.py:996] (0/4) Epoch 10, batch 5850, loss[loss=0.2233, simple_loss=0.3216, pruned_loss=0.06249, over 21479.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3098, pruned_loss=0.06709, over 4286523.09 frames. ], batch size: 507, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:52:55,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1681812.0, ans=0.125 2023-06-24 06:53:05,830 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. 
limit=6.0 2023-06-24 06:53:13,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1681872.0, ans=0.0 2023-06-24 06:53:36,618 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.591e+02 5.271e+02 8.161e+02 1.450e+03 2.978e+03, threshold=1.632e+03, percent-clipped=6.0 2023-06-24 06:53:46,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1681992.0, ans=0.0 2023-06-24 06:53:51,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=22.5 2023-06-24 06:54:07,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1682052.0, ans=0.2 2023-06-24 06:54:22,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1682112.0, ans=0.125 2023-06-24 06:54:23,940 INFO [train.py:996] (0/4) Epoch 10, batch 5900, loss[loss=0.17, simple_loss=0.2423, pruned_loss=0.04885, over 16525.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.3025, pruned_loss=0.06213, over 4278645.63 frames. ], batch size: 61, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:54:50,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-24 06:54:51,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1682172.0, ans=0.125 2023-06-24 06:55:07,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1682232.0, ans=0.0 2023-06-24 06:55:15,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1682232.0, ans=0.125 2023-06-24 06:55:37,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1682352.0, ans=0.1 2023-06-24 06:55:46,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1682352.0, ans=0.2 2023-06-24 06:56:02,447 INFO [train.py:996] (0/4) Epoch 10, batch 5950, loss[loss=0.2578, simple_loss=0.3202, pruned_loss=0.09768, over 21869.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.3005, pruned_loss=0.06537, over 4282675.26 frames. ], batch size: 414, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:56:37,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-24 06:56:52,544 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.371e+02 5.849e+02 7.878e+02 1.100e+03 2.007e+03, threshold=1.576e+03, percent-clipped=6.0 2023-06-24 06:57:10,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1682592.0, ans=0.125 2023-06-24 06:57:37,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1682652.0, ans=0.0 2023-06-24 06:57:40,442 INFO [train.py:996] (0/4) Epoch 10, batch 6000, loss[loss=0.2068, simple_loss=0.2624, pruned_loss=0.07555, over 21241.00 frames. 
], tot_loss[loss=0.2173, simple_loss=0.2966, pruned_loss=0.06893, over 4280893.64 frames. ], batch size: 548, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 06:57:40,443 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 06:57:59,725 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2611, simple_loss=0.3564, pruned_loss=0.0829, over 1796401.00 frames. 2023-06-24 06:57:59,725 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 06:58:00,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1682712.0, ans=0.125 2023-06-24 06:59:01,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-24 06:59:32,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.18 vs. limit=6.0 2023-06-24 06:59:38,449 INFO [train.py:996] (0/4) Epoch 10, batch 6050, loss[loss=0.2117, simple_loss=0.2757, pruned_loss=0.07386, over 15337.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2919, pruned_loss=0.07049, over 4261431.43 frames. ], batch size: 60, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:59:50,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1683012.0, ans=0.2 2023-06-24 07:00:00,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=22.5 2023-06-24 07:00:02,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1683072.0, ans=0.0 2023-06-24 07:00:26,526 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.624e+02 5.765e+02 7.695e+02 1.085e+03 2.275e+03, threshold=1.539e+03, percent-clipped=10.0 2023-06-24 07:01:17,087 INFO [train.py:996] (0/4) Epoch 10, batch 6100, loss[loss=0.2195, simple_loss=0.2907, pruned_loss=0.07417, over 21788.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2924, pruned_loss=0.06913, over 4261672.80 frames. ], batch size: 247, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:01:49,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-24 07:01:51,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1683432.0, ans=0.025 2023-06-24 07:02:47,483 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:02:59,567 INFO [train.py:996] (0/4) Epoch 10, batch 6150, loss[loss=0.253, simple_loss=0.3241, pruned_loss=0.09098, over 21513.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2955, pruned_loss=0.07124, over 4263160.85 frames. 
], batch size: 473, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:03:06,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1683612.0, ans=0.125 2023-06-24 07:03:33,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1683732.0, ans=0.125 2023-06-24 07:03:33,510 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:03:51,703 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.961e+02 6.743e+02 9.296e+02 1.382e+03 3.230e+03, threshold=1.859e+03, percent-clipped=18.0 2023-06-24 07:04:09,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1683792.0, ans=0.125 2023-06-24 07:04:11,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1683852.0, ans=0.1 2023-06-24 07:04:16,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1683852.0, ans=0.1 2023-06-24 07:04:35,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1683852.0, ans=0.125 2023-06-24 07:04:38,177 INFO [train.py:996] (0/4) Epoch 10, batch 6200, loss[loss=0.2765, simple_loss=0.3351, pruned_loss=0.1089, over 21451.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3001, pruned_loss=0.07206, over 4256939.41 frames. ], batch size: 131, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:05:05,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1683972.0, ans=0.2 2023-06-24 07:05:41,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=12.0 2023-06-24 07:06:17,946 INFO [train.py:996] (0/4) Epoch 10, batch 6250, loss[loss=0.213, simple_loss=0.311, pruned_loss=0.05751, over 21657.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.307, pruned_loss=0.07312, over 4254504.71 frames. 
], batch size: 247, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:06:27,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1684212.0, ans=0.0 2023-06-24 07:06:32,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1684272.0, ans=0.125 2023-06-24 07:06:36,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1684272.0, ans=0.125 2023-06-24 07:06:48,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1684272.0, ans=0.1 2023-06-24 07:07:00,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1684332.0, ans=0.0 2023-06-24 07:07:09,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.397e+02 7.386e+02 1.187e+03 1.704e+03 4.027e+03, threshold=2.375e+03, percent-clipped=21.0 2023-06-24 07:07:15,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1684392.0, ans=0.07 2023-06-24 07:07:56,069 INFO [train.py:996] (0/4) Epoch 10, batch 6300, loss[loss=0.21, simple_loss=0.3255, pruned_loss=0.04728, over 19803.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3102, pruned_loss=0.07209, over 4256027.63 frames. ], batch size: 703, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:07:58,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1684512.0, ans=0.02 2023-06-24 07:09:34,405 INFO [train.py:996] (0/4) Epoch 10, batch 6350, loss[loss=0.2709, simple_loss=0.3367, pruned_loss=0.1025, over 21288.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3121, pruned_loss=0.07427, over 4261818.39 frames. ], batch size: 143, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:09:36,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1684812.0, ans=0.1 2023-06-24 07:10:06,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1684872.0, ans=0.125 2023-06-24 07:10:15,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1684932.0, ans=0.1 2023-06-24 07:10:26,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1684932.0, ans=0.1 2023-06-24 07:10:27,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.068e+02 5.999e+02 8.518e+02 1.321e+03 2.305e+03, threshold=1.704e+03, percent-clipped=0.0 2023-06-24 07:11:14,614 INFO [train.py:996] (0/4) Epoch 10, batch 6400, loss[loss=0.2436, simple_loss=0.3159, pruned_loss=0.08569, over 21520.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3159, pruned_loss=0.07862, over 4263013.12 frames. 
], batch size: 194, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:11:39,758 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:11:46,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1685172.0, ans=0.2 2023-06-24 07:12:17,202 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.62 vs. limit=15.0 2023-06-24 07:12:18,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1685232.0, ans=0.125 2023-06-24 07:12:32,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1685292.0, ans=0.125 2023-06-24 07:12:58,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1685412.0, ans=0.0 2023-06-24 07:12:59,604 INFO [train.py:996] (0/4) Epoch 10, batch 6450, loss[loss=0.2151, simple_loss=0.3029, pruned_loss=0.06361, over 21771.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3195, pruned_loss=0.07935, over 4267834.90 frames. ], batch size: 316, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:13:13,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=15.0 2023-06-24 07:13:56,608 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.370e+02 9.137e+02 1.216e+03 1.629e+03 2.950e+03, threshold=2.432e+03, percent-clipped=21.0 2023-06-24 07:13:57,783 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-24 07:14:00,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. limit=10.0 2023-06-24 07:14:16,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1685652.0, ans=0.125 2023-06-24 07:14:41,256 INFO [train.py:996] (0/4) Epoch 10, batch 6500, loss[loss=0.2258, simple_loss=0.2842, pruned_loss=0.08372, over 21592.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3125, pruned_loss=0.07827, over 4275195.09 frames. ], batch size: 415, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:15:47,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1685892.0, ans=0.1 2023-06-24 07:16:19,613 INFO [train.py:996] (0/4) Epoch 10, batch 6550, loss[loss=0.2283, simple_loss=0.2993, pruned_loss=0.07863, over 21411.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3099, pruned_loss=0.07711, over 4276967.61 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:16:27,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1686012.0, ans=0.07 2023-06-24 07:16:52,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.75 vs. 
limit=15.0 2023-06-24 07:17:18,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.101e+02 5.692e+02 8.877e+02 1.428e+03 2.273e+03, threshold=1.775e+03, percent-clipped=0.0 2023-06-24 07:17:32,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1686192.0, ans=0.125 2023-06-24 07:18:03,373 INFO [train.py:996] (0/4) Epoch 10, batch 6600, loss[loss=0.2422, simple_loss=0.2879, pruned_loss=0.09828, over 21405.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3045, pruned_loss=0.07703, over 4270896.39 frames. ], batch size: 508, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:18:13,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1686312.0, ans=0.0 2023-06-24 07:18:23,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1686372.0, ans=0.0 2023-06-24 07:18:44,593 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-24 07:18:51,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1686432.0, ans=0.1 2023-06-24 07:18:59,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1686492.0, ans=0.125 2023-06-24 07:19:22,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1686552.0, ans=0.0 2023-06-24 07:19:28,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-24 07:19:36,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1686612.0, ans=0.125 2023-06-24 07:19:37,660 INFO [train.py:996] (0/4) Epoch 10, batch 6650, loss[loss=0.1963, simple_loss=0.2697, pruned_loss=0.06147, over 21654.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2972, pruned_loss=0.07429, over 4270241.93 frames. ], batch size: 298, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:19:44,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1686612.0, ans=0.125 2023-06-24 07:19:56,162 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-24 07:20:31,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.031e+02 5.733e+02 1.031e+03 1.471e+03 3.342e+03, threshold=2.062e+03, percent-clipped=12.0 2023-06-24 07:20:35,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1686792.0, ans=0.0 2023-06-24 07:20:53,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1686792.0, ans=0.125 2023-06-24 07:21:04,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1686852.0, ans=0.0 2023-06-24 07:21:15,688 INFO [train.py:996] (0/4) Epoch 10, batch 6700, loss[loss=0.18, simple_loss=0.2481, pruned_loss=0.05595, over 21833.00 frames. 
], tot_loss[loss=0.221, simple_loss=0.2935, pruned_loss=0.07424, over 4263311.82 frames. ], batch size: 98, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:21:28,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1686912.0, ans=0.125 2023-06-24 07:22:02,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1687032.0, ans=0.125 2023-06-24 07:22:16,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1687092.0, ans=0.125 2023-06-24 07:22:53,763 INFO [train.py:996] (0/4) Epoch 10, batch 6750, loss[loss=0.2192, simple_loss=0.2902, pruned_loss=0.07414, over 21782.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2911, pruned_loss=0.07493, over 4261943.64 frames. ], batch size: 332, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:23:06,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1687212.0, ans=0.04949747468305833 2023-06-24 07:23:09,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1687272.0, ans=0.125 2023-06-24 07:23:47,790 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.403e+02 6.480e+02 8.146e+02 1.101e+03 1.861e+03, threshold=1.629e+03, percent-clipped=0.0 2023-06-24 07:24:32,369 INFO [train.py:996] (0/4) Epoch 10, batch 6800, loss[loss=0.2374, simple_loss=0.2964, pruned_loss=0.08924, over 21422.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2928, pruned_loss=0.0774, over 4259768.62 frames. ], batch size: 194, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:25:06,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1687632.0, ans=0.1 2023-06-24 07:25:30,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.58 vs. limit=5.0 2023-06-24 07:25:41,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1687692.0, ans=0.0 2023-06-24 07:26:03,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1687752.0, ans=0.125 2023-06-24 07:26:09,729 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:26:10,781 INFO [train.py:996] (0/4) Epoch 10, batch 6850, loss[loss=0.2535, simple_loss=0.3102, pruned_loss=0.09841, over 21734.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2926, pruned_loss=0.07856, over 4268499.23 frames. 
], batch size: 441, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:26:45,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1687872.0, ans=0.125 2023-06-24 07:27:08,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.455e+02 6.000e+02 9.900e+02 1.479e+03 3.025e+03, threshold=1.980e+03, percent-clipped=16.0 2023-06-24 07:27:08,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1687992.0, ans=10.0 2023-06-24 07:27:08,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1687992.0, ans=0.125 2023-06-24 07:27:34,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1688052.0, ans=0.0 2023-06-24 07:27:51,244 INFO [train.py:996] (0/4) Epoch 10, batch 6900, loss[loss=0.2045, simple_loss=0.2732, pruned_loss=0.06792, over 21276.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2942, pruned_loss=0.07838, over 4268467.30 frames. ], batch size: 143, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:27:58,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1688112.0, ans=0.2 2023-06-24 07:28:00,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1688112.0, ans=0.125 2023-06-24 07:28:24,157 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:28:46,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1688232.0, ans=0.0 2023-06-24 07:28:50,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1688292.0, ans=0.125 2023-06-24 07:29:02,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1688292.0, ans=0.125 2023-06-24 07:29:03,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1688292.0, ans=0.0 2023-06-24 07:29:33,004 INFO [train.py:996] (0/4) Epoch 10, batch 6950, loss[loss=0.2608, simple_loss=0.343, pruned_loss=0.08928, over 21493.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2957, pruned_loss=0.07439, over 4272782.53 frames. ], batch size: 131, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:29:35,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=12.0 2023-06-24 07:29:52,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1688472.0, ans=0.0 2023-06-24 07:30:15,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1688532.0, ans=0.0 2023-06-24 07:30:18,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. 
limit=15.0 2023-06-24 07:30:34,219 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.410e+02 6.339e+02 9.938e+02 1.554e+03 2.681e+03, threshold=1.988e+03, percent-clipped=10.0 2023-06-24 07:31:09,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-24 07:31:11,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1688712.0, ans=0.125 2023-06-24 07:31:11,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1688712.0, ans=0.1 2023-06-24 07:31:12,643 INFO [train.py:996] (0/4) Epoch 10, batch 7000, loss[loss=0.2259, simple_loss=0.2867, pruned_loss=0.08255, over 21750.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3, pruned_loss=0.07769, over 4275503.46 frames. ], batch size: 317, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:31:29,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1688712.0, ans=0.125 2023-06-24 07:31:53,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1688832.0, ans=0.0 2023-06-24 07:32:27,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1688892.0, ans=0.125 2023-06-24 07:32:45,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1688952.0, ans=0.125 2023-06-24 07:32:52,146 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-24 07:32:52,795 INFO [train.py:996] (0/4) Epoch 10, batch 7050, loss[loss=0.2305, simple_loss=0.3216, pruned_loss=0.06973, over 21598.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2976, pruned_loss=0.07606, over 4261976.76 frames. ], batch size: 389, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:33:05,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.83 vs. limit=22.5 2023-06-24 07:34:00,085 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.677e+02 8.213e+02 1.284e+03 1.969e+03 3.755e+03, threshold=2.569e+03, percent-clipped=21.0 2023-06-24 07:34:00,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1689192.0, ans=0.125 2023-06-24 07:34:04,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1689192.0, ans=0.125 2023-06-24 07:34:24,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1689252.0, ans=0.125 2023-06-24 07:34:43,748 INFO [train.py:996] (0/4) Epoch 10, batch 7100, loss[loss=0.1711, simple_loss=0.2436, pruned_loss=0.04931, over 21137.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3011, pruned_loss=0.07608, over 4258830.33 frames. 
], batch size: 143, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:34:47,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1689312.0, ans=0.125 2023-06-24 07:35:34,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1689432.0, ans=0.0 2023-06-24 07:36:23,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1689612.0, ans=0.125 2023-06-24 07:36:24,353 INFO [train.py:996] (0/4) Epoch 10, batch 7150, loss[loss=0.2976, simple_loss=0.3586, pruned_loss=0.1183, over 21292.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2992, pruned_loss=0.07483, over 4264367.25 frames. ], batch size: 507, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:37:05,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1689732.0, ans=0.0 2023-06-24 07:37:20,925 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.107e+02 6.261e+02 9.228e+02 1.360e+03 3.235e+03, threshold=1.846e+03, percent-clipped=6.0 2023-06-24 07:37:22,904 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:38:04,278 INFO [train.py:996] (0/4) Epoch 10, batch 7200, loss[loss=0.2274, simple_loss=0.2914, pruned_loss=0.08174, over 21775.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3014, pruned_loss=0.0767, over 4271235.47 frames. ], batch size: 124, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:38:43,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-24 07:39:13,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1690092.0, ans=0.1 2023-06-24 07:39:44,058 INFO [train.py:996] (0/4) Epoch 10, batch 7250, loss[loss=0.2349, simple_loss=0.2954, pruned_loss=0.08722, over 21635.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2969, pruned_loss=0.0768, over 4271172.53 frames. ], batch size: 393, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:40:17,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1690272.0, ans=0.125 2023-06-24 07:40:18,565 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.71 vs. limit=10.0 2023-06-24 07:40:22,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1690272.0, ans=0.0 2023-06-24 07:40:24,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1690332.0, ans=0.125 2023-06-24 07:40:45,634 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.252e+02 6.130e+02 8.584e+02 1.247e+03 2.821e+03, threshold=1.717e+03, percent-clipped=3.0 2023-06-24 07:41:15,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1690452.0, ans=0.125 2023-06-24 07:41:22,789 INFO [train.py:996] (0/4) Epoch 10, batch 7300, loss[loss=0.2678, simple_loss=0.3822, pruned_loss=0.0767, over 19731.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.291, pruned_loss=0.07517, over 4261491.50 frames. 
], batch size: 703, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:41:23,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1690512.0, ans=0.0 2023-06-24 07:42:09,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1690632.0, ans=0.125 2023-06-24 07:42:27,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1690692.0, ans=0.125 2023-06-24 07:43:08,174 INFO [train.py:996] (0/4) Epoch 10, batch 7350, loss[loss=0.2163, simple_loss=0.2837, pruned_loss=0.07445, over 21686.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2909, pruned_loss=0.07675, over 4261641.98 frames. ], batch size: 247, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:43:34,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1690872.0, ans=0.125 2023-06-24 07:43:40,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1690872.0, ans=0.1 2023-06-24 07:44:05,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1690992.0, ans=0.125 2023-06-24 07:44:06,757 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.919e+02 7.422e+02 1.084e+03 1.496e+03 4.269e+03, threshold=2.168e+03, percent-clipped=20.0 2023-06-24 07:44:22,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1690992.0, ans=0.2 2023-06-24 07:44:48,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1691112.0, ans=0.04949747468305833 2023-06-24 07:44:49,776 INFO [train.py:996] (0/4) Epoch 10, batch 7400, loss[loss=0.2414, simple_loss=0.3406, pruned_loss=0.07112, over 21549.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2976, pruned_loss=0.07937, over 4268339.29 frames. ], batch size: 473, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:44:55,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.51 vs. limit=10.0 2023-06-24 07:46:00,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=22.5 2023-06-24 07:46:02,297 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=15.0 2023-06-24 07:46:19,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1691352.0, ans=0.2 2023-06-24 07:46:29,142 INFO [train.py:996] (0/4) Epoch 10, batch 7450, loss[loss=0.2064, simple_loss=0.2663, pruned_loss=0.07321, over 21565.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2961, pruned_loss=0.07779, over 4262264.33 frames. 
], batch size: 213, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:47:32,874 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.371e+02 6.014e+02 8.284e+02 1.438e+03 2.557e+03, threshold=1.657e+03, percent-clipped=4.0 2023-06-24 07:47:51,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-24 07:47:53,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1691652.0, ans=0.0 2023-06-24 07:48:08,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.92 vs. limit=6.0 2023-06-24 07:48:15,244 INFO [train.py:996] (0/4) Epoch 10, batch 7500, loss[loss=0.2668, simple_loss=0.3447, pruned_loss=0.09445, over 21213.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3007, pruned_loss=0.07846, over 4262937.94 frames. ], batch size: 143, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:48:53,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=8.0 2023-06-24 07:49:00,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1691832.0, ans=15.0 2023-06-24 07:49:45,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1691952.0, ans=0.025 2023-06-24 07:49:54,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-24 07:49:56,512 INFO [train.py:996] (0/4) Epoch 10, batch 7550, loss[loss=0.1914, simple_loss=0.2489, pruned_loss=0.06698, over 20334.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3082, pruned_loss=0.07798, over 4263862.87 frames. ], batch size: 703, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:50:37,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1692132.0, ans=0.0 2023-06-24 07:50:44,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2023-06-24 07:50:45,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1692132.0, ans=0.125 2023-06-24 07:50:46,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1692132.0, ans=0.125 2023-06-24 07:50:51,733 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.32 vs. limit=10.0 2023-06-24 07:50:52,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.375e+02 7.111e+02 1.164e+03 1.789e+03 2.953e+03, threshold=2.328e+03, percent-clipped=32.0 2023-06-24 07:51:16,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-24 07:51:34,169 INFO [train.py:996] (0/4) Epoch 10, batch 7600, loss[loss=0.2458, simple_loss=0.3158, pruned_loss=0.08791, over 21899.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3069, pruned_loss=0.07672, over 4269885.62 frames. 
], batch size: 332, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 07:51:40,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1692312.0, ans=0.025 2023-06-24 07:52:38,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1692492.0, ans=0.125 2023-06-24 07:52:54,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1692552.0, ans=0.0 2023-06-24 07:53:06,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1692552.0, ans=0.125 2023-06-24 07:53:09,055 INFO [train.py:996] (0/4) Epoch 10, batch 7650, loss[loss=0.232, simple_loss=0.2981, pruned_loss=0.08295, over 21461.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3071, pruned_loss=0.07907, over 4282555.80 frames. ], batch size: 131, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 07:53:28,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1692612.0, ans=0.1 2023-06-24 07:53:31,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1692612.0, ans=0.0 2023-06-24 07:53:43,141 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=22.5 2023-06-24 07:54:00,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1692732.0, ans=0.125 2023-06-24 07:54:11,429 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.476e+02 5.649e+02 6.864e+02 9.584e+02 1.979e+03, threshold=1.373e+03, percent-clipped=0.0 2023-06-24 07:54:20,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1692792.0, ans=0.0 2023-06-24 07:54:25,859 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=22.5 2023-06-24 07:54:29,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-24 07:54:30,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1692792.0, ans=0.0 2023-06-24 07:54:48,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1692852.0, ans=0.0 2023-06-24 07:54:58,723 INFO [train.py:996] (0/4) Epoch 10, batch 7700, loss[loss=0.2686, simple_loss=0.336, pruned_loss=0.1006, over 21309.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3097, pruned_loss=0.08186, over 4284944.59 frames. ], batch size: 159, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 07:56:01,258 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. 
limit=15.0 2023-06-24 07:56:02,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1693092.0, ans=0.125 2023-06-24 07:56:31,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1693152.0, ans=0.125 2023-06-24 07:56:40,759 INFO [train.py:996] (0/4) Epoch 10, batch 7750, loss[loss=0.2879, simple_loss=0.3885, pruned_loss=0.09368, over 21782.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3164, pruned_loss=0.08259, over 4285142.64 frames. ], batch size: 332, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 07:56:48,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.69 vs. limit=15.0 2023-06-24 07:57:25,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1693332.0, ans=0.0 2023-06-24 07:57:35,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1693332.0, ans=0.0 2023-06-24 07:57:38,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1693392.0, ans=0.2 2023-06-24 07:57:42,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.643e+02 8.175e+02 1.275e+03 1.822e+03 5.282e+03, threshold=2.550e+03, percent-clipped=41.0 2023-06-24 07:57:54,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1693392.0, ans=0.2 2023-06-24 07:58:21,732 INFO [train.py:996] (0/4) Epoch 10, batch 7800, loss[loss=0.1851, simple_loss=0.2445, pruned_loss=0.06287, over 20802.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3213, pruned_loss=0.08363, over 4281227.73 frames. ], batch size: 609, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 07:58:24,202 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-24 07:58:33,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1693512.0, ans=0.5 2023-06-24 07:58:33,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1693512.0, ans=0.125 2023-06-24 07:59:25,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1693692.0, ans=0.125 2023-06-24 07:59:42,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1693692.0, ans=0.1 2023-06-24 07:59:50,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1693752.0, ans=0.04949747468305833 2023-06-24 08:00:00,782 INFO [train.py:996] (0/4) Epoch 10, batch 7850, loss[loss=0.2261, simple_loss=0.2932, pruned_loss=0.07952, over 21972.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3147, pruned_loss=0.08271, over 4264812.43 frames. 
], batch size: 113, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 08:00:31,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1693872.0, ans=0.125 2023-06-24 08:01:02,402 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.400e+02 8.758e+02 1.300e+03 4.376e+03, threshold=1.752e+03, percent-clipped=3.0 2023-06-24 08:01:18,136 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-24 08:01:18,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1694052.0, ans=0.1 2023-06-24 08:01:41,737 INFO [train.py:996] (0/4) Epoch 10, batch 7900, loss[loss=0.2948, simple_loss=0.3844, pruned_loss=0.1026, over 21478.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3094, pruned_loss=0.0818, over 4267926.46 frames. ], batch size: 471, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 08:02:16,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1694172.0, ans=0.125 2023-06-24 08:02:19,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1694172.0, ans=0.0 2023-06-24 08:02:43,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1694232.0, ans=0.125 2023-06-24 08:03:29,370 INFO [train.py:996] (0/4) Epoch 10, batch 7950, loss[loss=0.2665, simple_loss=0.3232, pruned_loss=0.1048, over 20121.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3119, pruned_loss=0.08092, over 4260745.61 frames. ], batch size: 705, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 08:03:42,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1694412.0, ans=0.125 2023-06-24 08:03:50,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1694472.0, ans=0.125 2023-06-24 08:04:36,165 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.376e+02 7.152e+02 9.880e+02 1.480e+03 4.841e+03, threshold=1.976e+03, percent-clipped=16.0 2023-06-24 08:05:16,266 INFO [train.py:996] (0/4) Epoch 10, batch 8000, loss[loss=0.2502, simple_loss=0.3286, pruned_loss=0.0859, over 17181.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3168, pruned_loss=0.08324, over 4258220.95 frames. 
], batch size: 60, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:05:18,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1694712.0, ans=0.125 2023-06-24 08:05:22,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1694712.0, ans=0.0 2023-06-24 08:05:42,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1694772.0, ans=0.125 2023-06-24 08:05:58,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1694832.0, ans=0.125 2023-06-24 08:06:09,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1694832.0, ans=0.0 2023-06-24 08:06:39,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1694892.0, ans=0.125 2023-06-24 08:07:03,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1694952.0, ans=0.2 2023-06-24 08:07:06,346 INFO [train.py:996] (0/4) Epoch 10, batch 8050, loss[loss=0.3523, simple_loss=0.4233, pruned_loss=0.1406, over 21455.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3231, pruned_loss=0.08366, over 4261469.48 frames. ], batch size: 507, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:07:08,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1695012.0, ans=0.125 2023-06-24 08:07:27,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1695072.0, ans=0.125 2023-06-24 08:07:29,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1695072.0, ans=0.125 2023-06-24 08:07:45,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1695132.0, ans=0.0 2023-06-24 08:08:07,473 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.312e+02 7.120e+02 8.807e+02 1.513e+03 2.630e+03, threshold=1.761e+03, percent-clipped=8.0 2023-06-24 08:08:36,501 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.64 vs. limit=15.0 2023-06-24 08:08:37,960 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.31 vs. limit=22.5 2023-06-24 08:08:46,530 INFO [train.py:996] (0/4) Epoch 10, batch 8100, loss[loss=0.2203, simple_loss=0.2924, pruned_loss=0.07407, over 21859.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3206, pruned_loss=0.08407, over 4263537.09 frames. 
], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:09:04,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1695312.0, ans=0.125 2023-06-24 08:09:52,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1695492.0, ans=0.125 2023-06-24 08:10:23,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1695552.0, ans=0.125 2023-06-24 08:10:36,967 INFO [train.py:996] (0/4) Epoch 10, batch 8150, loss[loss=0.2678, simple_loss=0.3708, pruned_loss=0.08239, over 21570.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3279, pruned_loss=0.08466, over 4262425.65 frames. ], batch size: 389, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:10:51,894 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.96 vs. limit=22.5 2023-06-24 08:11:43,628 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.039e+02 7.483e+02 1.136e+03 1.755e+03 5.961e+03, threshold=2.271e+03, percent-clipped=24.0 2023-06-24 08:12:06,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1695852.0, ans=0.125 2023-06-24 08:12:17,697 INFO [train.py:996] (0/4) Epoch 10, batch 8200, loss[loss=0.192, simple_loss=0.2489, pruned_loss=0.06759, over 20738.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.32, pruned_loss=0.08252, over 4263870.97 frames. ], batch size: 609, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:13:31,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1696092.0, ans=0.1 2023-06-24 08:13:32,841 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:13:38,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1696092.0, ans=0.125 2023-06-24 08:13:57,374 INFO [train.py:996] (0/4) Epoch 10, batch 8250, loss[loss=0.2984, simple_loss=0.3812, pruned_loss=0.1078, over 21502.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3186, pruned_loss=0.0831, over 4261681.16 frames. ], batch size: 471, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:14:57,371 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-24 08:15:04,597 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.899e+02 6.467e+02 8.535e+02 1.267e+03 3.280e+03, threshold=1.707e+03, percent-clipped=4.0 2023-06-24 08:15:32,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1696452.0, ans=0.0 2023-06-24 08:15:38,171 INFO [train.py:996] (0/4) Epoch 10, batch 8300, loss[loss=0.263, simple_loss=0.3478, pruned_loss=0.08906, over 21576.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3153, pruned_loss=0.07948, over 4265737.99 frames. 
], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:15:48,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1696512.0, ans=0.1 2023-06-24 08:15:58,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1696572.0, ans=0.04949747468305833 2023-06-24 08:16:34,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-24 08:16:36,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1696632.0, ans=0.125 2023-06-24 08:16:40,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=15.0 2023-06-24 08:16:58,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1696692.0, ans=0.025 2023-06-24 08:17:10,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1696752.0, ans=0.1 2023-06-24 08:17:18,879 INFO [train.py:996] (0/4) Epoch 10, batch 8350, loss[loss=0.2361, simple_loss=0.3149, pruned_loss=0.07871, over 21638.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3147, pruned_loss=0.07732, over 4264773.58 frames. ], batch size: 263, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:17:45,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1696872.0, ans=0.07 2023-06-24 08:17:46,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1696872.0, ans=0.0 2023-06-24 08:17:49,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1696872.0, ans=0.125 2023-06-24 08:18:06,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1696932.0, ans=0.125 2023-06-24 08:18:25,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1696992.0, ans=0.1 2023-06-24 08:18:29,883 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.321e+02 6.071e+02 7.237e+02 1.103e+03 3.221e+03, threshold=1.447e+03, percent-clipped=5.0 2023-06-24 08:18:38,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1696992.0, ans=0.125 2023-06-24 08:18:59,505 INFO [train.py:996] (0/4) Epoch 10, batch 8400, loss[loss=0.1838, simple_loss=0.2748, pruned_loss=0.04639, over 21405.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.311, pruned_loss=0.07477, over 4258935.12 frames. 
], batch size: 194, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 08:19:22,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1697112.0, ans=0.125 2023-06-24 08:19:53,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1697232.0, ans=0.1 2023-06-24 08:20:17,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1697292.0, ans=0.2 2023-06-24 08:20:28,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1697352.0, ans=0.0 2023-06-24 08:20:28,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1697352.0, ans=0.125 2023-06-24 08:20:39,152 INFO [train.py:996] (0/4) Epoch 10, batch 8450, loss[loss=0.2551, simple_loss=0.3196, pruned_loss=0.09529, over 21855.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3095, pruned_loss=0.07356, over 4258646.93 frames. ], batch size: 351, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:21:05,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1697472.0, ans=0.035 2023-06-24 08:21:09,067 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.31 vs. limit=6.0 2023-06-24 08:21:35,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1697532.0, ans=0.125 2023-06-24 08:21:51,746 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.782e+02 9.053e+02 1.298e+03 1.951e+03 3.847e+03, threshold=2.596e+03, percent-clipped=39.0 2023-06-24 08:22:10,468 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-24 08:22:11,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1697652.0, ans=0.1 2023-06-24 08:22:23,937 INFO [train.py:996] (0/4) Epoch 10, batch 8500, loss[loss=0.2651, simple_loss=0.3197, pruned_loss=0.1052, over 14995.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3066, pruned_loss=0.07488, over 4255202.55 frames. ], batch size: 60, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:22:46,501 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.81 vs. limit=15.0 2023-06-24 08:23:43,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1697952.0, ans=0.0 2023-06-24 08:24:03,946 INFO [train.py:996] (0/4) Epoch 10, batch 8550, loss[loss=0.2286, simple_loss=0.3104, pruned_loss=0.07342, over 21364.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3109, pruned_loss=0.07811, over 4266699.00 frames. 
], batch size: 176, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:24:37,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1698072.0, ans=0.2 2023-06-24 08:25:09,878 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.596e+02 7.231e+02 1.146e+03 1.740e+03 4.216e+03, threshold=2.291e+03, percent-clipped=13.0 2023-06-24 08:25:16,686 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=12.0 2023-06-24 08:25:16,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.61 vs. limit=10.0 2023-06-24 08:25:48,537 INFO [train.py:996] (0/4) Epoch 10, batch 8600, loss[loss=0.2588, simple_loss=0.3393, pruned_loss=0.08912, over 21618.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3156, pruned_loss=0.07966, over 4272387.54 frames. ], batch size: 389, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:25:55,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1698312.0, ans=0.1 2023-06-24 08:26:11,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1698372.0, ans=0.2 2023-06-24 08:26:38,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1698432.0, ans=0.125 2023-06-24 08:27:11,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1698552.0, ans=0.0 2023-06-24 08:27:28,181 INFO [train.py:996] (0/4) Epoch 10, batch 8650, loss[loss=0.2334, simple_loss=0.3404, pruned_loss=0.06321, over 21650.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.321, pruned_loss=0.08098, over 4271764.85 frames. ], batch size: 414, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:27:58,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1698672.0, ans=0.125 2023-06-24 08:28:07,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1698732.0, ans=0.1 2023-06-24 08:28:28,427 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.818e+02 6.800e+02 1.041e+03 1.467e+03 2.492e+03, threshold=2.082e+03, percent-clipped=1.0 2023-06-24 08:28:31,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1698792.0, ans=0.1 2023-06-24 08:28:59,767 INFO [train.py:996] (0/4) Epoch 10, batch 8700, loss[loss=0.1969, simple_loss=0.2586, pruned_loss=0.06754, over 21664.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3123, pruned_loss=0.07661, over 4268495.86 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:29:12,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1698912.0, ans=0.125 2023-06-24 08:29:24,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1698912.0, ans=0.125 2023-06-24 08:30:47,007 INFO [train.py:996] (0/4) Epoch 10, batch 8750, loss[loss=0.301, simple_loss=0.3455, pruned_loss=0.1283, over 21693.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3069, pruned_loss=0.07705, over 4263833.68 frames. 
], batch size: 473, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:31:28,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.35 vs. limit=15.0 2023-06-24 08:31:29,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1699332.0, ans=0.1 2023-06-24 08:31:32,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1699332.0, ans=0.125 2023-06-24 08:31:50,107 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 7.163e+02 1.016e+03 1.538e+03 3.044e+03, threshold=2.032e+03, percent-clipped=7.0 2023-06-24 08:32:32,749 INFO [train.py:996] (0/4) Epoch 10, batch 8800, loss[loss=0.2783, simple_loss=0.3562, pruned_loss=0.1002, over 21719.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3159, pruned_loss=0.08013, over 4273984.83 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 08:33:22,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1699632.0, ans=10.0 2023-06-24 08:33:48,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1699692.0, ans=0.015 2023-06-24 08:34:15,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1699752.0, ans=0.125 2023-06-24 08:34:18,609 INFO [train.py:996] (0/4) Epoch 10, batch 8850, loss[loss=0.2441, simple_loss=0.3288, pruned_loss=0.07968, over 21285.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3221, pruned_loss=0.0824, over 4275896.47 frames. ], batch size: 159, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:34:32,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1699872.0, ans=0.125 2023-06-24 08:34:32,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1699872.0, ans=0.125 2023-06-24 08:35:00,500 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-24 08:35:01,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1699932.0, ans=0.1 2023-06-24 08:35:17,536 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.300e+02 6.165e+02 8.058e+02 1.044e+03 1.938e+03, threshold=1.612e+03, percent-clipped=0.0 2023-06-24 08:35:59,461 INFO [train.py:996] (0/4) Epoch 10, batch 8900, loss[loss=0.2601, simple_loss=0.3217, pruned_loss=0.09926, over 21569.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3171, pruned_loss=0.08177, over 4275284.12 frames. 
], batch size: 414, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:36:17,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1700172.0, ans=0.0 2023-06-24 08:36:25,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1700172.0, ans=0.0 2023-06-24 08:36:58,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1700232.0, ans=0.1 2023-06-24 08:37:35,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1700352.0, ans=0.1 2023-06-24 08:37:43,135 INFO [train.py:996] (0/4) Epoch 10, batch 8950, loss[loss=0.2652, simple_loss=0.3392, pruned_loss=0.09559, over 21702.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3227, pruned_loss=0.08187, over 4265608.66 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:38:22,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1700472.0, ans=0.1 2023-06-24 08:38:46,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1700592.0, ans=0.0 2023-06-24 08:38:56,967 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.364e+02 1.062e+03 1.597e+03 2.316e+03 4.236e+03, threshold=3.193e+03, percent-clipped=50.0 2023-06-24 08:39:22,940 INFO [train.py:996] (0/4) Epoch 10, batch 9000, loss[loss=0.1858, simple_loss=0.2557, pruned_loss=0.05799, over 21870.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3156, pruned_loss=0.08092, over 4265099.32 frames. ], batch size: 118, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:39:22,941 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 08:39:39,587 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2679, simple_loss=0.3599, pruned_loss=0.08793, over 1796401.00 frames. 2023-06-24 08:39:39,588 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 08:39:53,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1700712.0, ans=0.1 2023-06-24 08:40:08,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1700772.0, ans=0.0 2023-06-24 08:41:20,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1701012.0, ans=0.125 2023-06-24 08:41:21,506 INFO [train.py:996] (0/4) Epoch 10, batch 9050, loss[loss=0.217, simple_loss=0.3005, pruned_loss=0.06674, over 21688.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3076, pruned_loss=0.07737, over 4256858.76 frames. ], batch size: 351, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:41:37,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2023-06-24 08:41:41,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. 
limit=22.5 2023-06-24 08:41:53,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1701072.0, ans=0.0 2023-06-24 08:42:37,328 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.440e+02 8.915e+02 1.381e+03 2.151e+03 3.467e+03, threshold=2.763e+03, percent-clipped=3.0 2023-06-24 08:43:08,696 INFO [train.py:996] (0/4) Epoch 10, batch 9100, loss[loss=0.3003, simple_loss=0.3655, pruned_loss=0.1175, over 21329.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3138, pruned_loss=0.08083, over 4262863.50 frames. ], batch size: 507, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:43:14,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1701312.0, ans=0.125 2023-06-24 08:43:54,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1701432.0, ans=0.2 2023-06-24 08:44:49,851 INFO [train.py:996] (0/4) Epoch 10, batch 9150, loss[loss=0.2205, simple_loss=0.302, pruned_loss=0.06952, over 21414.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3165, pruned_loss=0.078, over 4264999.36 frames. ], batch size: 160, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:45:01,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1701612.0, ans=0.125 2023-06-24 08:45:43,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1701732.0, ans=0.2 2023-06-24 08:45:53,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1701792.0, ans=0.5 2023-06-24 08:45:59,086 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.914e+02 7.155e+02 1.028e+03 1.622e+03 3.048e+03, threshold=2.056e+03, percent-clipped=2.0 2023-06-24 08:46:04,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1701792.0, ans=0.125 2023-06-24 08:46:29,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1701852.0, ans=0.0 2023-06-24 08:46:40,275 INFO [train.py:996] (0/4) Epoch 10, batch 9200, loss[loss=0.2557, simple_loss=0.3449, pruned_loss=0.08328, over 21287.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3187, pruned_loss=0.07731, over 4274200.99 frames. ], batch size: 548, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 08:46:40,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1701912.0, ans=0.125 2023-06-24 08:47:40,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1702092.0, ans=0.1 2023-06-24 08:47:44,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1702092.0, ans=0.125 2023-06-24 08:48:20,290 INFO [train.py:996] (0/4) Epoch 10, batch 9250, loss[loss=0.2363, simple_loss=0.3003, pruned_loss=0.08619, over 21561.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.323, pruned_loss=0.08058, over 4272146.96 frames. 
], batch size: 391, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:49:04,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1702332.0, ans=0.125 2023-06-24 08:49:20,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1702392.0, ans=0.125 2023-06-24 08:49:21,336 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.372e+02 7.455e+02 9.438e+02 1.547e+03 2.905e+03, threshold=1.888e+03, percent-clipped=9.0 2023-06-24 08:50:06,168 INFO [train.py:996] (0/4) Epoch 10, batch 9300, loss[loss=0.2281, simple_loss=0.3201, pruned_loss=0.0681, over 21416.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3157, pruned_loss=0.08057, over 4265486.44 frames. ], batch size: 211, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:50:23,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1702572.0, ans=0.1 2023-06-24 08:50:38,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1702632.0, ans=0.125 2023-06-24 08:50:46,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1702632.0, ans=0.0 2023-06-24 08:51:30,854 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:51:38,508 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.40 vs. limit=10.0 2023-06-24 08:51:39,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1702752.0, ans=0.125 2023-06-24 08:51:43,757 INFO [train.py:996] (0/4) Epoch 10, batch 9350, loss[loss=0.2676, simple_loss=0.3441, pruned_loss=0.09558, over 21861.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3227, pruned_loss=0.08105, over 4264758.70 frames. ], batch size: 371, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:52:14,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.70 vs. limit=10.0 2023-06-24 08:53:01,217 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.560e+02 6.681e+02 9.424e+02 1.664e+03 4.543e+03, threshold=1.885e+03, percent-clipped=14.0 2023-06-24 08:53:09,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1703052.0, ans=0.125 2023-06-24 08:53:25,361 INFO [train.py:996] (0/4) Epoch 10, batch 9400, loss[loss=0.246, simple_loss=0.3011, pruned_loss=0.09547, over 21165.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3234, pruned_loss=0.08177, over 4263073.86 frames. 
], batch size: 143, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:53:56,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1703232.0, ans=0.07 2023-06-24 08:53:59,300 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:54:20,066 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:54:26,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1703292.0, ans=0.125 2023-06-24 08:54:52,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1703352.0, ans=0.1 2023-06-24 08:55:04,526 INFO [train.py:996] (0/4) Epoch 10, batch 9450, loss[loss=0.2589, simple_loss=0.3758, pruned_loss=0.07102, over 20852.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3151, pruned_loss=0.08081, over 4258152.16 frames. ], batch size: 608, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:55:05,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-24 08:55:25,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1703472.0, ans=0.125 2023-06-24 08:56:19,385 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.658e+02 7.388e+02 1.013e+03 1.627e+03 3.415e+03, threshold=2.026e+03, percent-clipped=14.0 2023-06-24 08:56:36,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1703652.0, ans=0.1 2023-06-24 08:56:43,455 INFO [train.py:996] (0/4) Epoch 10, batch 9500, loss[loss=0.1942, simple_loss=0.2873, pruned_loss=0.05055, over 21616.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.308, pruned_loss=0.07893, over 4251565.68 frames. ], batch size: 441, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 08:57:00,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-24 08:57:13,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1703772.0, ans=0.07 2023-06-24 08:58:13,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1703952.0, ans=0.125 2023-06-24 08:58:14,357 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-284000.pt 2023-06-24 08:58:20,295 INFO [train.py:996] (0/4) Epoch 10, batch 9550, loss[loss=0.2898, simple_loss=0.3581, pruned_loss=0.1108, over 21432.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3118, pruned_loss=0.08117, over 4261948.35 frames. 
], batch size: 194, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 08:58:31,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1704012.0, ans=0.0 2023-06-24 08:59:09,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1704132.0, ans=0.125 2023-06-24 08:59:21,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1704192.0, ans=0.1 2023-06-24 08:59:33,967 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.417e+02 6.895e+02 1.023e+03 1.419e+03 2.349e+03, threshold=2.046e+03, percent-clipped=3.0 2023-06-24 08:59:47,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-24 08:59:53,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1704252.0, ans=0.125 2023-06-24 08:59:57,596 INFO [train.py:996] (0/4) Epoch 10, batch 9600, loss[loss=0.2558, simple_loss=0.319, pruned_loss=0.09635, over 21770.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.314, pruned_loss=0.08287, over 4266818.05 frames. ], batch size: 112, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:00:08,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1704312.0, ans=0.1 2023-06-24 09:00:10,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1704312.0, ans=0.2 2023-06-24 09:00:15,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1704372.0, ans=0.0 2023-06-24 09:00:29,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1704372.0, ans=0.0 2023-06-24 09:01:25,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1704552.0, ans=0.0 2023-06-24 09:01:37,648 INFO [train.py:996] (0/4) Epoch 10, batch 9650, loss[loss=0.2464, simple_loss=0.3196, pruned_loss=0.08664, over 21823.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3132, pruned_loss=0.08224, over 4272876.72 frames. ], batch size: 282, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:01:41,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1704612.0, ans=10.0 2023-06-24 09:02:55,068 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.754e+02 6.745e+02 1.012e+03 1.363e+03 3.649e+03, threshold=2.025e+03, percent-clipped=11.0 2023-06-24 09:03:04,508 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-24 09:03:08,031 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.91 vs. limit=15.0 2023-06-24 09:03:17,711 INFO [train.py:996] (0/4) Epoch 10, batch 9700, loss[loss=0.2225, simple_loss=0.3117, pruned_loss=0.06665, over 21810.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.318, pruned_loss=0.08292, over 4270452.94 frames. 
], batch size: 332, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:03:33,355 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.59 vs. limit=22.5 2023-06-24 09:04:18,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1705032.0, ans=0.125 2023-06-24 09:04:22,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1705032.0, ans=0.015 2023-06-24 09:04:37,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1705092.0, ans=0.0 2023-06-24 09:04:44,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1705152.0, ans=0.125 2023-06-24 09:04:49,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-24 09:04:55,690 INFO [train.py:996] (0/4) Epoch 10, batch 9750, loss[loss=0.2065, simple_loss=0.2708, pruned_loss=0.0711, over 21386.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3117, pruned_loss=0.08053, over 4273044.60 frames. ], batch size: 194, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:05:07,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1705212.0, ans=0.125 2023-06-24 09:06:10,523 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.138e+02 7.196e+02 1.029e+03 1.714e+03 4.123e+03, threshold=2.059e+03, percent-clipped=13.0 2023-06-24 09:06:32,723 INFO [train.py:996] (0/4) Epoch 10, batch 9800, loss[loss=0.2471, simple_loss=0.2997, pruned_loss=0.09724, over 21746.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.31, pruned_loss=0.08096, over 4267453.57 frames. ], batch size: 351, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:06:42,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1705512.0, ans=0.125 2023-06-24 09:07:40,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1705692.0, ans=15.0 2023-06-24 09:08:09,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1705812.0, ans=0.0 2023-06-24 09:08:10,629 INFO [train.py:996] (0/4) Epoch 10, batch 9850, loss[loss=0.2145, simple_loss=0.2789, pruned_loss=0.075, over 21777.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3079, pruned_loss=0.08101, over 4266463.90 frames. ], batch size: 298, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:08:11,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. 
limit=6.0 2023-06-24 09:08:48,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1705872.0, ans=0.0 2023-06-24 09:09:11,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1705932.0, ans=0.2 2023-06-24 09:09:26,569 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 7.595e+02 1.021e+03 1.343e+03 2.731e+03, threshold=2.043e+03, percent-clipped=9.0 2023-06-24 09:09:33,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1706052.0, ans=0.125 2023-06-24 09:09:49,328 INFO [train.py:996] (0/4) Epoch 10, batch 9900, loss[loss=0.2193, simple_loss=0.274, pruned_loss=0.08232, over 21230.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3046, pruned_loss=0.08049, over 4261284.45 frames. ], batch size: 548, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:09:49,967 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:10:01,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1706112.0, ans=0.2 2023-06-24 09:10:04,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1706112.0, ans=0.125 2023-06-24 09:10:53,742 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-24 09:11:13,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1706352.0, ans=0.0 2023-06-24 09:11:27,330 INFO [train.py:996] (0/4) Epoch 10, batch 9950, loss[loss=0.309, simple_loss=0.3416, pruned_loss=0.1382, over 21409.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3076, pruned_loss=0.08314, over 4260989.09 frames. ], batch size: 509, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:11:28,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-24 09:11:46,136 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=22.5 2023-06-24 09:12:01,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1706472.0, ans=0.125 2023-06-24 09:12:08,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0 2023-06-24 09:12:14,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1706472.0, ans=0.125 2023-06-24 09:12:15,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-24 09:12:44,778 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.712e+02 7.241e+02 1.069e+03 1.520e+03 2.876e+03, threshold=2.138e+03, percent-clipped=9.0 2023-06-24 09:12:57,238 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. 
limit=6.0 2023-06-24 09:13:12,906 INFO [train.py:996] (0/4) Epoch 10, batch 10000, loss[loss=0.2233, simple_loss=0.2892, pruned_loss=0.07867, over 21265.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3038, pruned_loss=0.08243, over 4260949.93 frames. ], batch size: 548, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:13:56,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1706772.0, ans=0.125 2023-06-24 09:13:57,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1706772.0, ans=10.0 2023-06-24 09:14:17,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-24 09:14:28,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1706892.0, ans=0.1 2023-06-24 09:14:30,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1706892.0, ans=0.125 2023-06-24 09:14:45,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1706952.0, ans=0.2 2023-06-24 09:14:54,919 INFO [train.py:996] (0/4) Epoch 10, batch 10050, loss[loss=0.2044, simple_loss=0.2861, pruned_loss=0.0613, over 21783.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3043, pruned_loss=0.08158, over 4267915.21 frames. ], batch size: 282, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:16:11,753 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.316e+02 6.781e+02 9.769e+02 1.554e+03 3.220e+03, threshold=1.954e+03, percent-clipped=12.0 2023-06-24 09:16:30,133 INFO [train.py:996] (0/4) Epoch 10, batch 10100, loss[loss=0.2393, simple_loss=0.319, pruned_loss=0.0798, over 21740.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3025, pruned_loss=0.07998, over 4268020.37 frames. ], batch size: 332, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:17:15,507 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:18:07,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1707552.0, ans=22.5 2023-06-24 09:18:13,782 INFO [train.py:996] (0/4) Epoch 10, batch 10150, loss[loss=0.2049, simple_loss=0.2722, pruned_loss=0.06886, over 21855.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3076, pruned_loss=0.08219, over 4263926.70 frames. ], batch size: 102, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:18:48,973 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-24 09:19:25,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.627e+02 7.106e+02 9.650e+02 1.431e+03 2.478e+03, threshold=1.930e+03, percent-clipped=8.0 2023-06-24 09:19:54,012 INFO [train.py:996] (0/4) Epoch 10, batch 10200, loss[loss=0.1773, simple_loss=0.2408, pruned_loss=0.05693, over 20803.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3054, pruned_loss=0.07969, over 4269984.24 frames. 
], batch size: 608, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:20:05,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1707912.0, ans=0.1 2023-06-24 09:20:30,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1708032.0, ans=0.0 2023-06-24 09:20:46,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1708092.0, ans=0.2 2023-06-24 09:21:05,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1708092.0, ans=0.1 2023-06-24 09:21:33,128 INFO [train.py:996] (0/4) Epoch 10, batch 10250, loss[loss=0.1779, simple_loss=0.2725, pruned_loss=0.04165, over 21570.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3, pruned_loss=0.07359, over 4268661.73 frames. ], batch size: 230, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:21:50,854 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.40 vs. limit=10.0 2023-06-24 09:22:12,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1708332.0, ans=0.0 2023-06-24 09:22:19,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1708332.0, ans=0.0 2023-06-24 09:22:46,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.201e+02 5.691e+02 8.251e+02 1.364e+03 2.412e+03, threshold=1.650e+03, percent-clipped=9.0 2023-06-24 09:23:01,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1708452.0, ans=0.2 2023-06-24 09:23:11,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1708452.0, ans=0.025 2023-06-24 09:23:21,849 INFO [train.py:996] (0/4) Epoch 10, batch 10300, loss[loss=0.2277, simple_loss=0.3229, pruned_loss=0.06626, over 21817.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.302, pruned_loss=0.07387, over 4275134.00 frames. ], batch size: 282, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:23:36,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1708512.0, ans=10.0 2023-06-24 09:23:45,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1708572.0, ans=0.125 2023-06-24 09:24:19,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1708692.0, ans=0.0 2023-06-24 09:25:04,364 INFO [train.py:996] (0/4) Epoch 10, batch 10350, loss[loss=0.1639, simple_loss=0.2175, pruned_loss=0.05518, over 21884.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3054, pruned_loss=0.07439, over 4275701.01 frames. 
], batch size: 107, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:25:54,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1708932.0, ans=0.07 2023-06-24 09:26:21,112 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.286e+02 6.885e+02 1.073e+03 1.600e+03 3.112e+03, threshold=2.146e+03, percent-clipped=24.0 2023-06-24 09:26:25,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1709052.0, ans=0.125 2023-06-24 09:26:41,154 INFO [train.py:996] (0/4) Epoch 10, batch 10400, loss[loss=0.1937, simple_loss=0.2391, pruned_loss=0.07411, over 21150.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2981, pruned_loss=0.07304, over 4264331.01 frames. ], batch size: 143, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:26:47,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1709112.0, ans=0.1 2023-06-24 09:27:52,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1709292.0, ans=0.1 2023-06-24 09:27:57,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1709292.0, ans=0.125 2023-06-24 09:28:14,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1709352.0, ans=0.0 2023-06-24 09:28:17,475 INFO [train.py:996] (0/4) Epoch 10, batch 10450, loss[loss=0.244, simple_loss=0.3194, pruned_loss=0.08433, over 21791.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3021, pruned_loss=0.07589, over 4269533.11 frames. ], batch size: 118, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:28:19,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1709412.0, ans=0.125 2023-06-24 09:28:54,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1709472.0, ans=0.125 2023-06-24 09:29:08,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1709532.0, ans=0.1 2023-06-24 09:29:32,231 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=12.0 2023-06-24 09:29:33,685 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-24 09:29:37,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.221e+02 7.690e+02 1.222e+03 1.916e+03 3.478e+03, threshold=2.445e+03, percent-clipped=16.0 2023-06-24 09:29:56,560 INFO [train.py:996] (0/4) Epoch 10, batch 10500, loss[loss=0.2156, simple_loss=0.2779, pruned_loss=0.07661, over 21607.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3031, pruned_loss=0.07516, over 4265204.64 frames. ], batch size: 298, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:30:10,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. 
limit=15.0 2023-06-24 09:30:32,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1709772.0, ans=0.125 2023-06-24 09:31:06,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1709892.0, ans=0.125 2023-06-24 09:31:35,616 INFO [train.py:996] (0/4) Epoch 10, batch 10550, loss[loss=0.1998, simple_loss=0.2564, pruned_loss=0.07163, over 21371.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2981, pruned_loss=0.07502, over 4261708.89 frames. ], batch size: 211, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:32:28,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1710132.0, ans=0.05 2023-06-24 09:32:47,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1710192.0, ans=10.0 2023-06-24 09:32:52,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1710192.0, ans=0.0 2023-06-24 09:32:55,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.399e+02 7.494e+02 1.006e+03 1.488e+03 3.263e+03, threshold=2.013e+03, percent-clipped=2.0 2023-06-24 09:33:07,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1710252.0, ans=0.2 2023-06-24 09:33:12,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1710252.0, ans=0.0 2023-06-24 09:33:15,649 INFO [train.py:996] (0/4) Epoch 10, batch 10600, loss[loss=0.1927, simple_loss=0.2818, pruned_loss=0.05181, over 21679.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2957, pruned_loss=0.07431, over 4257449.38 frames. ], batch size: 298, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:33:41,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1710372.0, ans=0.125 2023-06-24 09:34:07,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1710432.0, ans=0.125 2023-06-24 09:34:10,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1710432.0, ans=0.125 2023-06-24 09:34:23,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1710492.0, ans=0.1 2023-06-24 09:34:23,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1710492.0, ans=0.125 2023-06-24 09:34:57,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1710552.0, ans=0.125 2023-06-24 09:34:59,854 INFO [train.py:996] (0/4) Epoch 10, batch 10650, loss[loss=0.2381, simple_loss=0.3013, pruned_loss=0.08744, over 19989.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2969, pruned_loss=0.07312, over 4251585.21 frames. 
], batch size: 702, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:35:54,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1710732.0, ans=0.0 2023-06-24 09:36:16,437 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.325e+02 7.408e+02 1.241e+03 1.885e+03 3.956e+03, threshold=2.481e+03, percent-clipped=17.0 2023-06-24 09:36:19,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1710852.0, ans=0.125 2023-06-24 09:36:45,453 INFO [train.py:996] (0/4) Epoch 10, batch 10700, loss[loss=0.2819, simple_loss=0.3538, pruned_loss=0.1051, over 21568.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2958, pruned_loss=0.07291, over 4257470.05 frames. ], batch size: 389, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:37:28,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1711032.0, ans=0.125 2023-06-24 09:37:35,551 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=15.0 2023-06-24 09:37:48,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1711092.0, ans=0.0 2023-06-24 09:38:32,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1711212.0, ans=0.125 2023-06-24 09:38:33,062 INFO [train.py:996] (0/4) Epoch 10, batch 10750, loss[loss=0.2962, simple_loss=0.3942, pruned_loss=0.09915, over 21609.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3067, pruned_loss=0.07687, over 4261159.84 frames. ], batch size: 389, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:38:36,134 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.99 vs. limit=22.5 2023-06-24 09:39:01,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1711272.0, ans=0.125 2023-06-24 09:39:06,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1711272.0, ans=0.025 2023-06-24 09:39:18,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1711332.0, ans=0.2 2023-06-24 09:39:25,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1711332.0, ans=0.125 2023-06-24 09:39:50,487 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.281e+02 7.385e+02 1.038e+03 1.565e+03 3.899e+03, threshold=2.076e+03, percent-clipped=10.0 2023-06-24 09:40:20,315 INFO [train.py:996] (0/4) Epoch 10, batch 10800, loss[loss=0.3201, simple_loss=0.3821, pruned_loss=0.129, over 21353.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3132, pruned_loss=0.07827, over 4264443.27 frames. ], batch size: 507, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:40:26,051 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.52 vs. 
limit=15.0 2023-06-24 09:40:48,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1711572.0, ans=0.2 2023-06-24 09:40:49,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1711572.0, ans=0.1 2023-06-24 09:41:18,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.25 vs. limit=8.0 2023-06-24 09:41:27,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1711692.0, ans=0.1 2023-06-24 09:41:59,466 INFO [train.py:996] (0/4) Epoch 10, batch 10850, loss[loss=0.2286, simple_loss=0.2974, pruned_loss=0.07987, over 21593.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3129, pruned_loss=0.07911, over 4264352.88 frames. ], batch size: 391, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:42:06,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1711812.0, ans=0.0 2023-06-24 09:42:19,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-24 09:42:28,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1711872.0, ans=0.0 2023-06-24 09:42:37,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-24 09:43:10,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1711992.0, ans=0.125 2023-06-24 09:43:10,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1711992.0, ans=0.125 2023-06-24 09:43:12,443 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.37 vs. limit=10.0 2023-06-24 09:43:18,086 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.679e+02 6.562e+02 9.748e+02 1.395e+03 3.143e+03, threshold=1.950e+03, percent-clipped=4.0 2023-06-24 09:43:38,806 INFO [train.py:996] (0/4) Epoch 10, batch 10900, loss[loss=0.2515, simple_loss=0.3231, pruned_loss=0.08999, over 21575.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3086, pruned_loss=0.07746, over 4244246.39 frames. ], batch size: 414, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:43:44,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1712112.0, ans=0.125 2023-06-24 09:43:49,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1712112.0, ans=0.2 2023-06-24 09:43:54,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1712172.0, ans=0.125 2023-06-24 09:43:59,402 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.19 vs. 
limit=15.0 2023-06-24 09:44:41,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1712292.0, ans=0.1 2023-06-24 09:44:44,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1712292.0, ans=0.0 2023-06-24 09:45:18,167 INFO [train.py:996] (0/4) Epoch 10, batch 10950, loss[loss=0.2566, simple_loss=0.3553, pruned_loss=0.07897, over 19913.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.306, pruned_loss=0.07593, over 4231123.38 frames. ], batch size: 702, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:45:28,347 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:45:53,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1712532.0, ans=0.125 2023-06-24 09:46:06,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1712532.0, ans=0.2 2023-06-24 09:46:25,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1712592.0, ans=0.125 2023-06-24 09:46:35,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.061e+02 6.641e+02 1.100e+03 1.576e+03 3.666e+03, threshold=2.199e+03, percent-clipped=18.0 2023-06-24 09:46:56,694 INFO [train.py:996] (0/4) Epoch 10, batch 11000, loss[loss=0.2417, simple_loss=0.3028, pruned_loss=0.09034, over 21248.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3038, pruned_loss=0.07652, over 4229126.45 frames. ], batch size: 159, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:47:09,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1712712.0, ans=0.0 2023-06-24 09:47:11,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1712772.0, ans=0.1 2023-06-24 09:47:56,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1712832.0, ans=0.125 2023-06-24 09:48:12,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-24 09:48:29,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-24 09:48:35,817 INFO [train.py:996] (0/4) Epoch 10, batch 11050, loss[loss=0.2251, simple_loss=0.2853, pruned_loss=0.08243, over 21850.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3017, pruned_loss=0.07826, over 4246190.47 frames. ], batch size: 98, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:48:40,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1713012.0, ans=0.0 2023-06-24 09:48:47,658 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. 
limit=15.0 2023-06-24 09:48:54,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1713072.0, ans=0.1 2023-06-24 09:49:09,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1713132.0, ans=0.0 2023-06-24 09:49:46,634 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:49:52,351 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.455e+02 6.589e+02 8.856e+02 1.146e+03 2.849e+03, threshold=1.771e+03, percent-clipped=5.0 2023-06-24 09:50:11,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-24 09:50:13,218 INFO [train.py:996] (0/4) Epoch 10, batch 11100, loss[loss=0.2333, simple_loss=0.2916, pruned_loss=0.08751, over 21871.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3011, pruned_loss=0.07806, over 4253863.30 frames. ], batch size: 98, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:50:15,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1713312.0, ans=0.1 2023-06-24 09:50:37,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1713372.0, ans=0.125 2023-06-24 09:51:54,357 INFO [train.py:996] (0/4) Epoch 10, batch 11150, loss[loss=0.2245, simple_loss=0.3196, pruned_loss=0.06469, over 21738.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2977, pruned_loss=0.07735, over 4249783.43 frames. ], batch size: 351, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:52:50,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1713792.0, ans=0.125 2023-06-24 09:53:11,663 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.257e+02 6.370e+02 9.682e+02 1.600e+03 2.878e+03, threshold=1.936e+03, percent-clipped=17.0 2023-06-24 09:53:33,212 INFO [train.py:996] (0/4) Epoch 10, batch 11200, loss[loss=0.2422, simple_loss=0.2888, pruned_loss=0.09777, over 21288.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2964, pruned_loss=0.07749, over 4250386.96 frames. ], batch size: 507, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:53:45,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1713912.0, ans=0.1 2023-06-24 09:54:04,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1714032.0, ans=0.2 2023-06-24 09:54:10,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1714032.0, ans=0.2 2023-06-24 09:54:26,296 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:54:36,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.13 vs. limit=10.0 2023-06-24 09:55:07,063 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.59 vs. 
limit=5.0 2023-06-24 09:55:12,183 INFO [train.py:996] (0/4) Epoch 10, batch 11250, loss[loss=0.237, simple_loss=0.305, pruned_loss=0.08446, over 21559.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.296, pruned_loss=0.07828, over 4254209.97 frames. ], batch size: 471, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:55:36,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1714272.0, ans=0.0 2023-06-24 09:56:26,626 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.287e+02 6.030e+02 7.779e+02 1.066e+03 3.071e+03, threshold=1.556e+03, percent-clipped=6.0 2023-06-24 09:56:46,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1714512.0, ans=0.1 2023-06-24 09:56:47,865 INFO [train.py:996] (0/4) Epoch 10, batch 11300, loss[loss=0.1979, simple_loss=0.2827, pruned_loss=0.05653, over 21872.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2971, pruned_loss=0.0773, over 4260150.35 frames. ], batch size: 316, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:57:38,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1714632.0, ans=0.2 2023-06-24 09:58:27,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1714812.0, ans=0.1 2023-06-24 09:58:28,634 INFO [train.py:996] (0/4) Epoch 10, batch 11350, loss[loss=0.1789, simple_loss=0.2591, pruned_loss=0.04936, over 21498.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2997, pruned_loss=0.07741, over 4257667.42 frames. ], batch size: 195, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:58:38,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1714812.0, ans=0.0 2023-06-24 09:58:54,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1714872.0, ans=0.125 2023-06-24 09:59:33,548 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0 2023-06-24 09:59:49,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1714992.0, ans=0.0 2023-06-24 09:59:53,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.336e+02 5.946e+02 8.078e+02 1.222e+03 2.329e+03, threshold=1.616e+03, percent-clipped=17.0 2023-06-24 10:00:10,841 INFO [train.py:996] (0/4) Epoch 10, batch 11400, loss[loss=0.2215, simple_loss=0.3047, pruned_loss=0.06919, over 21623.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3033, pruned_loss=0.07858, over 4253839.99 frames. 
], batch size: 263, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:00:17,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1715112.0, ans=0.2 2023-06-24 10:00:25,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1715172.0, ans=0.125 2023-06-24 10:00:38,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1715172.0, ans=0.2 2023-06-24 10:00:59,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1715232.0, ans=0.95 2023-06-24 10:01:09,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1715232.0, ans=0.0 2023-06-24 10:01:29,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1715292.0, ans=0.04949747468305833 2023-06-24 10:01:50,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1715412.0, ans=0.0 2023-06-24 10:01:51,731 INFO [train.py:996] (0/4) Epoch 10, batch 11450, loss[loss=0.2262, simple_loss=0.3027, pruned_loss=0.07487, over 21280.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3033, pruned_loss=0.07683, over 4257533.47 frames. ], batch size: 176, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:02:19,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1715472.0, ans=0.125 2023-06-24 10:02:29,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1715472.0, ans=0.2 2023-06-24 10:02:31,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1715472.0, ans=0.0 2023-06-24 10:02:42,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1715532.0, ans=0.125 2023-06-24 10:03:16,702 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.172e+02 6.928e+02 8.683e+02 1.191e+03 2.555e+03, threshold=1.737e+03, percent-clipped=7.0 2023-06-24 10:03:29,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1715652.0, ans=0.125 2023-06-24 10:03:33,694 INFO [train.py:996] (0/4) Epoch 10, batch 11500, loss[loss=0.2037, simple_loss=0.291, pruned_loss=0.05813, over 21245.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3066, pruned_loss=0.07796, over 4253379.41 frames. ], batch size: 159, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:03:55,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1715772.0, ans=0.125 2023-06-24 10:05:01,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1715952.0, ans=10.0 2023-06-24 10:05:16,094 INFO [train.py:996] (0/4) Epoch 10, batch 11550, loss[loss=0.3361, simple_loss=0.4443, pruned_loss=0.1139, over 21664.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.313, pruned_loss=0.07815, over 4261323.52 frames. 
], batch size: 441, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:05:21,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1716012.0, ans=0.125 2023-06-24 10:06:36,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.588e+02 7.008e+02 1.217e+03 2.057e+03 3.971e+03, threshold=2.435e+03, percent-clipped=35.0 2023-06-24 10:06:37,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1716252.0, ans=0.125 2023-06-24 10:07:02,454 INFO [train.py:996] (0/4) Epoch 10, batch 11600, loss[loss=0.2692, simple_loss=0.354, pruned_loss=0.09214, over 21241.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3283, pruned_loss=0.08093, over 4264725.14 frames. ], batch size: 159, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:07:04,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1716312.0, ans=0.125 2023-06-24 10:08:35,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1716552.0, ans=0.04949747468305833 2023-06-24 10:08:36,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1716612.0, ans=0.0 2023-06-24 10:08:37,189 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-24 10:08:37,728 INFO [train.py:996] (0/4) Epoch 10, batch 11650, loss[loss=0.2807, simple_loss=0.3645, pruned_loss=0.0985, over 21295.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3341, pruned_loss=0.0812, over 4258934.72 frames. ], batch size: 176, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:09:06,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. limit=10.0 2023-06-24 10:09:40,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1716792.0, ans=0.2 2023-06-24 10:09:51,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.503e+02 6.606e+02 1.019e+03 1.524e+03 3.241e+03, threshold=2.038e+03, percent-clipped=8.0 2023-06-24 10:10:16,857 INFO [train.py:996] (0/4) Epoch 10, batch 11700, loss[loss=0.2304, simple_loss=0.2829, pruned_loss=0.08893, over 21643.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3271, pruned_loss=0.08111, over 4255160.39 frames. ], batch size: 248, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:10:39,848 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:11:10,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.66 vs. limit=15.0 2023-06-24 10:11:11,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-24 10:11:24,570 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5 2023-06-24 10:11:55,176 INFO [train.py:996] (0/4) Epoch 10, batch 11750, loss[loss=0.2273, simple_loss=0.3284, pruned_loss=0.06312, over 19868.00 frames. 
], tot_loss[loss=0.2396, simple_loss=0.3174, pruned_loss=0.08092, over 4250904.82 frames. ], batch size: 702, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:12:17,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1717272.0, ans=0.0 2023-06-24 10:12:22,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1717272.0, ans=0.125 2023-06-24 10:12:37,625 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:12:57,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1717392.0, ans=0.1 2023-06-24 10:13:15,604 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.135e+02 6.168e+02 8.880e+02 1.254e+03 3.045e+03, threshold=1.776e+03, percent-clipped=3.0 2023-06-24 10:13:27,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1717452.0, ans=0.2 2023-06-24 10:13:34,954 INFO [train.py:996] (0/4) Epoch 10, batch 11800, loss[loss=0.2353, simple_loss=0.3143, pruned_loss=0.07817, over 21452.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3176, pruned_loss=0.08274, over 4252571.32 frames. ], batch size: 211, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:14:12,287 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.34 vs. limit=8.0 2023-06-24 10:14:32,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1717692.0, ans=0.5 2023-06-24 10:14:37,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1717692.0, ans=0.125 2023-06-24 10:15:10,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1717752.0, ans=0.0 2023-06-24 10:15:19,749 INFO [train.py:996] (0/4) Epoch 10, batch 11850, loss[loss=0.2504, simple_loss=0.3416, pruned_loss=0.07963, over 21653.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3197, pruned_loss=0.08162, over 4261489.43 frames. ], batch size: 441, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:15:25,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.47 vs. limit=10.0 2023-06-24 10:15:53,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1717932.0, ans=0.05 2023-06-24 10:16:41,803 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.280e+02 6.048e+02 1.109e+03 1.865e+03 3.176e+03, threshold=2.218e+03, percent-clipped=24.0 2023-06-24 10:16:45,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1718052.0, ans=0.07 2023-06-24 10:17:01,690 INFO [train.py:996] (0/4) Epoch 10, batch 11900, loss[loss=0.1861, simple_loss=0.2621, pruned_loss=0.05507, over 21327.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3203, pruned_loss=0.07905, over 4259169.00 frames. ], batch size: 131, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:17:59,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. 
limit=15.0 2023-06-24 10:18:12,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-24 10:18:34,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1718352.0, ans=0.05 2023-06-24 10:18:40,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1718412.0, ans=0.0 2023-06-24 10:18:42,222 INFO [train.py:996] (0/4) Epoch 10, batch 11950, loss[loss=0.212, simple_loss=0.3036, pruned_loss=0.06022, over 21741.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3205, pruned_loss=0.07625, over 4256908.75 frames. ], batch size: 351, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:18:42,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1718412.0, ans=0.125 2023-06-24 10:19:17,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1718532.0, ans=0.125 2023-06-24 10:20:07,038 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.251e+02 6.812e+02 1.209e+03 1.807e+03 3.979e+03, threshold=2.418e+03, percent-clipped=18.0 2023-06-24 10:20:21,825 INFO [train.py:996] (0/4) Epoch 10, batch 12000, loss[loss=0.1898, simple_loss=0.254, pruned_loss=0.06279, over 21286.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3129, pruned_loss=0.07375, over 4261848.64 frames. ], batch size: 551, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:20:21,826 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 10:20:37,783 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2579, simple_loss=0.3537, pruned_loss=0.08105, over 1796401.00 frames. 2023-06-24 10:20:37,784 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 10:21:16,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1718772.0, ans=0.125 2023-06-24 10:21:16,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1718772.0, ans=0.125 2023-06-24 10:21:16,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1718772.0, ans=0.0 2023-06-24 10:21:27,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1718832.0, ans=0.125 2023-06-24 10:22:03,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1718952.0, ans=0.0 2023-06-24 10:22:14,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1718952.0, ans=0.0 2023-06-24 10:22:17,070 INFO [train.py:996] (0/4) Epoch 10, batch 12050, loss[loss=0.2175, simple_loss=0.2803, pruned_loss=0.07734, over 21596.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3088, pruned_loss=0.07526, over 4265961.13 frames. ], batch size: 548, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:22:18,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.53 vs. 
limit=5.0 2023-06-24 10:23:32,159 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5 2023-06-24 10:23:44,275 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.653e+02 7.636e+02 1.060e+03 1.403e+03 2.653e+03, threshold=2.120e+03, percent-clipped=2.0 2023-06-24 10:24:03,743 INFO [train.py:996] (0/4) Epoch 10, batch 12100, loss[loss=0.2568, simple_loss=0.3403, pruned_loss=0.08669, over 21637.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3134, pruned_loss=0.07883, over 4270897.89 frames. ], batch size: 389, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:24:05,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1719312.0, ans=0.1 2023-06-24 10:24:08,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0 2023-06-24 10:24:34,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1719372.0, ans=10.0 2023-06-24 10:24:42,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1719372.0, ans=0.125 2023-06-24 10:24:46,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1719432.0, ans=0.0 2023-06-24 10:25:20,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1719492.0, ans=0.04949747468305833 2023-06-24 10:25:24,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1719492.0, ans=0.125 2023-06-24 10:25:53,230 INFO [train.py:996] (0/4) Epoch 10, batch 12150, loss[loss=0.2316, simple_loss=0.3159, pruned_loss=0.07367, over 21839.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3181, pruned_loss=0.07901, over 4274728.83 frames. ], batch size: 316, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:25:58,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1719612.0, ans=0.2 2023-06-24 10:26:25,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1719672.0, ans=0.0 2023-06-24 10:27:15,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1719852.0, ans=0.0 2023-06-24 10:27:22,057 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.407e+02 7.316e+02 1.017e+03 1.333e+03 3.987e+03, threshold=2.033e+03, percent-clipped=9.0 2023-06-24 10:27:34,154 INFO [train.py:996] (0/4) Epoch 10, batch 12200, loss[loss=0.2407, simple_loss=0.2869, pruned_loss=0.09723, over 21331.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3146, pruned_loss=0.07715, over 4266893.34 frames. 
], batch size: 508, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:28:27,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1720032.0, ans=0.0 2023-06-24 10:28:30,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1720092.0, ans=0.05 2023-06-24 10:28:35,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1720092.0, ans=0.1 2023-06-24 10:29:13,865 INFO [train.py:996] (0/4) Epoch 10, batch 12250, loss[loss=0.2288, simple_loss=0.2841, pruned_loss=0.08678, over 20791.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3061, pruned_loss=0.07457, over 4263730.68 frames. ], batch size: 608, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:29:14,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1720212.0, ans=0.125 2023-06-24 10:29:18,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1720212.0, ans=0.125 2023-06-24 10:29:46,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1720272.0, ans=0.125 2023-06-24 10:30:37,188 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.979e+02 6.057e+02 9.242e+02 1.403e+03 3.346e+03, threshold=1.848e+03, percent-clipped=10.0 2023-06-24 10:30:52,772 INFO [train.py:996] (0/4) Epoch 10, batch 12300, loss[loss=0.2274, simple_loss=0.3212, pruned_loss=0.06682, over 21841.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3007, pruned_loss=0.07047, over 4248167.06 frames. ], batch size: 316, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:31:54,074 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-24 10:32:11,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=12.0 2023-06-24 10:32:38,253 INFO [train.py:996] (0/4) Epoch 10, batch 12350, loss[loss=0.2714, simple_loss=0.34, pruned_loss=0.1014, over 21523.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3053, pruned_loss=0.07182, over 4254779.92 frames. ], batch size: 548, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:33:04,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1720872.0, ans=0.125 2023-06-24 10:33:46,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1720992.0, ans=0.0 2023-06-24 10:33:56,484 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.324e+02 6.849e+02 9.475e+02 1.484e+03 4.503e+03, threshold=1.895e+03, percent-clipped=12.0 2023-06-24 10:33:58,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1721052.0, ans=0.0 2023-06-24 10:34:16,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1721112.0, ans=0.0 2023-06-24 10:34:17,411 INFO [train.py:996] (0/4) Epoch 10, batch 12400, loss[loss=0.2121, simple_loss=0.3387, pruned_loss=0.0428, over 19894.00 frames. 
], tot_loss[loss=0.2289, simple_loss=0.3094, pruned_loss=0.07417, over 4260111.82 frames. ], batch size: 703, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:34:38,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1721172.0, ans=0.125 2023-06-24 10:35:15,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1721292.0, ans=0.95 2023-06-24 10:35:17,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1721292.0, ans=0.125 2023-06-24 10:35:27,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-24 10:35:32,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1721292.0, ans=0.0 2023-06-24 10:35:56,476 INFO [train.py:996] (0/4) Epoch 10, batch 12450, loss[loss=0.2471, simple_loss=0.3304, pruned_loss=0.0819, over 21610.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3121, pruned_loss=0.07733, over 4267144.06 frames. ], batch size: 389, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:36:13,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1721412.0, ans=0.015 2023-06-24 10:36:24,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1721472.0, ans=0.1 2023-06-24 10:36:24,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1721472.0, ans=0.0 2023-06-24 10:36:52,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1721532.0, ans=0.2 2023-06-24 10:37:09,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1721592.0, ans=0.125 2023-06-24 10:37:26,129 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.427e+02 8.062e+02 1.038e+03 1.545e+03 2.621e+03, threshold=2.076e+03, percent-clipped=10.0 2023-06-24 10:37:42,462 INFO [train.py:996] (0/4) Epoch 10, batch 12500, loss[loss=0.2819, simple_loss=0.3725, pruned_loss=0.09568, over 21437.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.32, pruned_loss=0.08052, over 4269391.80 frames. ], batch size: 211, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:37:52,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1721712.0, ans=0.125 2023-06-24 10:38:04,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1721772.0, ans=0.0 2023-06-24 10:38:10,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1721772.0, ans=0.125 2023-06-24 10:39:09,876 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. 
limit=15.0 2023-06-24 10:39:19,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1721952.0, ans=0.1 2023-06-24 10:39:21,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1721952.0, ans=0.1 2023-06-24 10:39:23,803 INFO [train.py:996] (0/4) Epoch 10, batch 12550, loss[loss=0.2599, simple_loss=0.3436, pruned_loss=0.08807, over 21422.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.326, pruned_loss=0.08255, over 4265865.30 frames. ], batch size: 131, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:39:33,400 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0 2023-06-24 10:39:43,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1722072.0, ans=10.0 2023-06-24 10:40:27,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1722192.0, ans=0.125 2023-06-24 10:40:40,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1722192.0, ans=0.125 2023-06-24 10:40:48,775 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.945e+02 6.642e+02 8.862e+02 1.444e+03 2.963e+03, threshold=1.772e+03, percent-clipped=6.0 2023-06-24 10:40:58,125 INFO [train.py:996] (0/4) Epoch 10, batch 12600, loss[loss=0.1986, simple_loss=0.2893, pruned_loss=0.05395, over 21771.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3242, pruned_loss=0.08154, over 4264351.54 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:41:00,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-24 10:41:50,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1722432.0, ans=0.125 2023-06-24 10:41:55,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1722432.0, ans=0.1 2023-06-24 10:42:14,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.89 vs. limit=8.0 2023-06-24 10:42:19,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1722552.0, ans=0.0 2023-06-24 10:42:36,259 INFO [train.py:996] (0/4) Epoch 10, batch 12650, loss[loss=0.2402, simple_loss=0.3081, pruned_loss=0.08618, over 21829.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3154, pruned_loss=0.07779, over 4272639.82 frames. 
], batch size: 371, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:43:46,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1722792.0, ans=0.125 2023-06-24 10:44:02,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1722852.0, ans=0.0 2023-06-24 10:44:06,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.090e+02 7.127e+02 1.026e+03 1.420e+03 2.601e+03, threshold=2.052e+03, percent-clipped=16.0 2023-06-24 10:44:16,331 INFO [train.py:996] (0/4) Epoch 10, batch 12700, loss[loss=0.2871, simple_loss=0.3536, pruned_loss=0.1103, over 21621.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3151, pruned_loss=0.08034, over 4280687.27 frames. ], batch size: 389, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:44:17,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-06-24 10:44:58,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1723032.0, ans=0.125 2023-06-24 10:45:54,659 INFO [train.py:996] (0/4) Epoch 10, batch 12750, loss[loss=0.1984, simple_loss=0.2887, pruned_loss=0.05405, over 21703.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3154, pruned_loss=0.08019, over 4273982.06 frames. ], batch size: 247, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:46:05,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1723212.0, ans=0.0 2023-06-24 10:47:18,536 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.582e+02 6.526e+02 8.799e+02 1.342e+03 3.585e+03, threshold=1.760e+03, percent-clipped=6.0 2023-06-24 10:47:33,313 INFO [train.py:996] (0/4) Epoch 10, batch 12800, loss[loss=0.2273, simple_loss=0.2992, pruned_loss=0.07769, over 21648.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3155, pruned_loss=0.08113, over 4277902.96 frames. ], batch size: 263, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:48:30,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1723632.0, ans=0.1 2023-06-24 10:49:18,857 INFO [train.py:996] (0/4) Epoch 10, batch 12850, loss[loss=0.2197, simple_loss=0.3029, pruned_loss=0.06825, over 21434.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3183, pruned_loss=0.08283, over 4277716.47 frames. ], batch size: 211, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:49:41,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1723872.0, ans=0.125 2023-06-24 10:50:12,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1723932.0, ans=0.125 2023-06-24 10:50:48,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.383e+02 5.816e+02 7.846e+02 1.216e+03 2.443e+03, threshold=1.569e+03, percent-clipped=11.0 2023-06-24 10:51:02,587 INFO [train.py:996] (0/4) Epoch 10, batch 12900, loss[loss=0.2322, simple_loss=0.3369, pruned_loss=0.06377, over 21246.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3142, pruned_loss=0.07834, over 4283170.48 frames. 
], batch size: 548, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:51:11,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1724112.0, ans=0.1 2023-06-24 10:51:31,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1724172.0, ans=0.1 2023-06-24 10:51:39,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1724232.0, ans=10.0 2023-06-24 10:51:41,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1724232.0, ans=0.125 2023-06-24 10:52:34,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1724352.0, ans=0.025 2023-06-24 10:52:37,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1724352.0, ans=0.125 2023-06-24 10:52:43,501 INFO [train.py:996] (0/4) Epoch 10, batch 12950, loss[loss=0.2352, simple_loss=0.3141, pruned_loss=0.07813, over 21792.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3146, pruned_loss=0.07779, over 4282118.90 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:54:15,813 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.837e+02 8.466e+02 1.346e+03 1.826e+03 3.659e+03, threshold=2.691e+03, percent-clipped=37.0 2023-06-24 10:54:23,663 INFO [train.py:996] (0/4) Epoch 10, batch 13000, loss[loss=0.2628, simple_loss=0.336, pruned_loss=0.09477, over 21421.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3151, pruned_loss=0.07889, over 4281958.29 frames. ], batch size: 507, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:54:57,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=12.0 2023-06-24 10:56:01,991 INFO [train.py:996] (0/4) Epoch 10, batch 13050, loss[loss=0.2058, simple_loss=0.2783, pruned_loss=0.06662, over 21655.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3102, pruned_loss=0.07648, over 4279057.09 frames. ], batch size: 230, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:56:06,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1725012.0, ans=0.1 2023-06-24 10:56:26,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1725072.0, ans=0.0 2023-06-24 10:57:32,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1725252.0, ans=0.05 2023-06-24 10:57:33,643 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.137e+02 5.824e+02 8.081e+02 1.133e+03 2.445e+03, threshold=1.616e+03, percent-clipped=0.0 2023-06-24 10:57:41,736 INFO [train.py:996] (0/4) Epoch 10, batch 13100, loss[loss=0.2601, simple_loss=0.3376, pruned_loss=0.09127, over 21612.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3121, pruned_loss=0.07663, over 4281239.43 frames. 
], batch size: 389, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:58:36,244 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:58:44,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1725432.0, ans=0.125 2023-06-24 10:59:06,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1725552.0, ans=0.0 2023-06-24 10:59:21,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.49 vs. limit=8.0 2023-06-24 10:59:28,006 INFO [train.py:996] (0/4) Epoch 10, batch 13150, loss[loss=0.2134, simple_loss=0.2897, pruned_loss=0.06852, over 21746.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3171, pruned_loss=0.07775, over 4270530.40 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:59:35,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1725612.0, ans=0.125 2023-06-24 10:59:54,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1725672.0, ans=0.125 2023-06-24 11:00:22,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1725732.0, ans=0.125 2023-06-24 11:00:59,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.653e+02 8.375e+02 1.328e+03 1.823e+03 3.736e+03, threshold=2.655e+03, percent-clipped=31.0 2023-06-24 11:01:07,605 INFO [train.py:996] (0/4) Epoch 10, batch 13200, loss[loss=0.2552, simple_loss=0.327, pruned_loss=0.09168, over 21746.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.316, pruned_loss=0.07823, over 4270403.09 frames. ], batch size: 332, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 11:01:21,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1725912.0, ans=0.5 2023-06-24 11:01:46,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1725972.0, ans=0.125 2023-06-24 11:01:59,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1726032.0, ans=0.125 2023-06-24 11:02:10,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1726092.0, ans=0.125 2023-06-24 11:02:23,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1726092.0, ans=0.0 2023-06-24 11:02:41,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1726152.0, ans=0.125 2023-06-24 11:02:52,531 INFO [train.py:996] (0/4) Epoch 10, batch 13250, loss[loss=0.2292, simple_loss=0.3118, pruned_loss=0.07329, over 21670.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3148, pruned_loss=0.08021, over 4269909.02 frames. 
], batch size: 389, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 11:02:56,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1726212.0, ans=0.0 2023-06-24 11:03:08,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.00 vs. limit=15.0 2023-06-24 11:03:57,593 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:04:09,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1726392.0, ans=0.125 2023-06-24 11:04:19,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1726452.0, ans=10.0 2023-06-24 11:04:20,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1726452.0, ans=0.125 2023-06-24 11:04:24,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.098e+02 9.382e+02 1.293e+03 1.907e+03 4.949e+03, threshold=2.585e+03, percent-clipped=10.0 2023-06-24 11:04:32,056 INFO [train.py:996] (0/4) Epoch 10, batch 13300, loss[loss=0.2232, simple_loss=0.3136, pruned_loss=0.06638, over 21921.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3168, pruned_loss=0.07974, over 4274574.33 frames. ], batch size: 316, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 11:05:03,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1726572.0, ans=0.125 2023-06-24 11:06:14,244 INFO [train.py:996] (0/4) Epoch 10, batch 13350, loss[loss=0.2659, simple_loss=0.3331, pruned_loss=0.09933, over 20652.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3216, pruned_loss=0.08263, over 4273079.14 frames. ], batch size: 608, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:06:26,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1726812.0, ans=0.125 2023-06-24 11:07:13,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1726992.0, ans=0.0 2023-06-24 11:07:39,416 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.530e+02 6.336e+02 8.525e+02 1.254e+03 2.418e+03, threshold=1.705e+03, percent-clipped=0.0 2023-06-24 11:07:50,835 INFO [train.py:996] (0/4) Epoch 10, batch 13400, loss[loss=0.2316, simple_loss=0.2997, pruned_loss=0.0818, over 21468.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3224, pruned_loss=0.08399, over 4274470.30 frames. ], batch size: 131, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:08:05,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.40 vs. 
limit=10.0 2023-06-24 11:08:56,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1727292.0, ans=0.125 2023-06-24 11:09:14,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1727352.0, ans=0.125 2023-06-24 11:09:20,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1727352.0, ans=0.125 2023-06-24 11:09:27,810 INFO [train.py:996] (0/4) Epoch 10, batch 13450, loss[loss=0.239, simple_loss=0.304, pruned_loss=0.08696, over 21158.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3239, pruned_loss=0.08715, over 4282406.87 frames. ], batch size: 143, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:09:37,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-24 11:10:07,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1727532.0, ans=0.0 2023-06-24 11:10:09,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1727532.0, ans=0.2 2023-06-24 11:10:47,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1727592.0, ans=0.09899494936611666 2023-06-24 11:10:57,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1727652.0, ans=0.0 2023-06-24 11:10:59,932 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.972e+02 7.908e+02 1.192e+03 1.823e+03 3.915e+03, threshold=2.384e+03, percent-clipped=24.0 2023-06-24 11:11:06,217 INFO [train.py:996] (0/4) Epoch 10, batch 13500, loss[loss=0.1767, simple_loss=0.2362, pruned_loss=0.05855, over 21341.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3154, pruned_loss=0.08374, over 4279711.60 frames. ], batch size: 176, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:11:41,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1727772.0, ans=0.125 2023-06-24 11:12:06,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1727892.0, ans=0.125 2023-06-24 11:12:39,898 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-288000.pt 2023-06-24 11:12:50,302 INFO [train.py:996] (0/4) Epoch 10, batch 13550, loss[loss=0.2651, simple_loss=0.3618, pruned_loss=0.08421, over 21739.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.319, pruned_loss=0.08363, over 4274963.37 frames. 
], batch size: 298, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:12:56,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1728012.0, ans=0.125 2023-06-24 11:13:15,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1728072.0, ans=0.125 2023-06-24 11:14:09,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1728252.0, ans=0.1 2023-06-24 11:14:14,896 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.411e+02 7.999e+02 1.250e+03 1.835e+03 3.854e+03, threshold=2.499e+03, percent-clipped=11.0 2023-06-24 11:14:21,352 INFO [train.py:996] (0/4) Epoch 10, batch 13600, loss[loss=0.2048, simple_loss=0.2844, pruned_loss=0.06256, over 21586.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3191, pruned_loss=0.08411, over 4285067.92 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:14:32,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1728312.0, ans=0.0 2023-06-24 11:14:46,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1728372.0, ans=0.2 2023-06-24 11:15:03,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1728432.0, ans=0.125 2023-06-24 11:16:02,446 INFO [train.py:996] (0/4) Epoch 10, batch 13650, loss[loss=0.1948, simple_loss=0.2643, pruned_loss=0.06258, over 21627.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.313, pruned_loss=0.08069, over 4272651.86 frames. ], batch size: 247, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:16:05,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.50 vs. limit=10.0 2023-06-24 11:16:28,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.04 vs. limit=15.0 2023-06-24 11:16:33,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1728672.0, ans=0.2 2023-06-24 11:16:42,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1728732.0, ans=0.0 2023-06-24 11:16:45,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1728732.0, ans=0.0 2023-06-24 11:17:29,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.339e+02 6.847e+02 1.012e+03 1.771e+03 3.769e+03, threshold=2.024e+03, percent-clipped=10.0 2023-06-24 11:17:38,801 INFO [train.py:996] (0/4) Epoch 10, batch 13700, loss[loss=0.194, simple_loss=0.2574, pruned_loss=0.06534, over 21391.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3068, pruned_loss=0.07994, over 4272360.56 frames. ], batch size: 131, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:17:39,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. 
limit=10.0 2023-06-24 11:18:05,626 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:18:39,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1729092.0, ans=0.125 2023-06-24 11:18:44,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.90 vs. limit=6.0 2023-06-24 11:19:10,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1729152.0, ans=0.125 2023-06-24 11:19:13,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1729152.0, ans=0.125 2023-06-24 11:19:16,521 INFO [train.py:996] (0/4) Epoch 10, batch 13750, loss[loss=0.2183, simple_loss=0.2865, pruned_loss=0.07504, over 21295.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3058, pruned_loss=0.07898, over 4268516.98 frames. ], batch size: 159, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:19:28,780 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=15.0 2023-06-24 11:19:57,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1729272.0, ans=0.035 2023-06-24 11:20:28,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1729392.0, ans=0.2 2023-06-24 11:20:37,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1729392.0, ans=0.125 2023-06-24 11:20:56,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.866e+02 6.962e+02 1.196e+03 1.876e+03 4.514e+03, threshold=2.392e+03, percent-clipped=21.0 2023-06-24 11:21:05,681 INFO [train.py:996] (0/4) Epoch 10, batch 13800, loss[loss=0.2592, simple_loss=0.3784, pruned_loss=0.07002, over 19773.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3116, pruned_loss=0.078, over 4262389.41 frames. ], batch size: 703, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:21:30,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1729572.0, ans=0.035 2023-06-24 11:22:49,294 INFO [train.py:996] (0/4) Epoch 10, batch 13850, loss[loss=0.2536, simple_loss=0.3414, pruned_loss=0.0829, over 21735.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3183, pruned_loss=0.07939, over 4264573.08 frames. ], batch size: 351, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:23:00,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1729812.0, ans=0.125 2023-06-24 11:23:35,240 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. 
limit=15.0 2023-06-24 11:24:12,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1730052.0, ans=15.0 2023-06-24 11:24:21,353 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.592e+02 7.694e+02 1.020e+03 1.467e+03 3.637e+03, threshold=2.040e+03, percent-clipped=4.0 2023-06-24 11:24:25,908 INFO [train.py:996] (0/4) Epoch 10, batch 13900, loss[loss=0.2745, simple_loss=0.3377, pruned_loss=0.1056, over 21264.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3229, pruned_loss=0.08244, over 4267178.07 frames. ], batch size: 143, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:24:35,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1730112.0, ans=0.125 2023-06-24 11:25:23,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1730292.0, ans=0.125 2023-06-24 11:25:26,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1730292.0, ans=0.125 2023-06-24 11:25:55,460 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-24 11:26:02,529 INFO [train.py:996] (0/4) Epoch 10, batch 13950, loss[loss=0.2855, simple_loss=0.3585, pruned_loss=0.1063, over 21876.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.323, pruned_loss=0.08504, over 4277802.18 frames. ], batch size: 107, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:26:15,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1730412.0, ans=0.0 2023-06-24 11:26:35,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1730472.0, ans=0.1 2023-06-24 11:26:58,005 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:27:33,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.002e+02 6.836e+02 9.588e+02 1.525e+03 4.378e+03, threshold=1.918e+03, percent-clipped=10.0 2023-06-24 11:27:37,581 INFO [train.py:996] (0/4) Epoch 10, batch 14000, loss[loss=0.168, simple_loss=0.2385, pruned_loss=0.04875, over 21383.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.319, pruned_loss=0.08284, over 4271487.57 frames. ], batch size: 160, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:27:48,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1730712.0, ans=0.125 2023-06-24 11:28:13,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1730832.0, ans=0.125 2023-06-24 11:28:57,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1730952.0, ans=0.2 2023-06-24 11:29:12,985 INFO [train.py:996] (0/4) Epoch 10, batch 14050, loss[loss=0.1991, simple_loss=0.2727, pruned_loss=0.0628, over 21768.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3133, pruned_loss=0.07827, over 4280376.26 frames. 
], batch size: 351, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:29:50,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1731132.0, ans=0.1 2023-06-24 11:30:14,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1731192.0, ans=0.0 2023-06-24 11:30:24,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1731192.0, ans=0.0 2023-06-24 11:30:30,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1731252.0, ans=0.125 2023-06-24 11:30:40,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1731252.0, ans=0.2 2023-06-24 11:30:45,131 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.778e+02 8.037e+02 1.202e+03 1.798e+03 5.374e+03, threshold=2.404e+03, percent-clipped=21.0 2023-06-24 11:30:48,126 INFO [train.py:996] (0/4) Epoch 10, batch 14100, loss[loss=0.2167, simple_loss=0.3336, pruned_loss=0.04985, over 19819.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3082, pruned_loss=0.07808, over 4264900.92 frames. ], batch size: 704, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:31:13,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1731372.0, ans=0.125 2023-06-24 11:31:23,477 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-24 11:31:41,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1731432.0, ans=0.5 2023-06-24 11:31:46,742 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.28 vs. limit=15.0 2023-06-24 11:32:23,916 INFO [train.py:996] (0/4) Epoch 10, batch 14150, loss[loss=0.2405, simple_loss=0.3243, pruned_loss=0.07834, over 21658.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3099, pruned_loss=0.07831, over 4264797.29 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:32:24,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-24 11:32:47,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-24 11:32:58,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1731732.0, ans=0.125 2023-06-24 11:33:50,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.922e+02 6.112e+02 7.492e+02 9.193e+02 1.976e+03, threshold=1.498e+03, percent-clipped=0.0 2023-06-24 11:33:58,958 INFO [train.py:996] (0/4) Epoch 10, batch 14200, loss[loss=0.2014, simple_loss=0.2719, pruned_loss=0.06543, over 21776.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3086, pruned_loss=0.07727, over 4269725.30 frames. 
], batch size: 371, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:34:40,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1732032.0, ans=0.0 2023-06-24 11:34:58,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1732092.0, ans=0.0 2023-06-24 11:35:08,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-24 11:35:29,037 INFO [train.py:996] (0/4) Epoch 10, batch 14250, loss[loss=0.1908, simple_loss=0.2783, pruned_loss=0.05165, over 21662.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3037, pruned_loss=0.07674, over 4267238.13 frames. ], batch size: 415, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:35:53,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1732272.0, ans=0.125 2023-06-24 11:36:08,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.75 vs. limit=10.0 2023-06-24 11:36:36,055 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:36:44,577 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-24 11:37:04,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1732452.0, ans=0.5 2023-06-24 11:37:04,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1732452.0, ans=0.0 2023-06-24 11:37:05,372 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.032e+02 5.862e+02 7.753e+02 1.347e+03 3.974e+03, threshold=1.551e+03, percent-clipped=20.0 2023-06-24 11:37:08,459 INFO [train.py:996] (0/4) Epoch 10, batch 14300, loss[loss=0.2599, simple_loss=0.3515, pruned_loss=0.0842, over 21727.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3044, pruned_loss=0.07555, over 4265599.54 frames. ], batch size: 332, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:37:10,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1732512.0, ans=0.2 2023-06-24 11:37:25,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=22.5 2023-06-24 11:38:00,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1732632.0, ans=0.1 2023-06-24 11:38:00,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1732632.0, ans=0.125 2023-06-24 11:38:44,754 INFO [train.py:996] (0/4) Epoch 10, batch 14350, loss[loss=0.2812, simple_loss=0.3593, pruned_loss=0.1015, over 21559.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3098, pruned_loss=0.07744, over 4259849.84 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:39:06,374 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. 
limit=15.0 2023-06-24 11:39:07,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0 2023-06-24 11:39:10,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1732872.0, ans=0.125 2023-06-24 11:39:42,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1732932.0, ans=0.125 2023-06-24 11:40:15,034 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-24 11:40:17,081 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.258e+02 7.078e+02 1.011e+03 1.349e+03 3.463e+03, threshold=2.022e+03, percent-clipped=22.0 2023-06-24 11:40:25,273 INFO [train.py:996] (0/4) Epoch 10, batch 14400, loss[loss=0.2338, simple_loss=0.3061, pruned_loss=0.08079, over 16494.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3082, pruned_loss=0.07797, over 4262567.58 frames. ], batch size: 60, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:40:30,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1733112.0, ans=0.125 2023-06-24 11:40:52,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1733172.0, ans=10.0 2023-06-24 11:40:58,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1733232.0, ans=0.0 2023-06-24 11:41:54,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-24 11:41:54,321 INFO [train.py:996] (0/4) Epoch 10, batch 14450, loss[loss=0.1979, simple_loss=0.2605, pruned_loss=0.06766, over 21609.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.304, pruned_loss=0.07792, over 4256372.54 frames. ], batch size: 298, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:41:54,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1733412.0, ans=0.0 2023-06-24 11:42:24,046 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.12 vs. limit=6.0 2023-06-24 11:42:32,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1733532.0, ans=0.1 2023-06-24 11:42:58,855 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:43:23,370 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.538e+02 6.420e+02 9.253e+02 1.365e+03 3.104e+03, threshold=1.851e+03, percent-clipped=3.0 2023-06-24 11:43:26,356 INFO [train.py:996] (0/4) Epoch 10, batch 14500, loss[loss=0.2196, simple_loss=0.2931, pruned_loss=0.07302, over 21747.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3009, pruned_loss=0.07781, over 4262859.49 frames. 
], batch size: 98, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:43:37,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1733712.0, ans=0.125 2023-06-24 11:43:43,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1733712.0, ans=0.1 2023-06-24 11:43:44,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1733712.0, ans=0.125 2023-06-24 11:43:48,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1733772.0, ans=0.0 2023-06-24 11:44:09,893 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:45:03,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1733952.0, ans=0.125 2023-06-24 11:45:08,987 INFO [train.py:996] (0/4) Epoch 10, batch 14550, loss[loss=0.2852, simple_loss=0.3608, pruned_loss=0.1048, over 21238.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3035, pruned_loss=0.07854, over 4263046.53 frames. ], batch size: 143, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:46:04,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1734192.0, ans=0.125 2023-06-24 11:46:06,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1734192.0, ans=0.1 2023-06-24 11:46:39,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1734252.0, ans=0.125 2023-06-24 11:46:42,612 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.617e+02 6.748e+02 9.842e+02 1.367e+03 3.226e+03, threshold=1.968e+03, percent-clipped=9.0 2023-06-24 11:46:45,782 INFO [train.py:996] (0/4) Epoch 10, batch 14600, loss[loss=0.2478, simple_loss=0.3322, pruned_loss=0.08173, over 21338.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3139, pruned_loss=0.08346, over 4266017.61 frames. ], batch size: 176, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:47:55,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. limit=12.0 2023-06-24 11:48:21,380 INFO [train.py:996] (0/4) Epoch 10, batch 14650, loss[loss=0.2407, simple_loss=0.3451, pruned_loss=0.06817, over 19797.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.316, pruned_loss=0.08293, over 4251991.85 frames. ], batch size: 702, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:49:02,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1734732.0, ans=0.125 2023-06-24 11:49:05,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1734732.0, ans=0.125 2023-06-24 11:49:24,664 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.99 vs. 
limit=12.0 2023-06-24 11:49:54,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.330e+02 6.829e+02 9.850e+02 1.571e+03 3.523e+03, threshold=1.970e+03, percent-clipped=13.0 2023-06-24 11:49:57,775 INFO [train.py:996] (0/4) Epoch 10, batch 14700, loss[loss=0.1796, simple_loss=0.2363, pruned_loss=0.06147, over 20747.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3071, pruned_loss=0.07596, over 4252113.76 frames. ], batch size: 608, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:50:23,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1734972.0, ans=0.05 2023-06-24 11:51:08,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1735092.0, ans=0.125 2023-06-24 11:51:30,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1735152.0, ans=0.2 2023-06-24 11:51:36,606 INFO [train.py:996] (0/4) Epoch 10, batch 14750, loss[loss=0.3176, simple_loss=0.3853, pruned_loss=0.125, over 21428.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3129, pruned_loss=0.07906, over 4257517.72 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:51:43,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1735212.0, ans=0.125 2023-06-24 11:52:39,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1735332.0, ans=0.0 2023-06-24 11:52:46,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1735392.0, ans=0.0 2023-06-24 11:52:55,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1735392.0, ans=0.125 2023-06-24 11:53:10,544 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.743e+02 7.762e+02 1.072e+03 1.702e+03 3.196e+03, threshold=2.144e+03, percent-clipped=17.0 2023-06-24 11:53:13,744 INFO [train.py:996] (0/4) Epoch 10, batch 14800, loss[loss=0.2355, simple_loss=0.3201, pruned_loss=0.0754, over 21657.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.327, pruned_loss=0.08663, over 4262142.65 frames. ], batch size: 298, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:53:20,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1735512.0, ans=0.2 2023-06-24 11:53:38,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1735572.0, ans=0.125 2023-06-24 11:53:42,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1735572.0, ans=0.0 2023-06-24 11:54:19,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1735692.0, ans=0.1 2023-06-24 11:54:23,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1735692.0, ans=0.125 2023-06-24 11:55:02,612 INFO [train.py:996] (0/4) Epoch 10, batch 14850, loss[loss=0.2447, simple_loss=0.3162, pruned_loss=0.08658, over 21883.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3212, pruned_loss=0.086, over 4262371.84 frames. 
], batch size: 372, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:55:16,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1735812.0, ans=0.0 2023-06-24 11:55:32,355 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0 2023-06-24 11:56:04,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1735992.0, ans=0.1 2023-06-24 11:56:11,831 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=22.5 2023-06-24 11:56:37,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.296e+02 7.240e+02 1.054e+03 1.566e+03 3.588e+03, threshold=2.108e+03, percent-clipped=9.0 2023-06-24 11:56:40,615 INFO [train.py:996] (0/4) Epoch 10, batch 14900, loss[loss=0.2705, simple_loss=0.3365, pruned_loss=0.1023, over 20038.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3232, pruned_loss=0.08716, over 4254668.82 frames. ], batch size: 703, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:57:08,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-06-24 11:57:31,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-24 11:57:36,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1736232.0, ans=0.125 2023-06-24 11:57:45,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1736292.0, ans=0.5 2023-06-24 11:58:05,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-24 11:58:06,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1736352.0, ans=0.0 2023-06-24 11:58:28,368 INFO [train.py:996] (0/4) Epoch 10, batch 14950, loss[loss=0.2503, simple_loss=0.332, pruned_loss=0.0843, over 21798.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3246, pruned_loss=0.08652, over 4263221.81 frames. 
], batch size: 282, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:58:35,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736412.0, ans=0.1 2023-06-24 11:59:10,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1736532.0, ans=0.2 2023-06-24 11:59:22,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1736592.0, ans=0.125 2023-06-24 11:59:23,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1736592.0, ans=0.1 2023-06-24 11:59:31,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1736592.0, ans=0.0 2023-06-24 12:00:04,940 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.178e+02 7.042e+02 9.483e+02 1.424e+03 2.881e+03, threshold=1.897e+03, percent-clipped=9.0 2023-06-24 12:00:06,692 INFO [train.py:996] (0/4) Epoch 10, batch 15000, loss[loss=0.2583, simple_loss=0.3204, pruned_loss=0.09808, over 21665.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3258, pruned_loss=0.08748, over 4272326.97 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:00:06,693 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 12:00:22,750 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2522, simple_loss=0.3488, pruned_loss=0.07776, over 1796401.00 frames. 2023-06-24 12:00:22,751 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 12:00:32,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1736712.0, ans=0.125 2023-06-24 12:00:55,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=22.5 2023-06-24 12:01:04,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1736832.0, ans=0.125 2023-06-24 12:01:05,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736832.0, ans=0.1 2023-06-24 12:01:10,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1736832.0, ans=0.2 2023-06-24 12:01:13,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1736832.0, ans=0.125 2023-06-24 12:01:23,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1736892.0, ans=0.04949747468305833 2023-06-24 12:01:35,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1736892.0, ans=0.125 2023-06-24 12:01:47,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1736952.0, ans=0.0 2023-06-24 12:02:00,959 INFO [train.py:996] (0/4) Epoch 10, batch 15050, loss[loss=0.2818, simple_loss=0.3726, pruned_loss=0.09553, over 19936.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3262, pruned_loss=0.08813, over 4268773.06 frames. 
], batch size: 702, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:02:01,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1737012.0, ans=0.0 2023-06-24 12:02:04,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1737012.0, ans=0.1 2023-06-24 12:02:38,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1737072.0, ans=0.0 2023-06-24 12:02:50,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1737132.0, ans=0.0 2023-06-24 12:02:58,799 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-24 12:03:15,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1737192.0, ans=0.0 2023-06-24 12:03:23,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1737252.0, ans=0.2 2023-06-24 12:03:36,999 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.769e+02 8.670e+02 1.430e+03 2.221e+03 3.965e+03, threshold=2.861e+03, percent-clipped=33.0 2023-06-24 12:03:38,491 INFO [train.py:996] (0/4) Epoch 10, batch 15100, loss[loss=0.2421, simple_loss=0.3186, pruned_loss=0.08281, over 21588.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3297, pruned_loss=0.08855, over 4269746.68 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:04:06,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1737372.0, ans=0.07 2023-06-24 12:04:11,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1737372.0, ans=0.0 2023-06-24 12:05:14,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1737612.0, ans=0.0 2023-06-24 12:05:15,769 INFO [train.py:996] (0/4) Epoch 10, batch 15150, loss[loss=0.2079, simple_loss=0.2802, pruned_loss=0.06782, over 21421.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3249, pruned_loss=0.08733, over 4269565.43 frames. ], batch size: 131, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:06:15,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1737732.0, ans=15.0 2023-06-24 12:06:26,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.40 vs. limit=15.0 2023-06-24 12:06:27,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1737792.0, ans=0.125 2023-06-24 12:06:28,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1737792.0, ans=0.0 2023-06-24 12:06:43,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1737852.0, ans=0.2 2023-06-24 12:06:53,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. 
limit=15.0 2023-06-24 12:06:55,785 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.279e+02 6.808e+02 1.091e+03 1.698e+03 5.270e+03, threshold=2.181e+03, percent-clipped=2.0 2023-06-24 12:07:02,091 INFO [train.py:996] (0/4) Epoch 10, batch 15200, loss[loss=0.2583, simple_loss=0.3722, pruned_loss=0.07218, over 19779.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3165, pruned_loss=0.08319, over 4260374.76 frames. ], batch size: 703, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 12:07:03,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1737912.0, ans=0.1 2023-06-24 12:07:31,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1737972.0, ans=0.125 2023-06-24 12:08:03,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1738092.0, ans=0.0 2023-06-24 12:08:32,600 INFO [train.py:996] (0/4) Epoch 10, batch 15250, loss[loss=0.3114, simple_loss=0.3493, pruned_loss=0.1367, over 21389.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3106, pruned_loss=0.0815, over 4262825.95 frames. ], batch size: 509, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:08:33,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.77 vs. limit=12.0 2023-06-24 12:08:39,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1738212.0, ans=0.1 2023-06-24 12:08:40,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1738212.0, ans=0.05 2023-06-24 12:08:47,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1738212.0, ans=0.125 2023-06-24 12:08:58,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1738272.0, ans=0.1 2023-06-24 12:10:17,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.809e+02 9.550e+02 1.644e+03 2.423e+03 4.460e+03, threshold=3.287e+03, percent-clipped=35.0 2023-06-24 12:10:17,260 INFO [train.py:996] (0/4) Epoch 10, batch 15300, loss[loss=0.3094, simple_loss=0.3567, pruned_loss=0.131, over 21410.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3137, pruned_loss=0.08412, over 4264206.22 frames. ], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:10:27,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1738512.0, ans=0.1 2023-06-24 12:10:47,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1738572.0, ans=0.125 2023-06-24 12:11:45,834 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:11:54,380 INFO [train.py:996] (0/4) Epoch 10, batch 15350, loss[loss=0.2566, simple_loss=0.3487, pruned_loss=0.08231, over 21631.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3191, pruned_loss=0.08701, over 4267251.25 frames. 
], batch size: 389, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:11:56,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1738812.0, ans=0.05 2023-06-24 12:13:00,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1738992.0, ans=0.0 2023-06-24 12:13:24,422 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.980e+02 7.880e+02 1.088e+03 1.633e+03 3.514e+03, threshold=2.175e+03, percent-clipped=1.0 2023-06-24 12:13:24,453 INFO [train.py:996] (0/4) Epoch 10, batch 15400, loss[loss=0.2362, simple_loss=0.3099, pruned_loss=0.08126, over 21196.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3198, pruned_loss=0.0845, over 4263661.31 frames. ], batch size: 143, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:13:41,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1739112.0, ans=0.0 2023-06-24 12:13:52,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1739172.0, ans=0.125 2023-06-24 12:13:54,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1739172.0, ans=0.125 2023-06-24 12:14:48,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1739352.0, ans=0.125 2023-06-24 12:15:05,363 INFO [train.py:996] (0/4) Epoch 10, batch 15450, loss[loss=0.2326, simple_loss=0.3269, pruned_loss=0.06915, over 21823.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3181, pruned_loss=0.0839, over 4262993.06 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:15:10,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1739412.0, ans=0.125 2023-06-24 12:15:10,747 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:15:40,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-24 12:15:40,716 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-24 12:16:21,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1739652.0, ans=0.125 2023-06-24 12:16:43,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 7.285e+02 1.027e+03 1.632e+03 3.153e+03, threshold=2.054e+03, percent-clipped=10.0 2023-06-24 12:16:43,425 INFO [train.py:996] (0/4) Epoch 10, batch 15500, loss[loss=0.2646, simple_loss=0.3362, pruned_loss=0.09655, over 21838.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3201, pruned_loss=0.08399, over 4241962.31 frames. 
], batch size: 247, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:17:01,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1739712.0, ans=0.125 2023-06-24 12:18:11,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1739952.0, ans=0.0 2023-06-24 12:18:19,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1739952.0, ans=0.125 2023-06-24 12:18:21,670 INFO [train.py:996] (0/4) Epoch 10, batch 15550, loss[loss=0.1782, simple_loss=0.2451, pruned_loss=0.05566, over 21841.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3182, pruned_loss=0.08262, over 4250471.78 frames. ], batch size: 98, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:18:42,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1740072.0, ans=0.0 2023-06-24 12:18:50,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1740072.0, ans=0.125 2023-06-24 12:19:32,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1740192.0, ans=0.1 2023-06-24 12:19:47,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1740252.0, ans=0.1 2023-06-24 12:19:58,918 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.243e+02 5.866e+02 9.218e+02 1.616e+03 3.082e+03, threshold=1.844e+03, percent-clipped=8.0 2023-06-24 12:19:58,939 INFO [train.py:996] (0/4) Epoch 10, batch 15600, loss[loss=0.2109, simple_loss=0.2842, pruned_loss=0.06883, over 21404.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3113, pruned_loss=0.08049, over 4256190.79 frames. ], batch size: 389, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:20:03,049 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.80 vs. limit=6.0 2023-06-24 12:20:04,428 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=22.5 2023-06-24 12:21:09,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1740492.0, ans=0.125 2023-06-24 12:21:21,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1740552.0, ans=0.125 2023-06-24 12:21:30,706 INFO [train.py:996] (0/4) Epoch 10, batch 15650, loss[loss=0.2488, simple_loss=0.3115, pruned_loss=0.09307, over 20647.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3093, pruned_loss=0.08008, over 4244259.48 frames. ], batch size: 607, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:21:33,288 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.06 vs. 
limit=15.0 2023-06-24 12:22:33,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1740792.0, ans=0.1 2023-06-24 12:23:07,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.033e+02 7.365e+02 1.096e+03 1.379e+03 2.536e+03, threshold=2.192e+03, percent-clipped=6.0 2023-06-24 12:23:07,134 INFO [train.py:996] (0/4) Epoch 10, batch 15700, loss[loss=0.2268, simple_loss=0.2865, pruned_loss=0.08357, over 21205.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.305, pruned_loss=0.07885, over 4248785.00 frames. ], batch size: 159, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:23:19,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1740912.0, ans=0.125 2023-06-24 12:23:41,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-24 12:24:13,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1741092.0, ans=0.125 2023-06-24 12:24:43,549 INFO [train.py:996] (0/4) Epoch 10, batch 15750, loss[loss=0.2249, simple_loss=0.2969, pruned_loss=0.07644, over 21738.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3029, pruned_loss=0.07899, over 4254624.01 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:25:26,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.97 vs. limit=15.0 2023-06-24 12:25:41,076 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:25:42,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1741392.0, ans=0.04949747468305833 2023-06-24 12:25:47,649 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=12.0 2023-06-24 12:25:50,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1741392.0, ans=10.0 2023-06-24 12:26:13,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.443e+02 6.609e+02 9.142e+02 1.184e+03 2.398e+03, threshold=1.828e+03, percent-clipped=2.0 2023-06-24 12:26:13,808 INFO [train.py:996] (0/4) Epoch 10, batch 15800, loss[loss=0.199, simple_loss=0.2624, pruned_loss=0.0678, over 21319.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2971, pruned_loss=0.07816, over 4259039.65 frames. ], batch size: 144, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:26:28,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.46 vs. limit=10.0 2023-06-24 12:26:42,207 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:27:49,635 INFO [train.py:996] (0/4) Epoch 10, batch 15850, loss[loss=0.2189, simple_loss=0.2792, pruned_loss=0.0793, over 21570.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.2991, pruned_loss=0.07991, over 4260529.23 frames. 
], batch size: 230, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:28:34,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1741932.0, ans=0.1 2023-06-24 12:29:26,899 INFO [train.py:996] (0/4) Epoch 10, batch 15900, loss[loss=0.2532, simple_loss=0.3152, pruned_loss=0.09564, over 21249.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.297, pruned_loss=0.08023, over 4258866.99 frames. ], batch size: 143, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:29:28,407 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.526e+02 8.317e+02 1.237e+03 1.605e+03 4.098e+03, threshold=2.474e+03, percent-clipped=15.0 2023-06-24 12:29:42,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1742112.0, ans=0.125 2023-06-24 12:30:04,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1742172.0, ans=0.125 2023-06-24 12:30:22,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1742292.0, ans=0.2 2023-06-24 12:31:04,169 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-24 12:31:04,688 INFO [train.py:996] (0/4) Epoch 10, batch 15950, loss[loss=0.188, simple_loss=0.2719, pruned_loss=0.05205, over 21467.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2989, pruned_loss=0.0776, over 4249725.63 frames. ], batch size: 131, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:31:18,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1742412.0, ans=0.1 2023-06-24 12:31:35,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-24 12:31:40,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1742472.0, ans=0.0 2023-06-24 12:31:41,603 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-24 12:31:51,226 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=15.0 2023-06-24 12:31:53,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1742532.0, ans=0.1 2023-06-24 12:31:56,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.98 vs. limit=15.0 2023-06-24 12:32:22,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1742652.0, ans=0.0 2023-06-24 12:32:42,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1742712.0, ans=0.0 2023-06-24 12:32:43,177 INFO [train.py:996] (0/4) Epoch 10, batch 16000, loss[loss=0.3015, simple_loss=0.3814, pruned_loss=0.1107, over 21546.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3012, pruned_loss=0.07594, over 4256222.77 frames. 
], batch size: 508, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:32:44,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 6.384e+02 8.996e+02 1.327e+03 2.604e+03, threshold=1.799e+03, percent-clipped=2.0 2023-06-24 12:32:49,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1742712.0, ans=0.2 2023-06-24 12:33:18,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1742772.0, ans=0.0 2023-06-24 12:33:40,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1742892.0, ans=0.125 2023-06-24 12:34:20,853 INFO [train.py:996] (0/4) Epoch 10, batch 16050, loss[loss=0.1899, simple_loss=0.274, pruned_loss=0.05287, over 21466.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3051, pruned_loss=0.07535, over 4261693.84 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:34:52,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1743072.0, ans=0.125 2023-06-24 12:35:12,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-24 12:35:51,319 INFO [train.py:996] (0/4) Epoch 10, batch 16100, loss[loss=0.2586, simple_loss=0.3186, pruned_loss=0.09931, over 21319.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3086, pruned_loss=0.07642, over 4271867.71 frames. ], batch size: 143, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:35:54,432 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 6.034e+02 8.242e+02 1.333e+03 2.832e+03, threshold=1.648e+03, percent-clipped=8.0 2023-06-24 12:37:26,917 INFO [train.py:996] (0/4) Epoch 10, batch 16150, loss[loss=0.2188, simple_loss=0.2973, pruned_loss=0.07015, over 21687.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3066, pruned_loss=0.07797, over 4285934.82 frames. ], batch size: 230, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:37:35,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1743612.0, ans=0.125 2023-06-24 12:37:36,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1743612.0, ans=0.1 2023-06-24 12:38:28,279 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-24 12:39:05,208 INFO [train.py:996] (0/4) Epoch 10, batch 16200, loss[loss=0.3269, simple_loss=0.3793, pruned_loss=0.1373, over 21461.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3105, pruned_loss=0.07872, over 4277818.84 frames. ], batch size: 131, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:39:08,333 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.546e+02 7.202e+02 1.055e+03 1.408e+03 3.192e+03, threshold=2.110e+03, percent-clipped=15.0 2023-06-24 12:39:36,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-24 12:40:12,031 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.05 vs. 
limit=15.0 2023-06-24 12:40:27,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1744152.0, ans=0.125 2023-06-24 12:40:37,334 INFO [train.py:996] (0/4) Epoch 10, batch 16250, loss[loss=0.1875, simple_loss=0.2977, pruned_loss=0.03866, over 19760.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.311, pruned_loss=0.07912, over 4272349.09 frames. ], batch size: 703, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:42:18,497 INFO [train.py:996] (0/4) Epoch 10, batch 16300, loss[loss=0.1738, simple_loss=0.2706, pruned_loss=0.03848, over 21844.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3053, pruned_loss=0.07478, over 4275684.25 frames. ], batch size: 317, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:42:20,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1744512.0, ans=0.125 2023-06-24 12:42:27,043 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.249e+02 6.511e+02 9.054e+02 1.473e+03 4.161e+03, threshold=1.811e+03, percent-clipped=10.0 2023-06-24 12:42:42,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1744572.0, ans=0.125 2023-06-24 12:43:17,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1744692.0, ans=0.125 2023-06-24 12:43:29,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1744692.0, ans=0.125 2023-06-24 12:43:30,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1744692.0, ans=0.0 2023-06-24 12:43:34,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1744692.0, ans=15.0 2023-06-24 12:43:44,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1744752.0, ans=0.0 2023-06-24 12:44:00,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1744812.0, ans=0.125 2023-06-24 12:44:01,350 INFO [train.py:996] (0/4) Epoch 10, batch 16350, loss[loss=0.2275, simple_loss=0.3024, pruned_loss=0.07625, over 21707.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3049, pruned_loss=0.07531, over 4268493.87 frames. ], batch size: 298, lr: 2.93e-03, grad_scale: 8.0 2023-06-24 12:44:42,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1744932.0, ans=0.0 2023-06-24 12:44:45,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1744932.0, ans=0.2 2023-06-24 12:45:38,335 INFO [train.py:996] (0/4) Epoch 10, batch 16400, loss[loss=0.2302, simple_loss=0.3002, pruned_loss=0.08011, over 21320.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3105, pruned_loss=0.07774, over 4268679.53 frames. 
], batch size: 176, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:45:41,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1745112.0, ans=0.125 2023-06-24 12:45:42,806 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.368e+02 7.745e+02 1.144e+03 1.661e+03 2.943e+03, threshold=2.288e+03, percent-clipped=22.0 2023-06-24 12:46:01,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1745172.0, ans=0.125 2023-06-24 12:46:51,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1745292.0, ans=0.07 2023-06-24 12:47:16,259 INFO [train.py:996] (0/4) Epoch 10, batch 16450, loss[loss=0.2269, simple_loss=0.2937, pruned_loss=0.08002, over 21524.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3119, pruned_loss=0.07873, over 4273271.19 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:47:27,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1745412.0, ans=0.09899494936611666 2023-06-24 12:47:55,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1745532.0, ans=0.035 2023-06-24 12:48:53,704 INFO [train.py:996] (0/4) Epoch 10, batch 16500, loss[loss=0.1943, simple_loss=0.2592, pruned_loss=0.0647, over 21371.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3103, pruned_loss=0.07899, over 4278963.49 frames. ], batch size: 194, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:48:58,419 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.573e+02 7.722e+02 1.056e+03 1.682e+03 4.861e+03, threshold=2.112e+03, percent-clipped=4.0 2023-06-24 12:49:03,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1745712.0, ans=0.125 2023-06-24 12:49:11,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1745772.0, ans=0.125 2023-06-24 12:49:43,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=22.5 2023-06-24 12:49:52,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1745832.0, ans=0.0 2023-06-24 12:50:31,403 INFO [train.py:996] (0/4) Epoch 10, batch 16550, loss[loss=0.2391, simple_loss=0.3055, pruned_loss=0.08637, over 20099.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.31, pruned_loss=0.07754, over 4274673.06 frames. 
], batch size: 702, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:50:31,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1746012.0, ans=0.0 2023-06-24 12:50:35,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1746012.0, ans=0.07 2023-06-24 12:51:31,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1746132.0, ans=0.2 2023-06-24 12:51:53,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1746192.0, ans=0.125 2023-06-24 12:52:16,512 INFO [train.py:996] (0/4) Epoch 10, batch 16600, loss[loss=0.2805, simple_loss=0.3835, pruned_loss=0.08878, over 21714.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3157, pruned_loss=0.07986, over 4272858.91 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:52:21,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.341e+02 8.131e+02 1.242e+03 1.757e+03 3.477e+03, threshold=2.484e+03, percent-clipped=12.0 2023-06-24 12:52:21,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1746312.0, ans=0.125 2023-06-24 12:52:37,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-24 12:52:38,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1746372.0, ans=0.0 2023-06-24 12:52:48,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1746372.0, ans=0.125 2023-06-24 12:53:09,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1746432.0, ans=0.125 2023-06-24 12:53:34,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1746552.0, ans=0.0 2023-06-24 12:54:00,393 INFO [train.py:996] (0/4) Epoch 10, batch 16650, loss[loss=0.2938, simple_loss=0.3636, pruned_loss=0.112, over 21835.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3253, pruned_loss=0.08343, over 4276151.80 frames. ], batch size: 118, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:54:47,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1746732.0, ans=0.1 2023-06-24 12:55:27,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1746852.0, ans=0.0 2023-06-24 12:55:44,457 INFO [train.py:996] (0/4) Epoch 10, batch 16700, loss[loss=0.2037, simple_loss=0.2667, pruned_loss=0.07035, over 21456.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.327, pruned_loss=0.08437, over 4263308.04 frames. ], batch size: 194, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:55:44,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1746912.0, ans=0.125 2023-06-24 12:55:48,767 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. 
limit=15.0 2023-06-24 12:55:49,387 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.755e+02 7.026e+02 1.004e+03 1.401e+03 2.239e+03, threshold=2.007e+03, percent-clipped=0.0 2023-06-24 12:56:01,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1746912.0, ans=0.1 2023-06-24 12:56:26,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1747032.0, ans=0.0 2023-06-24 12:57:30,026 INFO [train.py:996] (0/4) Epoch 10, batch 16750, loss[loss=0.3241, simple_loss=0.4085, pruned_loss=0.1199, over 21701.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3303, pruned_loss=0.08733, over 4268127.58 frames. ], batch size: 441, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:57:59,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1747272.0, ans=0.125 2023-06-24 12:58:00,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1747272.0, ans=0.125 2023-06-24 12:58:30,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-06-24 12:58:34,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1747392.0, ans=0.125 2023-06-24 12:59:08,101 INFO [train.py:996] (0/4) Epoch 10, batch 16800, loss[loss=0.2402, simple_loss=0.3031, pruned_loss=0.08867, over 21848.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3305, pruned_loss=0.08579, over 4267012.65 frames. ], batch size: 118, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:59:12,605 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.402e+02 6.959e+02 1.077e+03 1.675e+03 3.931e+03, threshold=2.154e+03, percent-clipped=17.0 2023-06-24 12:59:59,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1747632.0, ans=0.0 2023-06-24 12:59:59,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1747632.0, ans=0.125 2023-06-24 13:00:06,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1747692.0, ans=0.04949747468305833 2023-06-24 13:00:08,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1747692.0, ans=0.125 2023-06-24 13:00:12,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.85 vs. limit=10.0 2023-06-24 13:00:36,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1747752.0, ans=0.0 2023-06-24 13:00:43,291 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2023-06-24 13:00:43,551 INFO [train.py:996] (0/4) Epoch 10, batch 16850, loss[loss=0.2321, simple_loss=0.3064, pruned_loss=0.07895, over 21509.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3267, pruned_loss=0.08667, over 4273171.39 frames. 
], batch size: 131, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:01:07,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1747872.0, ans=0.125 2023-06-24 13:02:00,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1747992.0, ans=0.0 2023-06-24 13:02:19,554 INFO [train.py:996] (0/4) Epoch 10, batch 16900, loss[loss=0.2458, simple_loss=0.3663, pruned_loss=0.06261, over 20732.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3212, pruned_loss=0.08468, over 4272962.93 frames. ], batch size: 607, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:02:30,797 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.364e+02 7.003e+02 1.142e+03 1.621e+03 3.220e+03, threshold=2.284e+03, percent-clipped=11.0 2023-06-24 13:03:01,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1748172.0, ans=0.2 2023-06-24 13:03:56,482 INFO [train.py:996] (0/4) Epoch 10, batch 16950, loss[loss=0.2553, simple_loss=0.3104, pruned_loss=0.1002, over 21761.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3139, pruned_loss=0.08267, over 4279224.15 frames. ], batch size: 508, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:04:14,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1748412.0, ans=0.1 2023-06-24 13:04:52,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1748532.0, ans=0.0 2023-06-24 13:05:14,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.03 vs. limit=22.5 2023-06-24 13:05:19,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1748652.0, ans=0.5 2023-06-24 13:05:33,931 INFO [train.py:996] (0/4) Epoch 10, batch 17000, loss[loss=0.2425, simple_loss=0.3011, pruned_loss=0.09201, over 21634.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3121, pruned_loss=0.08248, over 4283060.40 frames. ], batch size: 548, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:05:44,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.290e+02 6.732e+02 9.381e+02 1.306e+03 2.679e+03, threshold=1.876e+03, percent-clipped=4.0 2023-06-24 13:06:08,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1748772.0, ans=0.1 2023-06-24 13:06:14,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1748772.0, ans=0.125 2023-06-24 13:06:17,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1748832.0, ans=0.0 2023-06-24 13:06:19,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1748832.0, ans=0.125 2023-06-24 13:06:52,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1748892.0, ans=0.125 2023-06-24 13:07:15,594 INFO [train.py:996] (0/4) Epoch 10, batch 17050, loss[loss=0.3063, simple_loss=0.378, pruned_loss=0.1173, over 21581.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3203, pruned_loss=0.08446, over 4290431.68 frames. 
], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:07:37,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1749012.0, ans=0.2 2023-06-24 13:07:43,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1749072.0, ans=0.125 2023-06-24 13:08:25,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1749192.0, ans=0.0 2023-06-24 13:08:39,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1749252.0, ans=0.0 2023-06-24 13:08:46,025 INFO [train.py:996] (0/4) Epoch 10, batch 17100, loss[loss=0.22, simple_loss=0.2874, pruned_loss=0.07633, over 21437.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3191, pruned_loss=0.08504, over 4295082.47 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:08:56,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.820e+02 8.115e+02 1.148e+03 1.810e+03 4.142e+03, threshold=2.296e+03, percent-clipped=21.0 2023-06-24 13:09:33,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1749432.0, ans=0.125 2023-06-24 13:09:34,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1749432.0, ans=0.2 2023-06-24 13:10:13,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1749552.0, ans=0.035 2023-06-24 13:10:26,342 INFO [train.py:996] (0/4) Epoch 10, batch 17150, loss[loss=0.2038, simple_loss=0.278, pruned_loss=0.06478, over 21257.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3143, pruned_loss=0.0836, over 4293828.22 frames. ], batch size: 176, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:11:01,151 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.142e-03 2023-06-24 13:11:05,755 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:11:07,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1749732.0, ans=0.1 2023-06-24 13:11:22,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1749732.0, ans=0.0 2023-06-24 13:12:04,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1749852.0, ans=10.0 2023-06-24 13:12:07,596 INFO [train.py:996] (0/4) Epoch 10, batch 17200, loss[loss=0.2678, simple_loss=0.3305, pruned_loss=0.1026, over 21356.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3136, pruned_loss=0.08334, over 4291762.85 frames. ], batch size: 159, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 13:12:18,511 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.490e+02 5.928e+02 7.581e+02 1.081e+03 2.493e+03, threshold=1.516e+03, percent-clipped=2.0 2023-06-24 13:12:43,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1749972.0, ans=0.025 2023-06-24 13:13:52,377 INFO [train.py:996] (0/4) Epoch 10, batch 17250, loss[loss=0.2267, simple_loss=0.31, pruned_loss=0.07166, over 21850.00 frames. 
], tot_loss[loss=0.2446, simple_loss=0.3175, pruned_loss=0.08587, over 4290424.73 frames. ], batch size: 282, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 13:14:08,383 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=22.5 2023-06-24 13:14:34,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1750332.0, ans=0.1 2023-06-24 13:14:37,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1750332.0, ans=0.125 2023-06-24 13:15:12,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1750392.0, ans=0.125 2023-06-24 13:15:13,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1750452.0, ans=0.125 2023-06-24 13:15:13,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1750452.0, ans=0.125 2023-06-24 13:15:34,702 INFO [train.py:996] (0/4) Epoch 10, batch 17300, loss[loss=0.2536, simple_loss=0.3604, pruned_loss=0.07341, over 20928.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3251, pruned_loss=0.08898, over 4286781.50 frames. ], batch size: 607, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:15:42,403 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.710e+02 7.635e+02 9.609e+02 1.379e+03 2.737e+03, threshold=1.922e+03, percent-clipped=17.0 2023-06-24 13:15:47,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1750512.0, ans=0.2 2023-06-24 13:16:01,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1750572.0, ans=0.1 2023-06-24 13:17:06,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-24 13:17:11,832 INFO [train.py:996] (0/4) Epoch 10, batch 17350, loss[loss=0.2173, simple_loss=0.2936, pruned_loss=0.07047, over 21274.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3264, pruned_loss=0.08875, over 4283450.98 frames. ], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:17:15,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1750812.0, ans=0.125 2023-06-24 13:18:02,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1750932.0, ans=0.125 2023-06-24 13:18:12,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1750992.0, ans=0.125 2023-06-24 13:18:42,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1751052.0, ans=0.125 2023-06-24 13:18:49,865 INFO [train.py:996] (0/4) Epoch 10, batch 17400, loss[loss=0.2821, simple_loss=0.3629, pruned_loss=0.1006, over 21606.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3237, pruned_loss=0.08532, over 4272794.23 frames. 
], batch size: 441, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:18:57,489 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.089e+02 6.083e+02 9.753e+02 1.322e+03 2.899e+03, threshold=1.951e+03, percent-clipped=8.0 2023-06-24 13:19:27,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1751172.0, ans=0.2 2023-06-24 13:19:39,214 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=12.0 2023-06-24 13:19:40,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1751232.0, ans=0.125 2023-06-24 13:19:41,024 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.43 vs. limit=10.0 2023-06-24 13:19:43,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1751232.0, ans=0.05 2023-06-24 13:20:06,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1751292.0, ans=0.125 2023-06-24 13:20:18,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1751352.0, ans=0.125 2023-06-24 13:20:26,392 INFO [train.py:996] (0/4) Epoch 10, batch 17450, loss[loss=0.2059, simple_loss=0.2993, pruned_loss=0.05621, over 21687.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3208, pruned_loss=0.08365, over 4266871.14 frames. ], batch size: 298, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:20:32,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1751412.0, ans=0.125 2023-06-24 13:21:05,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1751472.0, ans=0.0 2023-06-24 13:21:37,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1751592.0, ans=0.125 2023-06-24 13:21:41,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1751592.0, ans=0.0 2023-06-24 13:21:55,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1751652.0, ans=0.2 2023-06-24 13:22:00,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1751652.0, ans=0.125 2023-06-24 13:22:02,710 INFO [train.py:996] (0/4) Epoch 10, batch 17500, loss[loss=0.2715, simple_loss=0.3252, pruned_loss=0.1089, over 21698.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3158, pruned_loss=0.08141, over 4277327.25 frames. ], batch size: 473, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:22:16,403 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.845e+02 5.907e+02 8.163e+02 1.225e+03 3.069e+03, threshold=1.633e+03, percent-clipped=7.0 2023-06-24 13:23:08,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.06 vs. 
limit=15.0 2023-06-24 13:23:13,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1751892.0, ans=0.0 2023-06-24 13:23:33,038 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-292000.pt 2023-06-24 13:23:39,117 INFO [train.py:996] (0/4) Epoch 10, batch 17550, loss[loss=0.2239, simple_loss=0.3095, pruned_loss=0.0692, over 21867.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3177, pruned_loss=0.08115, over 4277675.72 frames. ], batch size: 107, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:23:41,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-24 13:23:59,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1752072.0, ans=0.0 2023-06-24 13:24:11,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1752072.0, ans=0.0 2023-06-24 13:24:53,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1752252.0, ans=0.0 2023-06-24 13:25:10,571 INFO [train.py:996] (0/4) Epoch 10, batch 17600, loss[loss=0.2716, simple_loss=0.3524, pruned_loss=0.09535, over 21742.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3189, pruned_loss=0.08093, over 4273352.41 frames. ], batch size: 124, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:25:24,373 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.544e+02 6.468e+02 8.117e+02 1.176e+03 4.887e+03, threshold=1.623e+03, percent-clipped=13.0 2023-06-24 13:25:47,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0 2023-06-24 13:25:48,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1752372.0, ans=0.125 2023-06-24 13:25:54,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1752432.0, ans=0.0 2023-06-24 13:26:10,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1752432.0, ans=0.125 2023-06-24 13:26:11,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1752432.0, ans=0.1 2023-06-24 13:26:23,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1752492.0, ans=0.125 2023-06-24 13:26:23,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1752492.0, ans=0.0 2023-06-24 13:26:30,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1752552.0, ans=0.125 2023-06-24 13:26:38,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1752552.0, ans=0.1 2023-06-24 13:26:51,774 INFO [train.py:996] (0/4) Epoch 10, batch 17650, loss[loss=0.1956, simple_loss=0.2815, pruned_loss=0.05482, over 21728.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3165, pruned_loss=0.0802, over 4268345.14 frames. 
], batch size: 391, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:27:11,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1752612.0, ans=0.0 2023-06-24 13:27:12,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1752612.0, ans=0.125 2023-06-24 13:27:37,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1752732.0, ans=0.125 2023-06-24 13:27:40,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1752732.0, ans=0.0 2023-06-24 13:27:56,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1752792.0, ans=0.2 2023-06-24 13:28:33,527 INFO [train.py:996] (0/4) Epoch 10, batch 17700, loss[loss=0.2238, simple_loss=0.3124, pruned_loss=0.06757, over 21410.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3125, pruned_loss=0.07805, over 4270408.77 frames. ], batch size: 194, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:28:48,060 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.630e+02 6.330e+02 1.013e+03 1.605e+03 3.260e+03, threshold=2.027e+03, percent-clipped=24.0 2023-06-24 13:28:53,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1752972.0, ans=0.1 2023-06-24 13:29:15,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1753032.0, ans=0.05 2023-06-24 13:29:39,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1753092.0, ans=0.2 2023-06-24 13:30:16,886 INFO [train.py:996] (0/4) Epoch 10, batch 17750, loss[loss=0.2601, simple_loss=0.3433, pruned_loss=0.08843, over 21602.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3184, pruned_loss=0.08052, over 4265590.53 frames. ], batch size: 389, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:30:17,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1753212.0, ans=0.125 2023-06-24 13:30:38,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1753272.0, ans=0.1 2023-06-24 13:30:48,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1753272.0, ans=0.1 2023-06-24 13:31:29,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-24 13:31:51,455 INFO [train.py:996] (0/4) Epoch 10, batch 17800, loss[loss=0.2529, simple_loss=0.3311, pruned_loss=0.08738, over 21310.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3178, pruned_loss=0.07996, over 4259531.12 frames. 
], batch size: 549, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:32:07,440 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.377e+02 6.218e+02 8.448e+02 1.392e+03 2.915e+03, threshold=1.690e+03, percent-clipped=12.0 2023-06-24 13:32:08,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1753512.0, ans=0.125 2023-06-24 13:32:28,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1753632.0, ans=0.0 2023-06-24 13:32:51,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1753692.0, ans=0.025 2023-06-24 13:33:34,318 INFO [train.py:996] (0/4) Epoch 10, batch 17850, loss[loss=0.2578, simple_loss=0.3344, pruned_loss=0.09056, over 21677.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3191, pruned_loss=0.08045, over 4257083.75 frames. ], batch size: 351, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:34:43,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1753992.0, ans=0.125 2023-06-24 13:34:48,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1753992.0, ans=0.2 2023-06-24 13:35:12,054 INFO [train.py:996] (0/4) Epoch 10, batch 17900, loss[loss=0.2019, simple_loss=0.2661, pruned_loss=0.0689, over 21859.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3221, pruned_loss=0.08132, over 4265625.04 frames. ], batch size: 107, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:35:23,069 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.491e+02 6.124e+02 9.329e+02 1.248e+03 3.216e+03, threshold=1.866e+03, percent-clipped=9.0 2023-06-24 13:35:25,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1754112.0, ans=0.125 2023-06-24 13:35:46,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1754172.0, ans=0.125 2023-06-24 13:36:06,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.09 vs. limit=15.0 2023-06-24 13:36:13,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-24 13:36:39,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1754352.0, ans=0.2 2023-06-24 13:36:41,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1754352.0, ans=0.0 2023-06-24 13:36:52,237 INFO [train.py:996] (0/4) Epoch 10, batch 17950, loss[loss=0.197, simple_loss=0.2672, pruned_loss=0.06344, over 21857.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.322, pruned_loss=0.0786, over 4263923.70 frames. ], batch size: 98, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:36:54,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. 
limit=15.0 2023-06-24 13:36:59,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1754412.0, ans=0.0 2023-06-24 13:37:38,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1754532.0, ans=0.125 2023-06-24 13:37:38,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1754532.0, ans=0.0 2023-06-24 13:38:13,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1754652.0, ans=0.0 2023-06-24 13:38:27,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1754712.0, ans=0.1 2023-06-24 13:38:28,338 INFO [train.py:996] (0/4) Epoch 10, batch 18000, loss[loss=0.2117, simple_loss=0.2803, pruned_loss=0.07157, over 21610.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3143, pruned_loss=0.07684, over 4261215.74 frames. ], batch size: 248, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:38:28,339 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 13:38:47,229 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2575, simple_loss=0.3533, pruned_loss=0.08085, over 1796401.00 frames. 2023-06-24 13:38:47,230 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 13:39:02,584 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 8.313e+02 1.378e+03 2.030e+03 3.547e+03, threshold=2.755e+03, percent-clipped=28.0 2023-06-24 13:39:24,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1754832.0, ans=0.0 2023-06-24 13:39:39,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1754832.0, ans=0.125 2023-06-24 13:39:47,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1754892.0, ans=0.125 2023-06-24 13:40:09,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.79 vs. limit=6.0 2023-06-24 13:40:19,184 INFO [train.py:996] (0/4) Epoch 10, batch 18050, loss[loss=0.2471, simple_loss=0.3064, pruned_loss=0.09394, over 21915.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3093, pruned_loss=0.0764, over 4263996.23 frames. ], batch size: 373, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:40:48,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1755072.0, ans=0.125 2023-06-24 13:41:09,894 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-24 13:41:17,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1755132.0, ans=0.1 2023-06-24 13:41:54,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1755252.0, ans=0.125 2023-06-24 13:42:03,158 INFO [train.py:996] (0/4) Epoch 10, batch 18100, loss[loss=0.2844, simple_loss=0.3662, pruned_loss=0.1013, over 21606.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3141, pruned_loss=0.07859, over 4260724.48 frames. 
], batch size: 414, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:42:19,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.442e+02 6.227e+02 8.455e+02 1.236e+03 2.629e+03, threshold=1.691e+03, percent-clipped=0.0 2023-06-24 13:42:56,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1755432.0, ans=0.0 2023-06-24 13:43:05,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1755492.0, ans=0.0 2023-06-24 13:43:09,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1755492.0, ans=0.1 2023-06-24 13:43:13,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1755552.0, ans=0.0 2023-06-24 13:43:44,539 INFO [train.py:996] (0/4) Epoch 10, batch 18150, loss[loss=0.2629, simple_loss=0.3244, pruned_loss=0.1007, over 21571.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3159, pruned_loss=0.07772, over 4261888.62 frames. ], batch size: 391, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:44:00,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1755672.0, ans=0.0 2023-06-24 13:44:05,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-06-24 13:44:25,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1755732.0, ans=0.125 2023-06-24 13:44:27,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1755732.0, ans=0.2 2023-06-24 13:45:10,538 INFO [train.py:996] (0/4) Epoch 10, batch 18200, loss[loss=0.1996, simple_loss=0.2721, pruned_loss=0.06353, over 21809.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.311, pruned_loss=0.07758, over 4267881.31 frames. ], batch size: 352, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:45:24,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=15.0 2023-06-24 13:45:30,456 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.409e+02 6.816e+02 9.910e+02 1.570e+03 3.771e+03, threshold=1.982e+03, percent-clipped=24.0 2023-06-24 13:45:36,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1755972.0, ans=0.0 2023-06-24 13:45:39,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1755972.0, ans=0.125 2023-06-24 13:46:12,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1756092.0, ans=0.0 2023-06-24 13:46:26,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1756152.0, ans=10.0 2023-06-24 13:46:41,646 INFO [train.py:996] (0/4) Epoch 10, batch 18250, loss[loss=0.2236, simple_loss=0.2891, pruned_loss=0.07908, over 21568.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3012, pruned_loss=0.07447, over 4272316.67 frames. 
], batch size: 548, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:47:03,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1756212.0, ans=0.2 2023-06-24 13:47:13,308 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:47:19,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1756272.0, ans=0.125 2023-06-24 13:47:24,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1756332.0, ans=0.1 2023-06-24 13:47:48,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1756392.0, ans=0.125 2023-06-24 13:48:09,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1756452.0, ans=0.2 2023-06-24 13:48:12,412 INFO [train.py:996] (0/4) Epoch 10, batch 18300, loss[loss=0.2057, simple_loss=0.2788, pruned_loss=0.06627, over 21171.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3021, pruned_loss=0.07476, over 4274712.24 frames. ], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:48:15,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1756512.0, ans=0.0 2023-06-24 13:48:22,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1756512.0, ans=0.0 2023-06-24 13:48:23,313 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.671e+02 6.079e+02 7.788e+02 1.352e+03 4.344e+03, threshold=1.558e+03, percent-clipped=12.0 2023-06-24 13:49:01,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1756632.0, ans=0.0 2023-06-24 13:49:11,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5 2023-06-24 13:49:49,395 INFO [train.py:996] (0/4) Epoch 10, batch 18350, loss[loss=0.2216, simple_loss=0.3074, pruned_loss=0.06787, over 21700.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3098, pruned_loss=0.07489, over 4263086.15 frames. ], batch size: 247, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:49:49,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1756812.0, ans=0.0 2023-06-24 13:50:28,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1756932.0, ans=0.125 2023-06-24 13:50:42,074 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-24 13:50:50,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1756992.0, ans=0.0 2023-06-24 13:51:27,722 INFO [train.py:996] (0/4) Epoch 10, batch 18400, loss[loss=0.1829, simple_loss=0.2641, pruned_loss=0.05083, over 21558.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3045, pruned_loss=0.07349, over 4263553.24 frames. 
], batch size: 263, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 13:51:43,858 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.931e+02 6.378e+02 8.856e+02 1.210e+03 2.743e+03, threshold=1.771e+03, percent-clipped=10.0 2023-06-24 13:51:51,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1757172.0, ans=0.0 2023-06-24 13:51:53,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1757172.0, ans=0.2 2023-06-24 13:52:02,273 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=15.0 2023-06-24 13:52:05,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1757232.0, ans=0.125 2023-06-24 13:52:13,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=1757232.0, ans=12.0 2023-06-24 13:52:31,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1757292.0, ans=0.125 2023-06-24 13:52:44,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-24 13:52:59,717 INFO [train.py:996] (0/4) Epoch 10, batch 18450, loss[loss=0.1889, simple_loss=0.2742, pruned_loss=0.05183, over 21681.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2993, pruned_loss=0.07007, over 4272469.80 frames. ], batch size: 298, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 13:53:23,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1757472.0, ans=0.125 2023-06-24 13:53:33,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1757472.0, ans=0.125 2023-06-24 13:54:02,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1757592.0, ans=0.125 2023-06-24 13:54:17,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1757652.0, ans=0.0 2023-06-24 13:54:20,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1757652.0, ans=0.125 2023-06-24 13:54:35,876 INFO [train.py:996] (0/4) Epoch 10, batch 18500, loss[loss=0.2275, simple_loss=0.2908, pruned_loss=0.08208, over 21793.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2948, pruned_loss=0.06973, over 4272924.20 frames. 
], batch size: 352, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:54:37,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1757712.0, ans=0.0 2023-06-24 13:54:57,481 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.156e+02 5.768e+02 9.363e+02 1.391e+03 2.603e+03, threshold=1.873e+03, percent-clipped=9.0 2023-06-24 13:55:01,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1757772.0, ans=0.125 2023-06-24 13:55:29,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1757832.0, ans=0.125 2023-06-24 13:55:49,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1757892.0, ans=0.0 2023-06-24 13:55:52,530 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:56:12,637 INFO [train.py:996] (0/4) Epoch 10, batch 18550, loss[loss=0.1907, simple_loss=0.2611, pruned_loss=0.06017, over 15921.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2923, pruned_loss=0.06909, over 4264909.86 frames. ], batch size: 60, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:56:32,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-24 13:57:12,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1758192.0, ans=0.035 2023-06-24 13:57:23,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1758192.0, ans=0.125 2023-06-24 13:57:32,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1758252.0, ans=0.0 2023-06-24 13:57:45,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-24 13:57:49,280 INFO [train.py:996] (0/4) Epoch 10, batch 18600, loss[loss=0.2156, simple_loss=0.2855, pruned_loss=0.07284, over 21636.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2913, pruned_loss=0.07032, over 4259785.52 frames. ], batch size: 415, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:58:12,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.175e+02 6.384e+02 9.662e+02 1.486e+03 4.666e+03, threshold=1.932e+03, percent-clipped=18.0 2023-06-24 13:58:12,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1758312.0, ans=0.2 2023-06-24 13:59:02,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1758492.0, ans=0.125 2023-06-24 13:59:25,742 INFO [train.py:996] (0/4) Epoch 10, batch 18650, loss[loss=0.208, simple_loss=0.2758, pruned_loss=0.07009, over 21703.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2919, pruned_loss=0.07118, over 4262670.63 frames. 
], batch size: 333, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:59:27,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1758612.0, ans=0.125 2023-06-24 13:59:51,014 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-24 13:59:59,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1758672.0, ans=0.0 2023-06-24 14:00:48,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1758852.0, ans=0.125 2023-06-24 14:00:55,944 INFO [train.py:996] (0/4) Epoch 10, batch 18700, loss[loss=0.2267, simple_loss=0.3012, pruned_loss=0.07614, over 21864.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2908, pruned_loss=0.07238, over 4262138.46 frames. ], batch size: 118, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:01:02,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1758912.0, ans=0.0 2023-06-24 14:01:12,556 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.606e+02 6.882e+02 9.841e+02 1.662e+03 3.485e+03, threshold=1.968e+03, percent-clipped=16.0 2023-06-24 14:01:21,329 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:01:23,701 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-24 14:01:41,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1759032.0, ans=0.125 2023-06-24 14:01:42,392 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-24 14:02:28,422 INFO [train.py:996] (0/4) Epoch 10, batch 18750, loss[loss=0.2342, simple_loss=0.3048, pruned_loss=0.08183, over 21251.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.292, pruned_loss=0.07395, over 4263530.72 frames. ], batch size: 176, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:02:54,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1759272.0, ans=0.1 2023-06-24 14:03:06,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1759272.0, ans=0.125 2023-06-24 14:03:35,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-24 14:04:00,682 INFO [train.py:996] (0/4) Epoch 10, batch 18800, loss[loss=0.1812, simple_loss=0.268, pruned_loss=0.0472, over 21757.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2975, pruned_loss=0.07541, over 4265027.34 frames. 
], batch size: 247, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:04:22,332 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.381e+02 6.431e+02 9.700e+02 1.560e+03 3.348e+03, threshold=1.940e+03, percent-clipped=15.0 2023-06-24 14:04:30,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1759572.0, ans=0.125 2023-06-24 14:04:37,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1759572.0, ans=0.0 2023-06-24 14:04:39,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.40 vs. limit=10.0 2023-06-24 14:04:47,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1759632.0, ans=0.125 2023-06-24 14:04:49,186 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-24 14:05:31,432 INFO [train.py:996] (0/4) Epoch 10, batch 18850, loss[loss=0.2504, simple_loss=0.3115, pruned_loss=0.09464, over 19977.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2916, pruned_loss=0.07114, over 4254679.80 frames. ], batch size: 703, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:05:50,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1759812.0, ans=0.07 2023-06-24 14:05:59,365 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-24 14:07:07,596 INFO [train.py:996] (0/4) Epoch 10, batch 18900, loss[loss=0.2324, simple_loss=0.2934, pruned_loss=0.08573, over 21355.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2899, pruned_loss=0.07228, over 4255737.45 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:07:08,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1760112.0, ans=0.0 2023-06-24 14:07:10,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-24 14:07:12,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1760112.0, ans=0.1 2023-06-24 14:07:24,198 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.033e+02 6.602e+02 8.408e+02 1.056e+03 2.556e+03, threshold=1.682e+03, percent-clipped=3.0 2023-06-24 14:07:49,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1760172.0, ans=0.125 2023-06-24 14:08:03,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1760232.0, ans=0.125 2023-06-24 14:08:44,283 INFO [train.py:996] (0/4) Epoch 10, batch 18950, loss[loss=0.243, simple_loss=0.3099, pruned_loss=0.08801, over 21448.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2917, pruned_loss=0.07392, over 4255440.21 frames. 
], batch size: 131, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:09:53,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1760592.0, ans=0.0 2023-06-24 14:10:03,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1760592.0, ans=0.035 2023-06-24 14:10:20,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1760652.0, ans=0.2 2023-06-24 14:10:26,159 INFO [train.py:996] (0/4) Epoch 10, batch 19000, loss[loss=0.2775, simple_loss=0.3486, pruned_loss=0.1032, over 21260.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3012, pruned_loss=0.0761, over 4229629.22 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:10:49,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.433e+02 8.812e+02 1.283e+03 1.934e+03 4.893e+03, threshold=2.566e+03, percent-clipped=32.0 2023-06-24 14:10:57,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1760772.0, ans=0.125 2023-06-24 14:11:02,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1760772.0, ans=0.1 2023-06-24 14:11:05,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-24 14:11:30,659 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-24 14:11:35,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1760892.0, ans=0.125 2023-06-24 14:12:02,061 INFO [train.py:996] (0/4) Epoch 10, batch 19050, loss[loss=0.2487, simple_loss=0.315, pruned_loss=0.09116, over 21898.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3076, pruned_loss=0.08069, over 4249542.89 frames. ], batch size: 371, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:12:29,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-24 14:12:41,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1761132.0, ans=0.125 2023-06-24 14:12:57,002 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:13:10,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-24 14:13:15,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1761192.0, ans=0.125 2023-06-24 14:13:42,731 INFO [train.py:996] (0/4) Epoch 10, batch 19100, loss[loss=0.2247, simple_loss=0.2821, pruned_loss=0.08365, over 21192.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3058, pruned_loss=0.08103, over 4261269.21 frames. ], batch size: 176, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:13:59,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. 
limit=6.0 2023-06-24 14:14:01,622 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.691e+02 6.507e+02 8.205e+02 1.177e+03 2.298e+03, threshold=1.641e+03, percent-clipped=0.0 2023-06-24 14:15:21,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1761552.0, ans=0.0 2023-06-24 14:15:25,879 INFO [train.py:996] (0/4) Epoch 10, batch 19150, loss[loss=0.292, simple_loss=0.3901, pruned_loss=0.09692, over 21625.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3075, pruned_loss=0.08135, over 4260641.60 frames. ], batch size: 414, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:16:00,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1761672.0, ans=0.2 2023-06-24 14:16:00,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1761672.0, ans=0.125 2023-06-24 14:17:06,241 INFO [train.py:996] (0/4) Epoch 10, batch 19200, loss[loss=0.2679, simple_loss=0.3709, pruned_loss=0.08248, over 21931.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3189, pruned_loss=0.08252, over 4265155.79 frames. ], batch size: 372, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:17:10,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1761912.0, ans=0.0 2023-06-24 14:17:11,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1761912.0, ans=0.0 2023-06-24 14:17:22,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 7.091e+02 1.026e+03 1.602e+03 3.229e+03, threshold=2.053e+03, percent-clipped=24.0 2023-06-24 14:17:40,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1761972.0, ans=0.125 2023-06-24 14:17:45,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1762032.0, ans=0.125 2023-06-24 14:18:44,966 INFO [train.py:996] (0/4) Epoch 10, batch 19250, loss[loss=0.177, simple_loss=0.2695, pruned_loss=0.04231, over 21444.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.318, pruned_loss=0.07776, over 4260061.12 frames. ], batch size: 211, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:18:48,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1762212.0, ans=0.1 2023-06-24 14:18:54,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.19 vs. limit=22.5 2023-06-24 14:18:56,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1762212.0, ans=0.125 2023-06-24 14:19:24,380 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-24 14:20:06,664 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=22.5 2023-06-24 14:20:20,877 INFO [train.py:996] (0/4) Epoch 10, batch 19300, loss[loss=0.2431, simple_loss=0.305, pruned_loss=0.09066, over 21940.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3141, pruned_loss=0.07736, over 4267600.94 frames. 
], batch size: 113, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:20:36,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.942e+02 6.247e+02 8.864e+02 1.177e+03 3.202e+03, threshold=1.773e+03, percent-clipped=6.0 2023-06-24 14:20:57,362 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.88 vs. limit=15.0 2023-06-24 14:21:34,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.70 vs. limit=22.5 2023-06-24 14:21:40,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1762692.0, ans=0.125 2023-06-24 14:22:00,326 INFO [train.py:996] (0/4) Epoch 10, batch 19350, loss[loss=0.2083, simple_loss=0.2998, pruned_loss=0.05834, over 21736.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3084, pruned_loss=0.07417, over 4268389.87 frames. ], batch size: 391, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:22:05,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1762812.0, ans=0.0 2023-06-24 14:22:06,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1762812.0, ans=0.0 2023-06-24 14:22:26,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1762872.0, ans=0.95 2023-06-24 14:22:31,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-24 14:22:55,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1762992.0, ans=0.125 2023-06-24 14:23:13,087 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-24 14:23:29,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1763052.0, ans=0.0 2023-06-24 14:23:36,228 INFO [train.py:996] (0/4) Epoch 10, batch 19400, loss[loss=0.2208, simple_loss=0.2828, pruned_loss=0.07939, over 21217.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3068, pruned_loss=0.07356, over 4279021.19 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:23:38,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1763112.0, ans=0.1 2023-06-24 14:23:38,853 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=12.0 2023-06-24 14:23:57,755 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=22.5 2023-06-24 14:23:58,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.962e+02 7.006e+02 1.136e+03 1.736e+03 4.231e+03, threshold=2.271e+03, percent-clipped=24.0 2023-06-24 14:24:22,607 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.83 vs. 
limit=15.0 2023-06-24 14:25:11,470 INFO [train.py:996] (0/4) Epoch 10, batch 19450, loss[loss=0.2363, simple_loss=0.3099, pruned_loss=0.08132, over 21968.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3065, pruned_loss=0.07578, over 4286320.93 frames. ], batch size: 113, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:26:24,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1763592.0, ans=0.125 2023-06-24 14:26:44,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=22.5 2023-06-24 14:26:48,817 INFO [train.py:996] (0/4) Epoch 10, batch 19500, loss[loss=0.2087, simple_loss=0.2846, pruned_loss=0.0664, over 21257.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3014, pruned_loss=0.0763, over 4279192.61 frames. ], batch size: 548, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:27:11,216 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.173e+02 6.380e+02 1.047e+03 1.511e+03 3.799e+03, threshold=2.095e+03, percent-clipped=7.0 2023-06-24 14:27:45,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1763832.0, ans=0.125 2023-06-24 14:28:01,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1763892.0, ans=0.1 2023-06-24 14:28:25,376 INFO [train.py:996] (0/4) Epoch 10, batch 19550, loss[loss=0.2024, simple_loss=0.3036, pruned_loss=0.05057, over 21776.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.298, pruned_loss=0.07432, over 4275031.00 frames. ], batch size: 282, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:29:43,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1764192.0, ans=0.0 2023-06-24 14:30:01,057 INFO [train.py:996] (0/4) Epoch 10, batch 19600, loss[loss=0.2336, simple_loss=0.3, pruned_loss=0.08362, over 21640.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3002, pruned_loss=0.07533, over 4280942.38 frames. ], batch size: 263, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:30:28,056 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.111e+02 6.531e+02 1.025e+03 1.412e+03 3.718e+03, threshold=2.049e+03, percent-clipped=12.0 2023-06-24 14:31:38,455 INFO [train.py:996] (0/4) Epoch 10, batch 19650, loss[loss=0.2069, simple_loss=0.2855, pruned_loss=0.0642, over 21829.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3055, pruned_loss=0.07956, over 4283159.88 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:31:39,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1764612.0, ans=0.1 2023-06-24 14:32:18,946 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2023-06-24 14:32:48,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-24 14:33:28,007 INFO [train.py:996] (0/4) Epoch 10, batch 19700, loss[loss=0.1514, simple_loss=0.1838, pruned_loss=0.05945, over 16434.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3093, pruned_loss=0.07895, over 4274774.96 frames. 
], batch size: 61, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:33:28,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1764912.0, ans=0.09899494936611666 2023-06-24 14:33:39,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1764912.0, ans=0.0 2023-06-24 14:33:47,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1764972.0, ans=0.125 2023-06-24 14:33:54,883 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.700e+02 9.383e+02 1.272e+03 2.018e+03 4.455e+03, threshold=2.544e+03, percent-clipped=24.0 2023-06-24 14:34:03,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1764972.0, ans=0.0 2023-06-24 14:34:08,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1765032.0, ans=0.125 2023-06-24 14:34:09,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1765032.0, ans=0.2 2023-06-24 14:34:55,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1765152.0, ans=0.0 2023-06-24 14:35:08,189 INFO [train.py:996] (0/4) Epoch 10, batch 19750, loss[loss=0.3121, simple_loss=0.3894, pruned_loss=0.1174, over 21776.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3163, pruned_loss=0.07988, over 4281676.04 frames. ], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:35:08,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1765212.0, ans=0.1 2023-06-24 14:35:50,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1765332.0, ans=0.125 2023-06-24 14:35:58,672 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=22.5 2023-06-24 14:36:02,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1765392.0, ans=0.0 2023-06-24 14:36:23,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1765392.0, ans=0.125 2023-06-24 14:36:34,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1765452.0, ans=0.125 2023-06-24 14:36:48,233 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:36:49,176 INFO [train.py:996] (0/4) Epoch 10, batch 19800, loss[loss=0.1872, simple_loss=0.2677, pruned_loss=0.05333, over 21724.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3154, pruned_loss=0.08058, over 4289997.20 frames. 
], batch size: 298, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:37:11,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.026e+02 9.250e+02 1.586e+03 2.402e+03 4.902e+03, threshold=3.172e+03, percent-clipped=21.0 2023-06-24 14:37:21,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1765572.0, ans=0.125 2023-06-24 14:38:27,738 INFO [train.py:996] (0/4) Epoch 10, batch 19850, loss[loss=0.1759, simple_loss=0.2604, pruned_loss=0.04565, over 21642.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3102, pruned_loss=0.07607, over 4287731.94 frames. ], batch size: 230, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:39:38,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1766052.0, ans=15.0 2023-06-24 14:39:58,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=15.0 2023-06-24 14:40:03,690 INFO [train.py:996] (0/4) Epoch 10, batch 19900, loss[loss=0.2245, simple_loss=0.2886, pruned_loss=0.08021, over 21319.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3099, pruned_loss=0.07445, over 4283817.11 frames. ], batch size: 471, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:40:04,859 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.68 vs. limit=15.0 2023-06-24 14:40:13,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1766112.0, ans=0.0 2023-06-24 14:40:20,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.106e+02 8.584e+02 1.585e+03 3.373e+03, threshold=1.717e+03, percent-clipped=1.0 2023-06-24 14:40:26,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1766172.0, ans=0.125 2023-06-24 14:41:11,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1766352.0, ans=0.125 2023-06-24 14:41:36,320 INFO [train.py:996] (0/4) Epoch 10, batch 19950, loss[loss=0.1702, simple_loss=0.2441, pruned_loss=0.04808, over 21159.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3043, pruned_loss=0.0737, over 4279774.74 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:41:44,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1766412.0, ans=0.0 2023-06-24 14:42:01,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1766472.0, ans=0.0 2023-06-24 14:42:31,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1766592.0, ans=0.125 2023-06-24 14:43:04,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1766652.0, ans=0.2 2023-06-24 14:43:12,279 INFO [train.py:996] (0/4) Epoch 10, batch 20000, loss[loss=0.2574, simple_loss=0.3397, pruned_loss=0.08761, over 21725.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3044, pruned_loss=0.0744, over 4275027.65 frames. 
], batch size: 414, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:43:29,111 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.593e+02 7.271e+02 1.092e+03 1.631e+03 3.154e+03, threshold=2.184e+03, percent-clipped=18.0 2023-06-24 14:43:48,500 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-24 14:43:56,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1766832.0, ans=0.125 2023-06-24 14:44:47,356 INFO [train.py:996] (0/4) Epoch 10, batch 20050, loss[loss=0.2883, simple_loss=0.3471, pruned_loss=0.1148, over 21771.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3055, pruned_loss=0.07687, over 4277967.41 frames. ], batch size: 441, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:44:49,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1767012.0, ans=0.125 2023-06-24 14:44:59,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1767012.0, ans=0.09899494936611666 2023-06-24 14:45:08,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-06-24 14:45:33,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1767132.0, ans=0.125 2023-06-24 14:46:13,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1767252.0, ans=0.1 2023-06-24 14:46:26,771 INFO [train.py:996] (0/4) Epoch 10, batch 20100, loss[loss=0.3397, simple_loss=0.4276, pruned_loss=0.1259, over 21469.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3086, pruned_loss=0.07957, over 4281499.37 frames. ], batch size: 507, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:46:51,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.584e+02 6.086e+02 7.806e+02 1.176e+03 2.985e+03, threshold=1.561e+03, percent-clipped=5.0 2023-06-24 14:46:58,208 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-24 14:47:01,395 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=15.0 2023-06-24 14:48:00,013 INFO [train.py:996] (0/4) Epoch 10, batch 20150, loss[loss=0.2726, simple_loss=0.3498, pruned_loss=0.09771, over 21813.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3172, pruned_loss=0.08244, over 4283995.64 frames. ], batch size: 124, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:48:15,005 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:48:15,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1767612.0, ans=0.125 2023-06-24 14:48:26,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1767672.0, ans=0.0 2023-06-24 14:49:23,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. 
limit=22.5 2023-06-24 14:49:32,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1767852.0, ans=0.2 2023-06-24 14:49:51,321 INFO [train.py:996] (0/4) Epoch 10, batch 20200, loss[loss=0.2966, simple_loss=0.3905, pruned_loss=0.1013, over 21639.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3223, pruned_loss=0.08435, over 4282789.02 frames. ], batch size: 441, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:50:10,612 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.888e+02 8.305e+02 1.166e+03 1.860e+03 3.941e+03, threshold=2.331e+03, percent-clipped=33.0 2023-06-24 14:50:34,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1768032.0, ans=0.0 2023-06-24 14:50:35,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.61 vs. limit=12.0 2023-06-24 14:50:35,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1768032.0, ans=0.125 2023-06-24 14:51:20,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1768152.0, ans=0.07 2023-06-24 14:51:29,568 INFO [train.py:996] (0/4) Epoch 10, batch 20250, loss[loss=0.2265, simple_loss=0.2986, pruned_loss=0.07718, over 21225.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3228, pruned_loss=0.08316, over 4277704.70 frames. ], batch size: 143, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:52:06,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1768272.0, ans=0.2 2023-06-24 14:52:12,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1768332.0, ans=0.125 2023-06-24 14:52:20,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=1768332.0, ans=12.0 2023-06-24 14:52:21,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1768332.0, ans=0.0 2023-06-24 14:53:05,421 INFO [train.py:996] (0/4) Epoch 10, batch 20300, loss[loss=0.221, simple_loss=0.2861, pruned_loss=0.07793, over 21948.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3207, pruned_loss=0.08088, over 4282574.27 frames. ], batch size: 107, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:53:07,881 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-24 14:53:20,159 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.65 vs. 
limit=15.0 2023-06-24 14:53:28,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.352e+02 6.165e+02 8.569e+02 1.423e+03 2.886e+03, threshold=1.714e+03, percent-clipped=5.0 2023-06-24 14:53:31,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1768572.0, ans=0.0 2023-06-24 14:53:50,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1768632.0, ans=0.125 2023-06-24 14:53:51,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1768632.0, ans=0.0 2023-06-24 14:53:59,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1768692.0, ans=10.0 2023-06-24 14:54:15,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1768752.0, ans=0.0 2023-06-24 14:54:15,590 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=15.0 2023-06-24 14:54:26,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1768752.0, ans=0.2 2023-06-24 14:54:41,379 INFO [train.py:996] (0/4) Epoch 10, batch 20350, loss[loss=0.2477, simple_loss=0.3194, pruned_loss=0.08802, over 21862.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3193, pruned_loss=0.08048, over 4260125.29 frames. ], batch size: 118, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:55:15,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1768872.0, ans=0.125 2023-06-24 14:55:45,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1768992.0, ans=0.09899494936611666 2023-06-24 14:56:15,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1769052.0, ans=0.125 2023-06-24 14:56:19,307 INFO [train.py:996] (0/4) Epoch 10, batch 20400, loss[loss=0.2706, simple_loss=0.3275, pruned_loss=0.1069, over 21301.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3215, pruned_loss=0.08296, over 4260662.13 frames. ], batch size: 159, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:56:21,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-24 14:56:42,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.538e+02 7.746e+02 1.148e+03 1.668e+03 3.679e+03, threshold=2.297e+03, percent-clipped=22.0 2023-06-24 14:57:05,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1769232.0, ans=0.1 2023-06-24 14:57:11,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1769232.0, ans=0.035 2023-06-24 14:57:21,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. 
limit=15.0 2023-06-24 14:57:23,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1769292.0, ans=0.125 2023-06-24 14:57:31,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1769292.0, ans=0.125 2023-06-24 14:57:55,680 INFO [train.py:996] (0/4) Epoch 10, batch 20450, loss[loss=0.2263, simple_loss=0.2974, pruned_loss=0.07754, over 21816.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3238, pruned_loss=0.08582, over 4252859.88 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:58:35,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1769532.0, ans=0.2 2023-06-24 14:58:38,950 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=12.0 2023-06-24 14:59:32,135 INFO [train.py:996] (0/4) Epoch 10, batch 20500, loss[loss=0.2396, simple_loss=0.3016, pruned_loss=0.08882, over 21795.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3203, pruned_loss=0.08595, over 4259865.50 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:59:41,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1769712.0, ans=0.125 2023-06-24 15:00:01,820 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.514e+02 6.913e+02 8.863e+02 1.328e+03 2.262e+03, threshold=1.773e+03, percent-clipped=0.0 2023-06-24 15:00:21,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1769832.0, ans=0.125 2023-06-24 15:00:29,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1769892.0, ans=0.125 2023-06-24 15:00:30,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1769892.0, ans=0.125 2023-06-24 15:01:09,484 INFO [train.py:996] (0/4) Epoch 10, batch 20550, loss[loss=0.2291, simple_loss=0.3208, pruned_loss=0.06867, over 21660.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3133, pruned_loss=0.08432, over 4260025.90 frames. ], batch size: 332, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:01:14,195 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.90 vs. limit=15.0 2023-06-24 15:01:47,772 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-06-24 15:01:50,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1770132.0, ans=0.0 2023-06-24 15:02:46,104 INFO [train.py:996] (0/4) Epoch 10, batch 20600, loss[loss=0.2075, simple_loss=0.2745, pruned_loss=0.07028, over 21452.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3142, pruned_loss=0.08297, over 4261166.25 frames. 
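The ScheduledFloat entries report per-module hyper-parameters (dropout probabilities, skip rates, balancer probabilities, bypass scales) whose current value `ans` is looked up from the global batch_count. A minimal sketch of a piecewise-linear schedule over batch_count, with made-up breakpoints; the real scaling.py implementation is more elaborate:

class ScheduledFloatSketch:
    """Sketch: a float defined by (batch_count, value) breakpoints,
    linearly interpolated and held constant outside the breakpoints."""
    def __init__(self, *points):
        self.points = sorted(points)

    def value_at(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
        return pts[-1][1]

# e.g. a skip rate that decays from 0.2 to 0.0 over the first 20000 batches
# (breakpoints chosen for illustration only):
skip_rate = ScheduledFloatSketch((0.0, 0.2), (20000.0, 0.0))
print(skip_rate.value_at(1766112.0))   # long past the last breakpoint -> 0.0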
], batch size: 211, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:03:01,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1770312.0, ans=0.125 2023-06-24 15:03:14,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1770372.0, ans=0.2 2023-06-24 15:03:15,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.661e+02 6.737e+02 1.120e+03 2.042e+03 4.837e+03, threshold=2.240e+03, percent-clipped=29.0 2023-06-24 15:03:38,824 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-06-24 15:04:16,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1770552.0, ans=0.125 2023-06-24 15:04:18,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1770552.0, ans=0.1 2023-06-24 15:04:21,177 INFO [train.py:996] (0/4) Epoch 10, batch 20650, loss[loss=0.2049, simple_loss=0.2689, pruned_loss=0.07043, over 21356.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3113, pruned_loss=0.08346, over 4252550.56 frames. ], batch size: 131, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:05:08,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1770732.0, ans=0.0 2023-06-24 15:05:46,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1770852.0, ans=0.125 2023-06-24 15:05:58,618 INFO [train.py:996] (0/4) Epoch 10, batch 20700, loss[loss=0.278, simple_loss=0.3749, pruned_loss=0.09051, over 21226.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3042, pruned_loss=0.08, over 4253114.40 frames. ], batch size: 548, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:06:23,999 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.424e+02 6.389e+02 9.253e+02 1.399e+03 2.647e+03, threshold=1.851e+03, percent-clipped=4.0 2023-06-24 15:06:37,971 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-06-24 15:07:22,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1771152.0, ans=0.0 2023-06-24 15:07:42,284 INFO [train.py:996] (0/4) Epoch 10, batch 20750, loss[loss=0.213, simple_loss=0.264, pruned_loss=0.08099, over 20970.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3069, pruned_loss=0.07971, over 4243763.11 frames. ], batch size: 608, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:08:01,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=22.5 2023-06-24 15:09:20,755 INFO [train.py:996] (0/4) Epoch 10, batch 20800, loss[loss=0.1985, simple_loss=0.2627, pruned_loss=0.06712, over 21835.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3097, pruned_loss=0.08042, over 4241256.36 frames. 
], batch size: 118, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 15:09:47,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.552e+02 1.010e+03 1.567e+03 2.337e+03 4.966e+03, threshold=3.135e+03, percent-clipped=39.0 2023-06-24 15:09:52,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1771572.0, ans=0.04949747468305833 2023-06-24 15:10:06,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-24 15:10:40,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1771752.0, ans=0.0 2023-06-24 15:10:44,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-24 15:10:46,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1771752.0, ans=0.07 2023-06-24 15:10:48,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1771752.0, ans=0.125 2023-06-24 15:10:57,060 INFO [train.py:996] (0/4) Epoch 10, batch 20850, loss[loss=0.1544, simple_loss=0.2268, pruned_loss=0.04096, over 21558.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.305, pruned_loss=0.07828, over 4239979.50 frames. ], batch size: 212, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:11:08,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1771812.0, ans=0.125 2023-06-24 15:12:05,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1771992.0, ans=0.2 2023-06-24 15:12:28,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1772052.0, ans=0.0 2023-06-24 15:12:33,602 INFO [train.py:996] (0/4) Epoch 10, batch 20900, loss[loss=0.2388, simple_loss=0.3011, pruned_loss=0.08829, over 21247.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3048, pruned_loss=0.07916, over 4255978.12 frames. ], batch size: 159, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:12:59,929 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.344e+02 6.808e+02 1.169e+03 1.577e+03 3.825e+03, threshold=2.338e+03, percent-clipped=3.0 2023-06-24 15:13:37,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1772292.0, ans=0.1 2023-06-24 15:13:37,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1772292.0, ans=0.125 2023-06-24 15:13:40,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1772292.0, ans=0.0 2023-06-24 15:13:54,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1772352.0, ans=0.125 2023-06-24 15:14:09,070 INFO [train.py:996] (0/4) Epoch 10, batch 20950, loss[loss=0.2198, simple_loss=0.2882, pruned_loss=0.07573, over 21605.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.301, pruned_loss=0.07631, over 4262260.77 frames. 
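The grad_scale value that climbs from 16.0 to 32.0 and later falls back is the dynamic loss scale of mixed-precision training: it is grown after a run of overflow-free steps and halved when scaled gradients overflow. A hedged sketch of the same behaviour using the stock torch.cuda.amp.GradScaler on a toy model (the recipe itself may wrap its own scaler), assuming a CUDA device is available:

import torch

model = torch.nn.Linear(80, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(init_scale=16.0, growth_factor=2.0,
                                   backoff_factor=0.5, growth_interval=2000)

def train_step(features, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(features), targets)
    scaler.scale(loss).backward()    # backward on the scaled loss
    scaler.step(optimizer)           # unscales grads; skips the step on inf/nan
    scaler.update()                  # grows or backs off the scale
    return float(loss), scaler.get_scale()   # the value logged as grad_scale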
], batch size: 230, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:15:22,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-24 15:15:44,320 INFO [train.py:996] (0/4) Epoch 10, batch 21000, loss[loss=0.2198, simple_loss=0.3019, pruned_loss=0.06881, over 21863.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2992, pruned_loss=0.07602, over 4257788.43 frames. ], batch size: 316, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:15:44,321 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 15:16:03,224 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2634, simple_loss=0.3598, pruned_loss=0.08347, over 1796401.00 frames. 2023-06-24 15:16:03,224 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 15:16:16,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1772712.0, ans=0.125 2023-06-24 15:16:24,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.402e+02 6.763e+02 8.645e+02 1.170e+03 2.024e+03, threshold=1.729e+03, percent-clipped=0.0 2023-06-24 15:16:43,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1772832.0, ans=0.125 2023-06-24 15:16:46,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1772832.0, ans=0.0 2023-06-24 15:17:23,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1772952.0, ans=0.125 2023-06-24 15:17:33,531 INFO [train.py:996] (0/4) Epoch 10, batch 21050, loss[loss=0.1721, simple_loss=0.2419, pruned_loss=0.05112, over 16399.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2967, pruned_loss=0.07607, over 4251306.40 frames. ], batch size: 63, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:17:39,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1773012.0, ans=0.2 2023-06-24 15:17:39,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1773012.0, ans=0.1 2023-06-24 15:18:46,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1773192.0, ans=0.0 2023-06-24 15:19:06,330 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:19:08,759 INFO [train.py:996] (0/4) Epoch 10, batch 21100, loss[loss=0.2084, simple_loss=0.2801, pruned_loss=0.06841, over 21404.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2936, pruned_loss=0.07547, over 4245259.74 frames. ], batch size: 389, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:19:11,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=22.5 2023-06-24 15:19:30,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.08 vs. 
limit=15.0 2023-06-24 15:19:30,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1773372.0, ans=0.125 2023-06-24 15:19:36,716 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.103e+02 5.905e+02 7.931e+02 1.116e+03 2.788e+03, threshold=1.586e+03, percent-clipped=2.0 2023-06-24 15:20:45,198 INFO [train.py:996] (0/4) Epoch 10, batch 21150, loss[loss=0.2003, simple_loss=0.2668, pruned_loss=0.06689, over 21823.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.289, pruned_loss=0.07562, over 4252222.57 frames. ], batch size: 352, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:20:47,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1773612.0, ans=0.125 2023-06-24 15:21:03,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1773612.0, ans=0.0 2023-06-24 15:21:46,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1773792.0, ans=0.125 2023-06-24 15:22:21,500 INFO [train.py:996] (0/4) Epoch 10, batch 21200, loss[loss=0.2245, simple_loss=0.2817, pruned_loss=0.08358, over 21482.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2855, pruned_loss=0.07488, over 4250774.09 frames. ], batch size: 441, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:22:28,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2023-06-24 15:22:32,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1773912.0, ans=0.0 2023-06-24 15:22:34,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1773912.0, ans=0.1 2023-06-24 15:22:49,825 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.351e+02 6.353e+02 8.503e+02 1.111e+03 2.488e+03, threshold=1.701e+03, percent-clipped=3.0 2023-06-24 15:23:50,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1774152.0, ans=0.125 2023-06-24 15:23:55,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1774152.0, ans=0.2 2023-06-24 15:23:57,727 INFO [train.py:996] (0/4) Epoch 10, batch 21250, loss[loss=0.211, simple_loss=0.2828, pruned_loss=0.06963, over 21618.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2832, pruned_loss=0.07418, over 4247980.70 frames. ], batch size: 263, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:25:32,754 INFO [train.py:996] (0/4) Epoch 10, batch 21300, loss[loss=0.2478, simple_loss=0.3137, pruned_loss=0.09096, over 21910.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2893, pruned_loss=0.07629, over 4243305.56 frames. ], batch size: 107, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:26:02,202 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.637e+02 7.124e+02 9.830e+02 1.357e+03 3.184e+03, threshold=1.966e+03, percent-clipped=15.0 2023-06-24 15:26:04,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1774572.0, ans=0.95 2023-06-24 15:27:10,404 INFO [train.py:996] (0/4) Epoch 10, batch 21350, loss[loss=0.2338, simple_loss=0.2969, pruned_loss=0.0853, over 21325.00 frames. 
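The validation pass logged a little above ("Computing validation loss", then "validation: loss=0.2634 ... over 1796401.00 frames" and "Maximum memory allocated so far is 24431MB") averages the loss over the whole dev set and reports peak GPU memory. A minimal sketch of such a pass; compute_loss is a hypothetical stand-in for the model's loss function:

import torch

def validate(model, dev_loader, compute_loss, device="cuda:0"):
    """Sketch: frame-weighted average loss over the dev set, plus peak memory."""
    model.eval()
    loss_sum, frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch)  # hypothetical helper
            loss_sum += float(loss) * num_frames
            frames += num_frames
    model.train()
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    return loss_sum / max(frames, 1.0), max_mb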
], tot_loss[loss=0.2257, simple_loss=0.2952, pruned_loss=0.07805, over 4254762.23 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:27:32,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-24 15:27:53,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1774932.0, ans=0.1 2023-06-24 15:28:48,142 INFO [train.py:996] (0/4) Epoch 10, batch 21400, loss[loss=0.2766, simple_loss=0.3502, pruned_loss=0.1015, over 21553.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2979, pruned_loss=0.07736, over 4260469.11 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:28:56,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1775112.0, ans=10.0 2023-06-24 15:29:23,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.641e+02 5.834e+02 7.962e+02 1.308e+03 2.363e+03, threshold=1.592e+03, percent-clipped=6.0 2023-06-24 15:30:24,980 INFO [train.py:996] (0/4) Epoch 10, batch 21450, loss[loss=0.255, simple_loss=0.3201, pruned_loss=0.09493, over 21851.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.302, pruned_loss=0.07884, over 4270047.47 frames. ], batch size: 118, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:30:49,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.98 vs. limit=22.5 2023-06-24 15:30:51,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1775472.0, ans=0.035 2023-06-24 15:31:30,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1775592.0, ans=0.125 2023-06-24 15:32:06,978 INFO [train.py:996] (0/4) Epoch 10, batch 21500, loss[loss=0.2023, simple_loss=0.2664, pruned_loss=0.06915, over 21667.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3021, pruned_loss=0.08034, over 4267186.57 frames. ], batch size: 264, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:32:28,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1775772.0, ans=0.0 2023-06-24 15:32:36,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.266e+02 7.584e+02 1.027e+03 1.446e+03 3.225e+03, threshold=2.054e+03, percent-clipped=19.0 2023-06-24 15:32:39,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1775772.0, ans=0.05 2023-06-24 15:32:50,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.23 vs. 
limit=15.0 2023-06-24 15:33:07,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1775892.0, ans=0.125 2023-06-24 15:33:16,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1775892.0, ans=0.125 2023-06-24 15:33:33,001 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-296000.pt 2023-06-24 15:33:34,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1775952.0, ans=0.2 2023-06-24 15:33:45,358 INFO [train.py:996] (0/4) Epoch 10, batch 21550, loss[loss=0.184, simple_loss=0.2558, pruned_loss=0.05606, over 21730.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2953, pruned_loss=0.07741, over 4265152.37 frames. ], batch size: 282, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:33:47,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1776012.0, ans=0.015 2023-06-24 15:33:49,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1776012.0, ans=0.0 2023-06-24 15:34:50,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1776192.0, ans=0.125 2023-06-24 15:35:08,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1776252.0, ans=0.125 2023-06-24 15:35:25,111 INFO [train.py:996] (0/4) Epoch 10, batch 21600, loss[loss=0.1806, simple_loss=0.2513, pruned_loss=0.05501, over 21779.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2925, pruned_loss=0.07614, over 4269472.18 frames. ], batch size: 124, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:36:01,303 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.319e+02 8.052e+02 1.212e+03 1.997e+03 4.912e+03, threshold=2.424e+03, percent-clipped=21.0 2023-06-24 15:36:53,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1776552.0, ans=0.1 2023-06-24 15:37:01,710 INFO [train.py:996] (0/4) Epoch 10, batch 21650, loss[loss=0.315, simple_loss=0.4018, pruned_loss=0.1141, over 21470.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2955, pruned_loss=0.07401, over 4270464.10 frames. ], batch size: 507, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:37:28,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1776672.0, ans=0.125 2023-06-24 15:37:35,365 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-24 15:37:36,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1776672.0, ans=0.1 2023-06-24 15:37:42,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1776732.0, ans=0.1 2023-06-24 15:38:10,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=13.34 vs. limit=15.0 2023-06-24 15:38:37,617 INFO [train.py:996] (0/4) Epoch 10, batch 21700, loss[loss=0.2324, simple_loss=0.2937, pruned_loss=0.08557, over 21854.00 frames. 
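The "Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-296000.pt" entry above writes a checkpoint indexed by the global training batch. A minimal sketch of batch-indexed checkpointing; the field names are assumptions, and the real checkpoint.py likely stores sampler and scheduler state as well:

import torch

def save_checkpoint(exp_dir, batch_idx_train, model, optimizer, scaler=None):
    """Sketch: write model/optimizer state to exp_dir/checkpoint-<batch>.pt."""
    checkpoint = {
        "batch_idx_train": batch_idx_train,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "grad_scaler": scaler.state_dict() if scaler is not None else None,
    }
    path = f"{exp_dir}/checkpoint-{batch_idx_train}.pt"
    torch.save(checkpoint, path)
    return path

# e.g. save_checkpoint("zipformer/exp_L_small_causal", 296000, model, optimizer)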
], tot_loss[loss=0.2201, simple_loss=0.2959, pruned_loss=0.07217, over 4264026.78 frames. ], batch size: 98, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:38:37,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1776912.0, ans=0.125 2023-06-24 15:38:52,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1776912.0, ans=0.0 2023-06-24 15:39:12,929 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.168e+02 6.660e+02 9.555e+02 1.550e+03 3.491e+03, threshold=1.911e+03, percent-clipped=7.0 2023-06-24 15:39:42,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1777092.0, ans=0.125 2023-06-24 15:40:00,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-24 15:40:13,249 INFO [train.py:996] (0/4) Epoch 10, batch 21750, loss[loss=0.2171, simple_loss=0.2816, pruned_loss=0.07629, over 21825.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2922, pruned_loss=0.07279, over 4271069.75 frames. ], batch size: 98, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:40:13,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1777212.0, ans=0.125 2023-06-24 15:40:29,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1777212.0, ans=0.125 2023-06-24 15:40:31,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1777212.0, ans=0.125 2023-06-24 15:40:48,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.27 vs. limit=15.0 2023-06-24 15:41:50,274 INFO [train.py:996] (0/4) Epoch 10, batch 21800, loss[loss=0.2983, simple_loss=0.3776, pruned_loss=0.1095, over 21823.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.292, pruned_loss=0.07476, over 4276043.46 frames. ], batch size: 352, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:41:52,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1777512.0, ans=0.125 2023-06-24 15:42:15,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1777572.0, ans=0.125 2023-06-24 15:42:25,934 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.465e+02 6.460e+02 8.682e+02 1.144e+03 2.406e+03, threshold=1.736e+03, percent-clipped=3.0 2023-06-24 15:42:39,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1777632.0, ans=0.0 2023-06-24 15:43:03,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1777692.0, ans=0.1 2023-06-24 15:43:25,444 INFO [train.py:996] (0/4) Epoch 10, batch 21850, loss[loss=0.2444, simple_loss=0.3517, pruned_loss=0.06859, over 21662.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2985, pruned_loss=0.07567, over 4258985.23 frames. 
], batch size: 414, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:43:25,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1777812.0, ans=0.1 2023-06-24 15:43:57,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1777872.0, ans=0.125 2023-06-24 15:43:58,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1777872.0, ans=0.1 2023-06-24 15:44:27,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1777992.0, ans=0.035 2023-06-24 15:44:38,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1777992.0, ans=0.2 2023-06-24 15:44:56,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1778052.0, ans=0.07 2023-06-24 15:45:05,298 INFO [train.py:996] (0/4) Epoch 10, batch 21900, loss[loss=0.1982, simple_loss=0.2629, pruned_loss=0.06678, over 21837.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2971, pruned_loss=0.07667, over 4267662.55 frames. ], batch size: 107, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:45:09,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1778112.0, ans=0.5 2023-06-24 15:45:12,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1778112.0, ans=0.125 2023-06-24 15:45:36,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.519e+02 8.112e+02 1.126e+03 1.862e+03 4.122e+03, threshold=2.252e+03, percent-clipped=27.0 2023-06-24 15:45:54,422 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=12.0 2023-06-24 15:46:42,343 INFO [train.py:996] (0/4) Epoch 10, batch 21950, loss[loss=0.2162, simple_loss=0.2929, pruned_loss=0.0697, over 21494.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2927, pruned_loss=0.07472, over 4269242.17 frames. ], batch size: 509, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:47:20,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1778532.0, ans=0.125 2023-06-24 15:47:42,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1778592.0, ans=0.125 2023-06-24 15:47:58,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1778652.0, ans=0.125 2023-06-24 15:48:14,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1778652.0, ans=0.125 2023-06-24 15:48:18,380 INFO [train.py:996] (0/4) Epoch 10, batch 22000, loss[loss=0.2215, simple_loss=0.3083, pruned_loss=0.06738, over 21211.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2876, pruned_loss=0.07232, over 4273162.09 frames. 
], batch size: 549, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:48:44,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1778772.0, ans=0.125 2023-06-24 15:48:54,793 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.014e+02 5.065e+02 6.961e+02 1.085e+03 3.109e+03, threshold=1.392e+03, percent-clipped=2.0 2023-06-24 15:50:02,074 INFO [train.py:996] (0/4) Epoch 10, batch 22050, loss[loss=0.2717, simple_loss=0.3487, pruned_loss=0.09738, over 21354.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2937, pruned_loss=0.07409, over 4269910.30 frames. ], batch size: 143, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:50:10,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1779012.0, ans=0.125 2023-06-24 15:51:27,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1779252.0, ans=15.0 2023-06-24 15:51:38,360 INFO [train.py:996] (0/4) Epoch 10, batch 22100, loss[loss=0.2437, simple_loss=0.3159, pruned_loss=0.08576, over 21787.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3029, pruned_loss=0.0786, over 4275118.33 frames. ], batch size: 332, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:52:09,371 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.723e+02 7.259e+02 1.034e+03 1.568e+03 3.837e+03, threshold=2.069e+03, percent-clipped=34.0 2023-06-24 15:52:25,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1779432.0, ans=10.0 2023-06-24 15:52:26,017 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=22.5 2023-06-24 15:52:26,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1779432.0, ans=0.0 2023-06-24 15:52:43,238 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=15.0 2023-06-24 15:52:51,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1779552.0, ans=0.2 2023-06-24 15:53:16,265 INFO [train.py:996] (0/4) Epoch 10, batch 22150, loss[loss=0.252, simple_loss=0.322, pruned_loss=0.09099, over 21910.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3064, pruned_loss=0.08089, over 4284063.69 frames. ], batch size: 415, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:53:29,882 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=15.0 2023-06-24 15:53:49,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1779672.0, ans=0.1 2023-06-24 15:54:51,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1779852.0, ans=0.125 2023-06-24 15:54:54,166 INFO [train.py:996] (0/4) Epoch 10, batch 22200, loss[loss=0.2362, simple_loss=0.3236, pruned_loss=0.07445, over 21422.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3079, pruned_loss=0.0818, over 4292530.64 frames. 
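The Whitening entries compare a per-module statistic against a scheduled limit (e.g. "metric=12.66 vs. limit=15.0" above), flagging layers whose activations are far from decorrelated. One plausible form of such a metric, shown only as a guess at the idea and not the actual scaling.py definition, is a scale-invariant measure of how far the feature covariance is from a multiple of the identity; it equals 1.0 for perfectly white features and grows as the spectrum becomes more uneven:

import torch

def whitening_metric(x: torch.Tensor) -> float:
    """Sketch: num_channels * trace(C @ C) / trace(C)**2 for covariance C,
    which is 1.0 iff C is a multiple of the identity (assumed definition)."""
    x = x - x.mean(dim=0, keepdim=True)          # (num_frames, num_channels)
    num_channels = x.shape[1]
    cov = (x.t() @ x) / x.shape[0]
    return float(num_channels * torch.trace(cov @ cov) / torch.trace(cov) ** 2)

feats = torch.randn(1000, 256)
print(whitening_metric(feats))                                   # close to 1.0
print(whitening_metric(feats * torch.linspace(0.1, 3.0, 256)))   # noticeably larger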
], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:54:54,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1779912.0, ans=0.0 2023-06-24 15:55:24,792 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.644e+02 6.933e+02 1.129e+03 1.517e+03 2.505e+03, threshold=2.259e+03, percent-clipped=10.0 2023-06-24 15:56:30,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.72 vs. limit=10.0 2023-06-24 15:56:31,915 INFO [train.py:996] (0/4) Epoch 10, batch 22250, loss[loss=0.2543, simple_loss=0.321, pruned_loss=0.09373, over 21467.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3146, pruned_loss=0.08303, over 4288406.06 frames. ], batch size: 211, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:56:32,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1780212.0, ans=0.2 2023-06-24 15:56:46,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1780212.0, ans=0.1 2023-06-24 15:56:54,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1780272.0, ans=0.1 2023-06-24 15:57:30,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1780392.0, ans=0.1 2023-06-24 15:57:31,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1780392.0, ans=0.0 2023-06-24 15:57:32,384 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0 2023-06-24 15:57:51,171 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-24 15:58:04,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1780452.0, ans=15.0 2023-06-24 15:58:06,612 INFO [train.py:996] (0/4) Epoch 10, batch 22300, loss[loss=0.2816, simple_loss=0.3362, pruned_loss=0.1135, over 21635.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3155, pruned_loss=0.08452, over 4280520.04 frames. ], batch size: 471, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:58:37,330 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.918e+02 7.899e+02 1.099e+03 1.483e+03 2.745e+03, threshold=2.199e+03, percent-clipped=4.0 2023-06-24 15:58:45,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1780632.0, ans=0.125 2023-06-24 15:58:50,214 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:59:43,744 INFO [train.py:996] (0/4) Epoch 10, batch 22350, loss[loss=0.2318, simple_loss=0.3166, pruned_loss=0.07355, over 21610.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3138, pruned_loss=0.08507, over 4295178.79 frames. ], batch size: 473, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:59:47,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.73 vs. 
limit=22.5 2023-06-24 16:00:53,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1780992.0, ans=0.125 2023-06-24 16:00:56,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1780992.0, ans=0.0 2023-06-24 16:01:21,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1781052.0, ans=0.125 2023-06-24 16:01:22,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1781052.0, ans=0.125 2023-06-24 16:01:25,457 INFO [train.py:996] (0/4) Epoch 10, batch 22400, loss[loss=0.233, simple_loss=0.3008, pruned_loss=0.08259, over 21603.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3112, pruned_loss=0.08252, over 4289621.74 frames. ], batch size: 263, lr: 2.90e-03, grad_scale: 32.0 2023-06-24 16:01:36,944 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-24 16:01:51,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.474e+02 7.856e+02 9.897e+02 1.374e+03 2.984e+03, threshold=1.979e+03, percent-clipped=5.0 2023-06-24 16:02:10,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1781232.0, ans=0.125 2023-06-24 16:02:23,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1781292.0, ans=0.125 2023-06-24 16:02:55,982 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-24 16:02:56,374 INFO [train.py:996] (0/4) Epoch 10, batch 22450, loss[loss=0.2075, simple_loss=0.267, pruned_loss=0.07399, over 21331.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3047, pruned_loss=0.08097, over 4286178.88 frames. ], batch size: 144, lr: 2.90e-03, grad_scale: 32.0 2023-06-24 16:03:09,649 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=15.0 2023-06-24 16:03:18,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1781472.0, ans=0.1 2023-06-24 16:03:28,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1781532.0, ans=0.0 2023-06-24 16:03:42,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1781532.0, ans=0.125 2023-06-24 16:03:54,297 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.60 vs. limit=15.0 2023-06-24 16:04:33,437 INFO [train.py:996] (0/4) Epoch 10, batch 22500, loss[loss=0.244, simple_loss=0.3266, pruned_loss=0.08073, over 21234.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3013, pruned_loss=0.08076, over 4269881.10 frames. 
], batch size: 159, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:04:39,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1781712.0, ans=0.125 2023-06-24 16:04:42,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1781712.0, ans=0.125 2023-06-24 16:04:48,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-24 16:04:52,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1781772.0, ans=0.125 2023-06-24 16:04:54,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1781772.0, ans=0.2 2023-06-24 16:05:06,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.137e+02 1.060e+03 1.856e+03 3.830e+03, threshold=2.121e+03, percent-clipped=17.0 2023-06-24 16:05:32,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1781892.0, ans=0.125 2023-06-24 16:06:08,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1781952.0, ans=0.125 2023-06-24 16:06:10,769 INFO [train.py:996] (0/4) Epoch 10, batch 22550, loss[loss=0.1755, simple_loss=0.2297, pruned_loss=0.0607, over 20736.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3046, pruned_loss=0.08152, over 4278381.62 frames. ], batch size: 607, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:07:38,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1782252.0, ans=0.125 2023-06-24 16:07:49,284 INFO [train.py:996] (0/4) Epoch 10, batch 22600, loss[loss=0.2427, simple_loss=0.3217, pruned_loss=0.08181, over 21746.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3072, pruned_loss=0.08134, over 4281867.52 frames. ], batch size: 351, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:07:52,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1782312.0, ans=0.0 2023-06-24 16:08:15,796 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-24 16:08:25,427 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-24 16:08:27,269 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.566e+02 7.815e+02 1.188e+03 1.926e+03 4.524e+03, threshold=2.375e+03, percent-clipped=20.0 2023-06-24 16:08:32,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1782432.0, ans=0.125 2023-06-24 16:09:22,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1782552.0, ans=0.015 2023-06-24 16:09:25,460 INFO [train.py:996] (0/4) Epoch 10, batch 22650, loss[loss=0.1946, simple_loss=0.2598, pruned_loss=0.06474, over 21981.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3035, pruned_loss=0.08129, over 4269782.14 frames. 
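Each tot_loss above is reported "over N frames" with N in the millions, which reads like a frame-weighted running average of the recent per-batch losses. A minimal sketch of such bookkeeping, assumed rather than taken from the actual metrics code:

class FrameWeightedLoss:
    """Sketch: tot_loss as sum(loss_i * frames_i) / sum(frames_i)."""
    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        self.loss_sum += loss * num_frames
        self.frames += num_frames

    @property
    def average(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tracker = FrameWeightedLoss()
tracker.update(0.1755, 20736.0)   # per-batch figures as logged for batch 22550
print(f"tot_loss={tracker.average:.4f} over {tracker.frames:.2f} frames")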
], batch size: 103, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:09:35,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1782612.0, ans=0.2 2023-06-24 16:09:56,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1782672.0, ans=0.125 2023-06-24 16:10:03,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1782672.0, ans=0.09899494936611666 2023-06-24 16:10:10,671 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-24 16:10:52,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1782852.0, ans=0.125 2023-06-24 16:10:54,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1782852.0, ans=0.125 2023-06-24 16:11:01,273 INFO [train.py:996] (0/4) Epoch 10, batch 22700, loss[loss=0.2118, simple_loss=0.2679, pruned_loss=0.07782, over 21198.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2979, pruned_loss=0.08074, over 4256200.61 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:11:38,445 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.321e+02 7.699e+02 1.029e+03 1.382e+03 2.516e+03, threshold=2.058e+03, percent-clipped=2.0 2023-06-24 16:12:16,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1783092.0, ans=0.5 2023-06-24 16:12:22,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1783152.0, ans=0.125 2023-06-24 16:12:37,728 INFO [train.py:996] (0/4) Epoch 10, batch 22750, loss[loss=0.2815, simple_loss=0.3502, pruned_loss=0.1064, over 21501.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3019, pruned_loss=0.08216, over 4259183.37 frames. ], batch size: 131, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:12:41,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1783212.0, ans=10.0 2023-06-24 16:12:41,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1783212.0, ans=0.125 2023-06-24 16:12:44,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1783212.0, ans=0.125 2023-06-24 16:13:36,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1783332.0, ans=0.035 2023-06-24 16:14:00,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1783452.0, ans=0.2 2023-06-24 16:14:14,073 INFO [train.py:996] (0/4) Epoch 10, batch 22800, loss[loss=0.2484, simple_loss=0.314, pruned_loss=0.09137, over 21853.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3059, pruned_loss=0.08472, over 4265945.01 frames. 
], batch size: 414, lr: 2.90e-03, grad_scale: 32.0 2023-06-24 16:14:17,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1783512.0, ans=0.125 2023-06-24 16:14:51,216 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 7.113e+02 9.771e+02 1.479e+03 3.289e+03, threshold=1.954e+03, percent-clipped=6.0 2023-06-24 16:15:38,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1783752.0, ans=0.125 2023-06-24 16:15:49,976 INFO [train.py:996] (0/4) Epoch 10, batch 22850, loss[loss=0.1962, simple_loss=0.2616, pruned_loss=0.06543, over 21674.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3014, pruned_loss=0.08355, over 4276542.32 frames. ], batch size: 299, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:16:41,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-06-24 16:17:00,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=22.5 2023-06-24 16:17:19,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.05 vs. limit=6.0 2023-06-24 16:17:27,983 INFO [train.py:996] (0/4) Epoch 10, batch 22900, loss[loss=0.2243, simple_loss=0.3234, pruned_loss=0.06265, over 21541.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.302, pruned_loss=0.08278, over 4278033.28 frames. ], batch size: 230, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:17:38,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1784112.0, ans=0.125 2023-06-24 16:17:43,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1784112.0, ans=0.0 2023-06-24 16:18:12,794 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.680e+02 7.158e+02 1.057e+03 1.638e+03 3.126e+03, threshold=2.114e+03, percent-clipped=14.0 2023-06-24 16:18:41,862 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:18:59,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1784352.0, ans=0.125 2023-06-24 16:19:16,535 INFO [train.py:996] (0/4) Epoch 10, batch 22950, loss[loss=0.2634, simple_loss=0.3941, pruned_loss=0.06632, over 21278.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3138, pruned_loss=0.0815, over 4277621.12 frames. 
], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:19:31,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1784472.0, ans=0.125 2023-06-24 16:20:13,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1784592.0, ans=0.2 2023-06-24 16:20:20,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1784592.0, ans=0.0 2023-06-24 16:20:34,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1784652.0, ans=0.07 2023-06-24 16:20:52,506 INFO [train.py:996] (0/4) Epoch 10, batch 23000, loss[loss=0.2622, simple_loss=0.3281, pruned_loss=0.09814, over 21509.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3153, pruned_loss=0.07912, over 4278543.04 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:20:58,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=12.0 2023-06-24 16:20:59,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1784712.0, ans=0.04949747468305833 2023-06-24 16:21:11,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1784772.0, ans=0.05 2023-06-24 16:21:16,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1784772.0, ans=0.0 2023-06-24 16:21:30,662 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.479e+02 7.106e+02 9.780e+02 1.454e+03 3.933e+03, threshold=1.956e+03, percent-clipped=7.0 2023-06-24 16:21:42,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=15.0 2023-06-24 16:21:49,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-06-24 16:22:36,146 INFO [train.py:996] (0/4) Epoch 10, batch 23050, loss[loss=0.3561, simple_loss=0.398, pruned_loss=0.1571, over 21358.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3157, pruned_loss=0.08098, over 4284420.00 frames. ], batch size: 508, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:23:00,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1785072.0, ans=0.0 2023-06-24 16:23:25,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1785132.0, ans=0.1 2023-06-24 16:24:01,224 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-24 16:24:11,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1785252.0, ans=0.125 2023-06-24 16:24:13,730 INFO [train.py:996] (0/4) Epoch 10, batch 23100, loss[loss=0.1941, simple_loss=0.2565, pruned_loss=0.06582, over 21244.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3107, pruned_loss=0.08095, over 4282279.93 frames. 
], batch size: 144, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:24:28,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1785312.0, ans=0.04949747468305833 2023-06-24 16:24:41,125 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-24 16:24:46,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1785372.0, ans=0.0 2023-06-24 16:24:47,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.180e+02 6.884e+02 9.412e+02 1.257e+03 2.198e+03, threshold=1.882e+03, percent-clipped=3.0 2023-06-24 16:24:53,571 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=22.5 2023-06-24 16:25:08,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1785492.0, ans=0.0 2023-06-24 16:25:50,176 INFO [train.py:996] (0/4) Epoch 10, batch 23150, loss[loss=0.2599, simple_loss=0.3186, pruned_loss=0.1007, over 21779.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3053, pruned_loss=0.08057, over 4281673.91 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:25:53,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1785612.0, ans=0.1 2023-06-24 16:26:16,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1785672.0, ans=0.1 2023-06-24 16:26:41,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1785792.0, ans=0.1 2023-06-24 16:27:17,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.18 vs. limit=10.0 2023-06-24 16:27:20,644 INFO [train.py:996] (0/4) Epoch 10, batch 23200, loss[loss=0.2202, simple_loss=0.2848, pruned_loss=0.07779, over 21862.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3041, pruned_loss=0.08109, over 4285311.86 frames. ], batch size: 247, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:27:31,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1785912.0, ans=0.125 2023-06-24 16:28:00,007 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.595e+02 6.989e+02 9.310e+02 1.266e+03 2.936e+03, threshold=1.862e+03, percent-clipped=9.0 2023-06-24 16:28:08,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1786032.0, ans=0.125 2023-06-24 16:29:01,257 INFO [train.py:996] (0/4) Epoch 10, batch 23250, loss[loss=0.2507, simple_loss=0.3137, pruned_loss=0.09389, over 21748.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3052, pruned_loss=0.08284, over 4287371.72 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:29:22,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. 
limit=6.0 2023-06-24 16:29:23,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1786272.0, ans=15.0 2023-06-24 16:29:44,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=12.0 2023-06-24 16:30:05,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1786392.0, ans=0.2 2023-06-24 16:30:38,200 INFO [train.py:996] (0/4) Epoch 10, batch 23300, loss[loss=0.2886, simple_loss=0.3771, pruned_loss=0.1, over 21715.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3124, pruned_loss=0.0843, over 4290558.65 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:31:07,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5 2023-06-24 16:31:14,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.511e+02 8.047e+02 1.147e+03 1.597e+03 3.212e+03, threshold=2.293e+03, percent-clipped=17.0 2023-06-24 16:32:05,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1786752.0, ans=0.0 2023-06-24 16:32:20,654 INFO [train.py:996] (0/4) Epoch 10, batch 23350, loss[loss=0.267, simple_loss=0.3533, pruned_loss=0.09031, over 21309.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.316, pruned_loss=0.08333, over 4288720.56 frames. ], batch size: 549, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:32:46,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.66 vs. limit=6.0 2023-06-24 16:33:57,553 INFO [train.py:996] (0/4) Epoch 10, batch 23400, loss[loss=0.2507, simple_loss=0.3144, pruned_loss=0.09353, over 21545.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3097, pruned_loss=0.07959, over 4290483.83 frames. 
], batch size: 548, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:33:59,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1787112.0, ans=0.0 2023-06-24 16:34:05,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1787112.0, ans=0.125 2023-06-24 16:34:15,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1787172.0, ans=10.0 2023-06-24 16:34:20,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1787172.0, ans=0.0 2023-06-24 16:34:28,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.034e+02 7.282e+02 9.879e+02 1.387e+03 3.167e+03, threshold=1.976e+03, percent-clipped=3.0 2023-06-24 16:35:00,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1787292.0, ans=0.125 2023-06-24 16:35:14,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1787352.0, ans=0.0 2023-06-24 16:35:21,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1787352.0, ans=0.1 2023-06-24 16:35:34,826 INFO [train.py:996] (0/4) Epoch 10, batch 23450, loss[loss=0.2768, simple_loss=0.3383, pruned_loss=0.1077, over 21849.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3103, pruned_loss=0.08134, over 4287887.19 frames. ], batch size: 247, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:36:37,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=1787592.0, ans=12.0 2023-06-24 16:36:58,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1787652.0, ans=0.125 2023-06-24 16:37:04,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-24 16:37:09,842 INFO [train.py:996] (0/4) Epoch 10, batch 23500, loss[loss=0.2491, simple_loss=0.3068, pruned_loss=0.09574, over 21472.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3105, pruned_loss=0.08328, over 4291854.42 frames. 
], batch size: 194, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:37:15,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1787712.0, ans=0.0 2023-06-24 16:37:18,143 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:37:29,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1787772.0, ans=0.125 2023-06-24 16:37:31,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1787772.0, ans=0.125 2023-06-24 16:37:35,024 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:37:45,943 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.121e+02 6.777e+02 9.961e+02 1.518e+03 3.385e+03, threshold=1.992e+03, percent-clipped=9.0 2023-06-24 16:38:46,202 INFO [train.py:996] (0/4) Epoch 10, batch 23550, loss[loss=0.2398, simple_loss=0.2909, pruned_loss=0.09441, over 21196.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.306, pruned_loss=0.08309, over 4293467.14 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:39:09,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1788072.0, ans=0.125 2023-06-24 16:39:43,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1788192.0, ans=0.0 2023-06-24 16:39:56,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1788192.0, ans=0.0 2023-06-24 16:40:10,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1788252.0, ans=0.1 2023-06-24 16:40:16,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=12.0 2023-06-24 16:40:18,354 INFO [train.py:996] (0/4) Epoch 10, batch 23600, loss[loss=0.2473, simple_loss=0.3257, pruned_loss=0.08443, over 21507.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3063, pruned_loss=0.08324, over 4295645.68 frames. ], batch size: 194, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 16:40:22,815 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.91 vs. limit=15.0 2023-06-24 16:40:25,526 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.64 vs. limit=22.5 2023-06-24 16:41:05,504 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.671e+02 8.088e+02 1.163e+03 1.528e+03 3.406e+03, threshold=2.327e+03, percent-clipped=14.0 2023-06-24 16:41:17,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-24 16:41:55,368 INFO [train.py:996] (0/4) Epoch 10, batch 23650, loss[loss=0.2319, simple_loss=0.3076, pruned_loss=0.07814, over 21456.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3047, pruned_loss=0.08123, over 4283839.14 frames. 
], batch size: 194, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:42:32,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1788672.0, ans=0.125 2023-06-24 16:42:34,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1788672.0, ans=0.0 2023-06-24 16:43:27,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1788852.0, ans=0.125 2023-06-24 16:43:38,466 INFO [train.py:996] (0/4) Epoch 10, batch 23700, loss[loss=0.2753, simple_loss=0.3416, pruned_loss=0.1045, over 21399.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3086, pruned_loss=0.08145, over 4282925.86 frames. ], batch size: 509, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:43:43,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1788912.0, ans=0.09899494936611666 2023-06-24 16:44:26,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.212e+02 6.427e+02 8.775e+02 1.177e+03 2.225e+03, threshold=1.755e+03, percent-clipped=0.0 2023-06-24 16:44:28,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-24 16:44:32,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1789032.0, ans=0.125 2023-06-24 16:45:06,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1789152.0, ans=0.1 2023-06-24 16:45:22,449 INFO [train.py:996] (0/4) Epoch 10, batch 23750, loss[loss=0.1972, simple_loss=0.2831, pruned_loss=0.0556, over 21324.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3108, pruned_loss=0.08151, over 4281701.02 frames. ], batch size: 176, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:45:23,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1789212.0, ans=0.5 2023-06-24 16:45:24,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1789212.0, ans=0.125 2023-06-24 16:45:40,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1789212.0, ans=0.0 2023-06-24 16:45:51,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1789272.0, ans=0.125 2023-06-24 16:45:52,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1789272.0, ans=0.1 2023-06-24 16:45:57,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1789272.0, ans=0.125 2023-06-24 16:46:07,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1789332.0, ans=0.125 2023-06-24 16:46:42,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1789452.0, ans=0.125 2023-06-24 16:47:01,130 INFO [train.py:996] (0/4) Epoch 10, batch 23800, loss[loss=0.2348, simple_loss=0.3296, pruned_loss=0.06995, over 21787.00 frames. 
], tot_loss[loss=0.2338, simple_loss=0.3095, pruned_loss=0.07907, over 4274168.99 frames. ], batch size: 316, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:47:39,072 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.271e+02 7.251e+02 1.165e+03 1.659e+03 4.396e+03, threshold=2.330e+03, percent-clipped=22.0 2023-06-24 16:47:57,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1789692.0, ans=0.125 2023-06-24 16:48:44,370 INFO [train.py:996] (0/4) Epoch 10, batch 23850, loss[loss=0.2429, simple_loss=0.3187, pruned_loss=0.08356, over 21469.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3205, pruned_loss=0.08097, over 4265317.20 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:48:56,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1789812.0, ans=0.0 2023-06-24 16:50:20,410 INFO [train.py:996] (0/4) Epoch 10, batch 23900, loss[loss=0.2711, simple_loss=0.3575, pruned_loss=0.09233, over 21694.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3257, pruned_loss=0.08333, over 4270789.20 frames. ], batch size: 332, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:50:31,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1790112.0, ans=0.0 2023-06-24 16:50:46,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1790172.0, ans=0.0 2023-06-24 16:50:46,895 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-24 16:50:59,825 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.038e+02 1.066e+03 1.472e+03 2.085e+03 4.372e+03, threshold=2.943e+03, percent-clipped=19.0 2023-06-24 16:51:58,866 INFO [train.py:996] (0/4) Epoch 10, batch 23950, loss[loss=0.2313, simple_loss=0.2833, pruned_loss=0.08964, over 21364.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3214, pruned_loss=0.08348, over 4276294.71 frames. ], batch size: 194, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:53:36,875 INFO [train.py:996] (0/4) Epoch 10, batch 24000, loss[loss=0.2946, simple_loss=0.356, pruned_loss=0.1166, over 21576.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3225, pruned_loss=0.08639, over 4276701.91 frames. ], batch size: 415, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 16:53:36,876 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 16:53:49,106 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.6621, 2.9876, 2.8945, 1.5605], device='cuda:0') 2023-06-24 16:53:52,746 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2655, simple_loss=0.3589, pruned_loss=0.08609, over 1796401.00 frames. 
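Reading the loss fields in these entries: the reported loss is consistent with a weighted sum of the two transducer losses, loss ≈ 0.5 * simple_loss + pruned_loss, which matches the 'simple_loss_scale': 0.5 recorded in the startup configuration. For the validation entry above, 0.5 * 0.3589 + 0.08609 ≈ 0.2655. The short sketch below only illustrates this relationship under that assumption; the function and constant names are hypothetical and are not taken from the training code.

# Illustrative sketch only: how the logged "loss" relates to "simple_loss" and
# "pruned_loss" assuming simple_loss_scale = 0.5. All names here are hypothetical.
SIMPLE_LOSS_SCALE = 0.5  # from 'simple_loss_scale': 0.5 in the logged configuration

def combined_loss(simple_loss: float, pruned_loss: float) -> float:
    # Weighted combination consistent with the values printed in the log.
    return SIMPLE_LOSS_SCALE * simple_loss + pruned_loss

# Validation entry above: loss=0.2655, simple_loss=0.3589, pruned_loss=0.08609
assert abs(combined_loss(0.3589, 0.08609) - 0.2655) < 5e-4
# Training entry for batch 24000: loss=0.2476, simple_loss=0.3225, pruned_loss=0.08639
assert abs(combined_loss(0.3225, 0.08639) - 0.2476) < 5e-4

The same relationship holds for the running tot_loss values throughout this section, so the simple and pruned components can be tracked independently when comparing entries.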
2023-06-24 16:53:52,747 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 16:54:41,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 7.229e+02 9.460e+02 1.386e+03 2.838e+03, threshold=1.892e+03, percent-clipped=0.0 2023-06-24 16:55:10,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1790892.0, ans=0.125 2023-06-24 16:55:21,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1790952.0, ans=0.125 2023-06-24 16:55:29,415 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:55:31,759 INFO [train.py:996] (0/4) Epoch 10, batch 24050, loss[loss=0.225, simple_loss=0.3138, pruned_loss=0.06809, over 21866.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3246, pruned_loss=0.08756, over 4281362.37 frames. ], batch size: 316, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 16:55:40,354 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:56:32,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-24 16:57:00,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1791252.0, ans=22.5 2023-06-24 16:57:13,075 INFO [train.py:996] (0/4) Epoch 10, batch 24100, loss[loss=0.2898, simple_loss=0.3675, pruned_loss=0.1061, over 21576.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3221, pruned_loss=0.08498, over 4275340.70 frames. ], batch size: 414, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 16:57:13,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1791312.0, ans=0.0 2023-06-24 16:57:33,029 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-24 16:57:50,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1791372.0, ans=0.125 2023-06-24 16:58:01,296 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.582e+02 7.131e+02 9.675e+02 1.402e+03 3.208e+03, threshold=1.935e+03, percent-clipped=13.0 2023-06-24 16:58:30,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1791552.0, ans=0.0 2023-06-24 16:58:52,138 INFO [train.py:996] (0/4) Epoch 10, batch 24150, loss[loss=0.2971, simple_loss=0.3385, pruned_loss=0.1278, over 21777.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3208, pruned_loss=0.08583, over 4280222.56 frames. 
], batch size: 508, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:59:27,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1791672.0, ans=0.0 2023-06-24 16:59:36,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1791732.0, ans=0.125 2023-06-24 16:59:44,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1791732.0, ans=0.125 2023-06-24 16:59:54,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=22.5 2023-06-24 16:59:55,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1791792.0, ans=0.125 2023-06-24 17:00:30,317 INFO [train.py:996] (0/4) Epoch 10, batch 24200, loss[loss=0.2938, simple_loss=0.3656, pruned_loss=0.111, over 21812.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3233, pruned_loss=0.08699, over 4282299.46 frames. ], batch size: 333, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:00:43,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1791912.0, ans=0.07 2023-06-24 17:00:46,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1791912.0, ans=0.1 2023-06-24 17:01:15,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.664e+02 7.347e+02 1.079e+03 1.481e+03 2.381e+03, threshold=2.159e+03, percent-clipped=5.0 2023-06-24 17:01:35,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1792092.0, ans=0.1 2023-06-24 17:01:50,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1792152.0, ans=0.07 2023-06-24 17:02:14,145 INFO [train.py:996] (0/4) Epoch 10, batch 24250, loss[loss=0.1952, simple_loss=0.2956, pruned_loss=0.04738, over 21838.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3199, pruned_loss=0.081, over 4280982.60 frames. ], batch size: 316, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:02:32,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1792212.0, ans=0.0 2023-06-24 17:02:41,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1792272.0, ans=0.125 2023-06-24 17:03:54,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1792452.0, ans=0.125 2023-06-24 17:03:56,611 INFO [train.py:996] (0/4) Epoch 10, batch 24300, loss[loss=0.155, simple_loss=0.2387, pruned_loss=0.0356, over 21615.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3139, pruned_loss=0.07581, over 4274539.76 frames. 
], batch size: 230, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:04:32,608 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.370e+02 5.749e+02 8.681e+02 1.337e+03 2.668e+03, threshold=1.736e+03, percent-clipped=3.0 2023-06-24 17:04:34,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1792632.0, ans=0.1 2023-06-24 17:05:13,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1792752.0, ans=0.125 2023-06-24 17:05:25,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1792752.0, ans=0.1 2023-06-24 17:05:29,324 INFO [train.py:996] (0/4) Epoch 10, batch 24350, loss[loss=0.2372, simple_loss=0.3161, pruned_loss=0.07921, over 21483.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3118, pruned_loss=0.07579, over 4282367.16 frames. ], batch size: 548, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:05:49,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1792872.0, ans=0.125 2023-06-24 17:07:07,869 INFO [train.py:996] (0/4) Epoch 10, batch 24400, loss[loss=0.2551, simple_loss=0.3294, pruned_loss=0.09036, over 21710.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3149, pruned_loss=0.07936, over 4282955.74 frames. ], batch size: 351, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:07:24,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1793172.0, ans=0.125 2023-06-24 17:07:29,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1793172.0, ans=0.125 2023-06-24 17:07:39,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1793232.0, ans=0.1 2023-06-24 17:07:39,671 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-24 17:07:52,233 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.844e+02 8.479e+02 1.173e+03 1.615e+03 2.996e+03, threshold=2.346e+03, percent-clipped=19.0 2023-06-24 17:08:36,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1793352.0, ans=0.0 2023-06-24 17:08:48,995 INFO [train.py:996] (0/4) Epoch 10, batch 24450, loss[loss=0.2209, simple_loss=0.303, pruned_loss=0.06935, over 21273.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3158, pruned_loss=0.08104, over 4278452.97 frames. 
], batch size: 176, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:09:10,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1793472.0, ans=0.025 2023-06-24 17:09:26,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1793532.0, ans=0.125 2023-06-24 17:09:27,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1793532.0, ans=0.1 2023-06-24 17:09:27,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1793532.0, ans=0.125 2023-06-24 17:10:26,074 INFO [train.py:996] (0/4) Epoch 10, batch 24500, loss[loss=0.2235, simple_loss=0.31, pruned_loss=0.06849, over 21267.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3163, pruned_loss=0.08056, over 4279547.88 frames. ], batch size: 176, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:11:07,637 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.494e+02 6.645e+02 1.000e+03 1.722e+03 3.391e+03, threshold=2.001e+03, percent-clipped=7.0 2023-06-24 17:11:27,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1793892.0, ans=0.0 2023-06-24 17:11:31,259 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0 2023-06-24 17:11:46,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1793952.0, ans=0.1 2023-06-24 17:11:48,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1793952.0, ans=0.1 2023-06-24 17:11:55,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1793952.0, ans=0.0 2023-06-24 17:11:59,912 INFO [train.py:996] (0/4) Epoch 10, batch 24550, loss[loss=0.2429, simple_loss=0.3288, pruned_loss=0.07854, over 21600.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3188, pruned_loss=0.08214, over 4275908.82 frames. ], batch size: 389, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:12:18,671 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0 2023-06-24 17:13:05,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1794192.0, ans=0.1 2023-06-24 17:13:37,076 INFO [train.py:996] (0/4) Epoch 10, batch 24600, loss[loss=0.2508, simple_loss=0.331, pruned_loss=0.08533, over 21493.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3174, pruned_loss=0.08264, over 4268094.95 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:14:28,281 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.632e+02 7.490e+02 1.116e+03 1.562e+03 6.451e+03, threshold=2.232e+03, percent-clipped=18.0 2023-06-24 17:15:15,837 INFO [train.py:996] (0/4) Epoch 10, batch 24650, loss[loss=0.2151, simple_loss=0.2809, pruned_loss=0.0747, over 21309.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3095, pruned_loss=0.08121, over 4273018.55 frames. 
], batch size: 131, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:16:43,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-24 17:16:46,384 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-24 17:16:53,144 INFO [train.py:996] (0/4) Epoch 10, batch 24700, loss[loss=0.2351, simple_loss=0.3088, pruned_loss=0.08064, over 21789.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3064, pruned_loss=0.07992, over 4275231.63 frames. ], batch size: 317, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:17:49,170 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 6.249e+02 8.587e+02 1.281e+03 3.151e+03, threshold=1.717e+03, percent-clipped=6.0 2023-06-24 17:18:05,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1795092.0, ans=0.125 2023-06-24 17:18:31,297 INFO [train.py:996] (0/4) Epoch 10, batch 24750, loss[loss=0.2007, simple_loss=0.2609, pruned_loss=0.07024, over 21609.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2993, pruned_loss=0.07804, over 4270653.99 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:18:44,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1795212.0, ans=0.125 2023-06-24 17:19:12,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1795332.0, ans=0.0 2023-06-24 17:19:18,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1795332.0, ans=0.1 2023-06-24 17:19:21,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1795332.0, ans=0.0 2023-06-24 17:19:36,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1795392.0, ans=0.125 2023-06-24 17:19:44,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1795392.0, ans=0.1 2023-06-24 17:19:53,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1795452.0, ans=0.0 2023-06-24 17:20:07,392 INFO [train.py:996] (0/4) Epoch 10, batch 24800, loss[loss=0.2194, simple_loss=0.2834, pruned_loss=0.07773, over 21761.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2934, pruned_loss=0.07728, over 4279147.30 frames. ], batch size: 247, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 17:20:08,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1795512.0, ans=0.125 2023-06-24 17:20:56,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1795632.0, ans=0.125 2023-06-24 17:20:59,565 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.326e+02 5.922e+02 8.431e+02 1.285e+03 2.453e+03, threshold=1.686e+03, percent-clipped=12.0 2023-06-24 17:21:18,023 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. 
limit=15.0 2023-06-24 17:21:18,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1795692.0, ans=0.2 2023-06-24 17:21:36,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-24 17:21:39,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1795752.0, ans=0.125 2023-06-24 17:21:45,152 INFO [train.py:996] (0/4) Epoch 10, batch 24850, loss[loss=0.2202, simple_loss=0.2909, pruned_loss=0.0748, over 21857.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2938, pruned_loss=0.07834, over 4281190.07 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:21:58,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-06-24 17:23:08,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-24 17:23:22,413 INFO [train.py:996] (0/4) Epoch 10, batch 24900, loss[loss=0.2518, simple_loss=0.33, pruned_loss=0.08681, over 21699.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2969, pruned_loss=0.07923, over 4285743.75 frames. ], batch size: 351, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:23:57,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1796172.0, ans=0.125 2023-06-24 17:24:17,450 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.78 vs. limit=15.0 2023-06-24 17:24:20,591 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.038e+02 8.107e+02 1.257e+03 1.989e+03 3.453e+03, threshold=2.514e+03, percent-clipped=33.0 2023-06-24 17:24:33,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1796292.0, ans=0.04949747468305833 2023-06-24 17:25:00,299 INFO [train.py:996] (0/4) Epoch 10, batch 24950, loss[loss=0.2567, simple_loss=0.3291, pruned_loss=0.09215, over 21672.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.305, pruned_loss=0.08281, over 4289608.05 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:25:16,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1796412.0, ans=0.125 2023-06-24 17:25:31,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1796472.0, ans=0.2 2023-06-24 17:25:58,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1796532.0, ans=0.125 2023-06-24 17:26:40,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1796652.0, ans=0.1 2023-06-24 17:26:43,363 INFO [train.py:996] (0/4) Epoch 10, batch 25000, loss[loss=0.2338, simple_loss=0.2953, pruned_loss=0.08613, over 21479.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3104, pruned_loss=0.08431, over 4284112.01 frames. 
], batch size: 194, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:27:37,652 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 7.141e+02 9.275e+02 1.381e+03 2.945e+03, threshold=1.855e+03, percent-clipped=4.0 2023-06-24 17:28:12,497 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-24 17:28:31,909 INFO [train.py:996] (0/4) Epoch 10, batch 25050, loss[loss=0.2418, simple_loss=0.302, pruned_loss=0.09085, over 21796.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3043, pruned_loss=0.08263, over 4280759.32 frames. ], batch size: 352, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:29:55,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1797252.0, ans=0.07 2023-06-24 17:30:03,462 INFO [train.py:996] (0/4) Epoch 10, batch 25100, loss[loss=0.2192, simple_loss=0.3068, pruned_loss=0.06582, over 21753.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2991, pruned_loss=0.08082, over 4283241.27 frames. ], batch size: 282, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:30:18,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1797312.0, ans=0.1 2023-06-24 17:30:27,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1797372.0, ans=0.125 2023-06-24 17:30:51,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.233e+02 6.255e+02 9.404e+02 1.549e+03 2.850e+03, threshold=1.881e+03, percent-clipped=12.0 2023-06-24 17:31:06,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1797492.0, ans=0.125 2023-06-24 17:31:12,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1797492.0, ans=0.0 2023-06-24 17:31:29,225 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=22.5 2023-06-24 17:31:35,824 INFO [train.py:996] (0/4) Epoch 10, batch 25150, loss[loss=0.2206, simple_loss=0.3056, pruned_loss=0.06779, over 21361.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3007, pruned_loss=0.07801, over 4279294.87 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:32:49,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1797792.0, ans=0.0 2023-06-24 17:33:00,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1797852.0, ans=0.2 2023-06-24 17:33:01,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1797852.0, ans=10.0 2023-06-24 17:33:12,442 INFO [train.py:996] (0/4) Epoch 10, batch 25200, loss[loss=0.201, simple_loss=0.2847, pruned_loss=0.05863, over 17029.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3017, pruned_loss=0.07625, over 4269804.04 frames. 
], batch size: 65, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:33:18,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1797912.0, ans=0.2 2023-06-24 17:34:00,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 6.948e+02 1.057e+03 1.508e+03 2.758e+03, threshold=2.115e+03, percent-clipped=16.0 2023-06-24 17:34:13,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1798092.0, ans=0.0 2023-06-24 17:34:16,017 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-24 17:34:18,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1798092.0, ans=0.125 2023-06-24 17:34:35,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1798152.0, ans=0.125 2023-06-24 17:34:36,165 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-24 17:34:49,137 INFO [train.py:996] (0/4) Epoch 10, batch 25250, loss[loss=0.1964, simple_loss=0.2891, pruned_loss=0.05187, over 21656.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3, pruned_loss=0.07484, over 4262982.91 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:35:05,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1798212.0, ans=0.0 2023-06-24 17:35:05,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1798212.0, ans=0.0 2023-06-24 17:35:07,941 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-24 17:35:08,037 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.25 vs. limit=15.0 2023-06-24 17:36:26,978 INFO [train.py:996] (0/4) Epoch 10, batch 25300, loss[loss=0.2052, simple_loss=0.2711, pruned_loss=0.06968, over 21303.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2968, pruned_loss=0.07399, over 4252059.60 frames. ], batch size: 144, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:37:15,751 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.390e+02 6.891e+02 9.341e+02 1.406e+03 3.031e+03, threshold=1.868e+03, percent-clipped=2.0 2023-06-24 17:37:19,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1798632.0, ans=0.125 2023-06-24 17:37:34,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1798692.0, ans=0.0 2023-06-24 17:37:54,346 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.93 vs. limit=5.0 2023-06-24 17:38:10,245 INFO [train.py:996] (0/4) Epoch 10, batch 25350, loss[loss=0.1888, simple_loss=0.2866, pruned_loss=0.04547, over 20793.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2988, pruned_loss=0.07355, over 4250871.54 frames. 
], batch size: 608, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:38:57,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1798932.0, ans=0.5 2023-06-24 17:39:23,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=1799052.0, ans=0.2 2023-06-24 17:39:33,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1799052.0, ans=0.125 2023-06-24 17:39:36,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1799052.0, ans=0.04949747468305833 2023-06-24 17:39:38,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1799052.0, ans=0.125 2023-06-24 17:39:38,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1799052.0, ans=0.125 2023-06-24 17:39:42,338 INFO [train.py:996] (0/4) Epoch 10, batch 25400, loss[loss=0.2365, simple_loss=0.3054, pruned_loss=0.08382, over 21836.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2958, pruned_loss=0.07302, over 4256750.98 frames. ], batch size: 107, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:40:00,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1799112.0, ans=0.125 2023-06-24 17:40:02,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. limit=10.0 2023-06-24 17:40:05,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1799112.0, ans=0.125 2023-06-24 17:40:13,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.58 vs. limit=15.0 2023-06-24 17:40:31,212 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.306e+02 6.316e+02 8.896e+02 1.149e+03 2.761e+03, threshold=1.779e+03, percent-clipped=5.0 2023-06-24 17:40:31,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1799232.0, ans=0.0 2023-06-24 17:40:45,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1799292.0, ans=0.125 2023-06-24 17:40:58,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1799352.0, ans=10.0 2023-06-24 17:41:09,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1799352.0, ans=0.2 2023-06-24 17:41:24,901 INFO [train.py:996] (0/4) Epoch 10, batch 25450, loss[loss=0.2154, simple_loss=0.2883, pruned_loss=0.0712, over 20048.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2963, pruned_loss=0.07418, over 4257832.65 frames. ], batch size: 702, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 17:41:44,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.70 vs. 
limit=15.0 2023-06-24 17:43:04,180 INFO [train.py:996] (0/4) Epoch 10, batch 25500, loss[loss=0.1969, simple_loss=0.2775, pruned_loss=0.05817, over 21268.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2966, pruned_loss=0.07214, over 4247679.26 frames. ], batch size: 176, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 17:43:43,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.327e+02 6.563e+02 1.058e+03 1.442e+03 3.790e+03, threshold=2.117e+03, percent-clipped=15.0 2023-06-24 17:43:44,950 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.95 vs. limit=15.0 2023-06-24 17:44:33,193 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-300000.pt 2023-06-24 17:44:39,404 INFO [train.py:996] (0/4) Epoch 10, batch 25550, loss[loss=0.2307, simple_loss=0.306, pruned_loss=0.07771, over 15573.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3031, pruned_loss=0.07201, over 4231744.71 frames. ], batch size: 60, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:44:44,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1800012.0, ans=0.0 2023-06-24 17:44:53,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1800072.0, ans=0.125 2023-06-24 17:45:01,566 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:45:10,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1800132.0, ans=0.1 2023-06-24 17:45:23,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1800132.0, ans=0.125 2023-06-24 17:45:24,911 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:45:40,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1800192.0, ans=0.125 2023-06-24 17:45:48,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1800192.0, ans=0.0 2023-06-24 17:46:17,028 INFO [train.py:996] (0/4) Epoch 10, batch 25600, loss[loss=0.2752, simple_loss=0.3371, pruned_loss=0.1066, over 21844.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3078, pruned_loss=0.07359, over 4237868.82 frames. ], batch size: 247, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 17:46:52,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1800432.0, ans=0.125 2023-06-24 17:46:58,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.521e+02 8.019e+02 1.314e+03 1.703e+03 3.186e+03, threshold=2.628e+03, percent-clipped=13.0 2023-06-24 17:47:49,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1800552.0, ans=0.125 2023-06-24 17:47:54,747 INFO [train.py:996] (0/4) Epoch 10, batch 25650, loss[loss=0.2317, simple_loss=0.2918, pruned_loss=0.08582, over 21839.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3085, pruned_loss=0.07688, over 4246492.20 frames. 
], batch size: 98, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:48:33,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1800732.0, ans=0.0 2023-06-24 17:49:13,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1800852.0, ans=0.2 2023-06-24 17:49:28,876 INFO [train.py:996] (0/4) Epoch 10, batch 25700, loss[loss=0.3303, simple_loss=0.3765, pruned_loss=0.142, over 21651.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3045, pruned_loss=0.07755, over 4255645.43 frames. ], batch size: 508, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:49:31,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1800912.0, ans=0.0 2023-06-24 17:49:31,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1800912.0, ans=0.125 2023-06-24 17:49:43,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1800972.0, ans=0.0 2023-06-24 17:49:54,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1800972.0, ans=0.125 2023-06-24 17:50:16,729 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.285e+02 8.817e+02 1.358e+03 2.096e+03 4.463e+03, threshold=2.717e+03, percent-clipped=14.0 2023-06-24 17:50:41,829 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-24 17:51:03,527 INFO [train.py:996] (0/4) Epoch 10, batch 25750, loss[loss=0.3094, simple_loss=0.4097, pruned_loss=0.1045, over 20764.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.31, pruned_loss=0.08055, over 4251887.98 frames. ], batch size: 608, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:51:13,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1801212.0, ans=0.125 2023-06-24 17:51:21,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1801272.0, ans=0.035 2023-06-24 17:51:40,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1801272.0, ans=0.0 2023-06-24 17:52:31,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1801452.0, ans=0.2 2023-06-24 17:52:38,287 INFO [train.py:996] (0/4) Epoch 10, batch 25800, loss[loss=0.222, simple_loss=0.3028, pruned_loss=0.07055, over 21812.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3207, pruned_loss=0.08421, over 4262712.57 frames. 
], batch size: 282, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:53:40,824 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.838e+02 7.699e+02 1.038e+03 1.788e+03 4.629e+03, threshold=2.076e+03, percent-clipped=8.0 2023-06-24 17:53:46,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1801692.0, ans=0.125 2023-06-24 17:53:56,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1801692.0, ans=0.04949747468305833 2023-06-24 17:54:06,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-24 17:54:17,150 INFO [train.py:996] (0/4) Epoch 10, batch 25850, loss[loss=0.2463, simple_loss=0.3091, pruned_loss=0.09176, over 21851.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3254, pruned_loss=0.08538, over 4260741.79 frames. ], batch size: 118, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:55:15,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1801932.0, ans=0.2 2023-06-24 17:55:26,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1801992.0, ans=0.125 2023-06-24 17:55:28,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-24 17:55:56,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-24 17:56:01,217 INFO [train.py:996] (0/4) Epoch 10, batch 25900, loss[loss=0.2869, simple_loss=0.3729, pruned_loss=0.1005, over 21667.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3261, pruned_loss=0.0855, over 4268658.79 frames. ], batch size: 263, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:56:01,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1802112.0, ans=0.0 2023-06-24 17:56:36,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1802172.0, ans=0.2 2023-06-24 17:56:50,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1802232.0, ans=0.125 2023-06-24 17:56:53,711 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.611e+02 7.173e+02 1.132e+03 1.463e+03 2.574e+03, threshold=2.264e+03, percent-clipped=5.0 2023-06-24 17:57:03,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1802292.0, ans=0.125 2023-06-24 17:57:07,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.03 vs. 
limit=15.0 2023-06-24 17:57:29,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1802352.0, ans=0.0 2023-06-24 17:57:38,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1802352.0, ans=0.125 2023-06-24 17:57:45,046 INFO [train.py:996] (0/4) Epoch 10, batch 25950, loss[loss=0.3014, simple_loss=0.3699, pruned_loss=0.1165, over 21588.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3313, pruned_loss=0.08794, over 4273443.16 frames. ], batch size: 389, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:57:51,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1802412.0, ans=0.1 2023-06-24 17:58:10,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1802472.0, ans=0.125 2023-06-24 17:58:50,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1802592.0, ans=0.125 2023-06-24 17:59:00,956 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:59:20,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1802652.0, ans=0.125 2023-06-24 17:59:23,439 INFO [train.py:996] (0/4) Epoch 10, batch 26000, loss[loss=0.2369, simple_loss=0.331, pruned_loss=0.07146, over 21624.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3305, pruned_loss=0.08624, over 4277476.95 frames. ], batch size: 389, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 17:59:33,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1802712.0, ans=0.2 2023-06-24 18:00:06,703 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.493e+02 7.586e+02 1.074e+03 1.522e+03 3.008e+03, threshold=2.148e+03, percent-clipped=6.0 2023-06-24 18:00:56,237 INFO [train.py:996] (0/4) Epoch 10, batch 26050, loss[loss=0.2255, simple_loss=0.2906, pruned_loss=0.08024, over 21660.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3298, pruned_loss=0.08605, over 4273252.49 frames. ], batch size: 263, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:01:06,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1803012.0, ans=0.1 2023-06-24 18:01:10,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1803072.0, ans=0.125 2023-06-24 18:01:12,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1803072.0, ans=0.0 2023-06-24 18:01:13,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1803072.0, ans=0.0 2023-06-24 18:01:56,353 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:02:32,725 INFO [train.py:996] (0/4) Epoch 10, batch 26100, loss[loss=0.2367, simple_loss=0.2961, pruned_loss=0.08859, over 21452.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3254, pruned_loss=0.08707, over 4283858.92 frames. 
], batch size: 211, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:02:33,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1803312.0, ans=0.125 2023-06-24 18:03:15,409 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.148e+02 7.025e+02 9.795e+02 1.501e+03 3.322e+03, threshold=1.959e+03, percent-clipped=9.0 2023-06-24 18:03:15,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1803432.0, ans=0.0 2023-06-24 18:03:34,211 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:03:46,420 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.75 vs. limit=5.0 2023-06-24 18:03:59,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1803552.0, ans=0.1 2023-06-24 18:04:05,734 INFO [train.py:996] (0/4) Epoch 10, batch 26150, loss[loss=0.2922, simple_loss=0.3575, pruned_loss=0.1134, over 21588.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3222, pruned_loss=0.08665, over 4290145.44 frames. ], batch size: 414, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:04:10,372 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-24 18:04:12,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1803612.0, ans=0.0 2023-06-24 18:04:32,436 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:05:18,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1803792.0, ans=0.2 2023-06-24 18:05:37,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-24 18:05:44,638 INFO [train.py:996] (0/4) Epoch 10, batch 26200, loss[loss=0.2773, simple_loss=0.3817, pruned_loss=0.0864, over 21256.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.322, pruned_loss=0.08405, over 4286643.16 frames. ], batch size: 548, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:06:11,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1803972.0, ans=0.125 2023-06-24 18:06:36,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.529e+02 6.225e+02 8.349e+02 1.175e+03 2.397e+03, threshold=1.670e+03, percent-clipped=3.0 2023-06-24 18:06:51,847 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-24 18:07:17,746 INFO [train.py:996] (0/4) Epoch 10, batch 26250, loss[loss=0.2658, simple_loss=0.3385, pruned_loss=0.09658, over 21606.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3247, pruned_loss=0.08252, over 4291812.51 frames. ], batch size: 548, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:07:18,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=15.0 2023-06-24 18:07:24,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1804212.0, ans=0.125 2023-06-24 18:07:45,338 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=12.0 2023-06-24 18:07:54,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1804332.0, ans=0.1 2023-06-24 18:08:11,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-24 18:08:51,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1804452.0, ans=0.125 2023-06-24 18:08:53,843 INFO [train.py:996] (0/4) Epoch 10, batch 26300, loss[loss=0.2187, simple_loss=0.2908, pruned_loss=0.07332, over 21925.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3218, pruned_loss=0.08341, over 4292091.08 frames. ], batch size: 351, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:09:46,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1804632.0, ans=0.2 2023-06-24 18:09:48,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1804632.0, ans=0.125 2023-06-24 18:09:51,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.353e+02 7.232e+02 9.217e+02 1.300e+03 2.808e+03, threshold=1.843e+03, percent-clipped=15.0 2023-06-24 18:10:27,968 INFO [train.py:996] (0/4) Epoch 10, batch 26350, loss[loss=0.2947, simple_loss=0.3587, pruned_loss=0.1153, over 21680.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3203, pruned_loss=0.08412, over 4291522.76 frames. ], batch size: 351, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:10:40,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=22.5 2023-06-24 18:11:04,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1804872.0, ans=0.125 2023-06-24 18:11:55,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1805052.0, ans=0.1 2023-06-24 18:12:00,229 INFO [train.py:996] (0/4) Epoch 10, batch 26400, loss[loss=0.2242, simple_loss=0.2791, pruned_loss=0.08463, over 21578.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3148, pruned_loss=0.08441, over 4281697.03 frames. ], batch size: 231, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:12:02,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1805112.0, ans=0.125 2023-06-24 18:12:21,705 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.83 vs. 
limit=15.0 2023-06-24 18:12:57,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1805232.0, ans=0.125 2023-06-24 18:12:58,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1805232.0, ans=0.125 2023-06-24 18:13:03,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1805232.0, ans=0.125 2023-06-24 18:13:04,641 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.504e+02 7.130e+02 9.166e+02 1.346e+03 2.893e+03, threshold=1.833e+03, percent-clipped=10.0 2023-06-24 18:13:08,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1805292.0, ans=0.125 2023-06-24 18:13:23,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=22.5 2023-06-24 18:13:28,075 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:13:32,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1805352.0, ans=0.2 2023-06-24 18:13:44,723 INFO [train.py:996] (0/4) Epoch 10, batch 26450, loss[loss=0.2477, simple_loss=0.3413, pruned_loss=0.07706, over 21685.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3139, pruned_loss=0.08371, over 4276049.77 frames. ], batch size: 247, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:15:16,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1805652.0, ans=0.125 2023-06-24 18:15:18,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1805652.0, ans=0.125 2023-06-24 18:15:33,634 INFO [train.py:996] (0/4) Epoch 10, batch 26500, loss[loss=0.2908, simple_loss=0.3715, pruned_loss=0.105, over 21681.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3165, pruned_loss=0.0828, over 4267257.80 frames. ], batch size: 441, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:15:39,565 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-24 18:16:06,015 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-24 18:16:06,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.66 vs. limit=5.0 2023-06-24 18:16:19,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.291e+02 8.492e+02 1.713e+03 2.381e+03 4.815e+03, threshold=3.427e+03, percent-clipped=46.0 2023-06-24 18:16:22,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1805892.0, ans=0.125 2023-06-24 18:17:13,484 INFO [train.py:996] (0/4) Epoch 10, batch 26550, loss[loss=0.1881, simple_loss=0.2694, pruned_loss=0.05339, over 21558.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3138, pruned_loss=0.07982, over 4267414.51 frames. 
], batch size: 212, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:17:20,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1806012.0, ans=0.5 2023-06-24 18:17:42,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1806072.0, ans=10.0 2023-06-24 18:18:11,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-24 18:18:28,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1806192.0, ans=0.125 2023-06-24 18:18:51,163 INFO [train.py:996] (0/4) Epoch 10, batch 26600, loss[loss=0.2365, simple_loss=0.3086, pruned_loss=0.08215, over 21280.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3136, pruned_loss=0.07682, over 4271624.34 frames. ], batch size: 176, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:18:53,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1806312.0, ans=0.0 2023-06-24 18:19:08,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1806372.0, ans=0.2 2023-06-24 18:19:10,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1806372.0, ans=0.125 2023-06-24 18:19:31,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.57 vs. limit=8.0 2023-06-24 18:19:50,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.252e+02 6.735e+02 9.367e+02 1.418e+03 2.947e+03, threshold=1.873e+03, percent-clipped=0.0 2023-06-24 18:20:05,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=22.5 2023-06-24 18:20:27,318 INFO [train.py:996] (0/4) Epoch 10, batch 26650, loss[loss=0.1704, simple_loss=0.2444, pruned_loss=0.04816, over 21145.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3069, pruned_loss=0.07635, over 4270875.92 frames. ], batch size: 159, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:20:35,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1806612.0, ans=0.0 2023-06-24 18:20:36,205 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.71 vs. limit=6.0 2023-06-24 18:20:38,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1806612.0, ans=0.1 2023-06-24 18:21:35,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1806792.0, ans=0.0 2023-06-24 18:21:50,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1806852.0, ans=0.125 2023-06-24 18:21:51,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.72 vs. 
limit=22.5 2023-06-24 18:22:03,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1806912.0, ans=0.125 2023-06-24 18:22:04,430 INFO [train.py:996] (0/4) Epoch 10, batch 26700, loss[loss=0.2423, simple_loss=0.3119, pruned_loss=0.08639, over 21877.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3005, pruned_loss=0.07367, over 4271337.29 frames. ], batch size: 351, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:22:06,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1806912.0, ans=0.125 2023-06-24 18:23:03,956 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 5.904e+02 8.895e+02 1.266e+03 2.611e+03, threshold=1.779e+03, percent-clipped=6.0 2023-06-24 18:23:26,017 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.88 vs. limit=22.5 2023-06-24 18:23:32,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=22.5 2023-06-24 18:23:37,449 INFO [train.py:996] (0/4) Epoch 10, batch 26750, loss[loss=0.2548, simple_loss=0.337, pruned_loss=0.08631, over 21359.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2993, pruned_loss=0.07258, over 4274158.66 frames. ], batch size: 548, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:23:47,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1807212.0, ans=0.125 2023-06-24 18:24:13,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1807272.0, ans=0.09899494936611666 2023-06-24 18:24:27,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1807332.0, ans=0.035 2023-06-24 18:24:34,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1807332.0, ans=0.0 2023-06-24 18:25:08,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1807452.0, ans=0.125 2023-06-24 18:25:17,207 INFO [train.py:996] (0/4) Epoch 10, batch 26800, loss[loss=0.3077, simple_loss=0.3628, pruned_loss=0.1263, over 21415.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3085, pruned_loss=0.07716, over 4273535.37 frames. ], batch size: 471, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:26:14,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1807632.0, ans=0.2 2023-06-24 18:26:14,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1807632.0, ans=0.125 2023-06-24 18:26:16,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.147e+02 7.518e+02 9.832e+02 1.417e+03 2.844e+03, threshold=1.966e+03, percent-clipped=8.0 2023-06-24 18:26:38,503 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. 
limit=15.0 2023-06-24 18:26:39,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1807752.0, ans=0.125 2023-06-24 18:26:59,151 INFO [train.py:996] (0/4) Epoch 10, batch 26850, loss[loss=0.2189, simple_loss=0.2838, pruned_loss=0.07702, over 21893.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3106, pruned_loss=0.07977, over 4271127.61 frames. ], batch size: 107, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:27:35,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1807872.0, ans=0.1 2023-06-24 18:27:35,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1807872.0, ans=0.2 2023-06-24 18:27:46,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1807932.0, ans=0.0 2023-06-24 18:28:30,963 INFO [train.py:996] (0/4) Epoch 10, batch 26900, loss[loss=0.2278, simple_loss=0.2786, pruned_loss=0.08851, over 21506.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3019, pruned_loss=0.07857, over 4271140.74 frames. ], batch size: 442, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:29:23,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1808232.0, ans=0.0 2023-06-24 18:29:30,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.980e+02 7.494e+02 9.472e+02 1.508e+03 3.136e+03, threshold=1.894e+03, percent-clipped=8.0 2023-06-24 18:29:41,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1808292.0, ans=0.125 2023-06-24 18:29:49,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1808352.0, ans=0.0 2023-06-24 18:29:51,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1808352.0, ans=0.1 2023-06-24 18:30:07,728 INFO [train.py:996] (0/4) Epoch 10, batch 26950, loss[loss=0.2082, simple_loss=0.3021, pruned_loss=0.05709, over 19723.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3024, pruned_loss=0.07936, over 4273582.76 frames. ], batch size: 702, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:30:40,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1808472.0, ans=0.125 2023-06-24 18:31:54,866 INFO [train.py:996] (0/4) Epoch 10, batch 27000, loss[loss=0.2332, simple_loss=0.2977, pruned_loss=0.08436, over 20126.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3025, pruned_loss=0.07718, over 4276288.32 frames. ], batch size: 703, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:31:54,867 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 18:32:16,361 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2412, simple_loss=0.3374, pruned_loss=0.07247, over 1796401.00 frames. 
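The [optim.py:471] entries in this part of the log report five grad-norm quartiles (apparently min/25%/50%/75%/max), an adaptive clipping threshold, and the percentage of recently clipped batches; in each of the entries above the threshold works out to Clipping_scale (2.0) times the logged median quartile. The sketch below reproduces statistics of that shape from a window of per-batch gradient norms. It is only an illustration consistent with the logged numbers: the window of norms, the function name grad_norm_clip_stats, and the exact quantile positions are assumptions, not the actual icefall optim.py implementation.

```python
import torch

def grad_norm_clip_stats(recent_norms: torch.Tensor, clipping_scale: float = 2.0):
    """Illustrative only: summarize a window of per-batch gradient norms the
    way the [optim.py:471] log lines do (five quartiles, a clipping threshold,
    and the percentage of batches that exceeded it). `recent_norms` is assumed
    to be a 1-D tensor of total gradient norms from recent batches."""
    probs = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
    quartiles = torch.quantile(recent_norms, probs)
    # In the log entries above, the threshold equals clipping_scale * median.
    threshold = clipping_scale * quartiles[2]
    percent_clipped = 100.0 * (recent_norms > threshold).float().mean()
    return quartiles, threshold, percent_clipped

# Hypothetical usage, printing a line in the same shape as the log entries:
norms = torch.tensor([483.8, 769.9, 1038.0, 1788.0, 4629.0, 912.0, 2300.0])
q, thr, pct = grad_norm_clip_stats(norms)
print("grad-norm quartiles", " ".join(f"{v:.3e}" for v in q.tolist()),
      f"threshold={thr.item():.3e}, percent-clipped={pct.item():.1f}")
```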
2023-06-24 18:32:16,362 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 18:32:23,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1808712.0, ans=0.2 2023-06-24 18:32:47,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1808832.0, ans=0.025 2023-06-24 18:33:08,819 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.928e+02 6.448e+02 9.390e+02 1.351e+03 2.937e+03, threshold=1.878e+03, percent-clipped=11.0 2023-06-24 18:33:12,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1808892.0, ans=0.0 2023-06-24 18:33:55,950 INFO [train.py:996] (0/4) Epoch 10, batch 27050, loss[loss=0.2433, simple_loss=0.3157, pruned_loss=0.08546, over 21609.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3025, pruned_loss=0.07369, over 4277312.22 frames. ], batch size: 230, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:34:03,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1809012.0, ans=0.125 2023-06-24 18:34:18,050 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:34:42,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1809132.0, ans=0.1 2023-06-24 18:35:31,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1809312.0, ans=0.125 2023-06-24 18:35:32,528 INFO [train.py:996] (0/4) Epoch 10, batch 27100, loss[loss=0.2296, simple_loss=0.3239, pruned_loss=0.06764, over 21460.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3048, pruned_loss=0.07535, over 4284131.89 frames. ], batch size: 211, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:36:24,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.409e+02 6.272e+02 8.340e+02 1.180e+03 2.454e+03, threshold=1.668e+03, percent-clipped=3.0 2023-06-24 18:36:46,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1809492.0, ans=0.125 2023-06-24 18:36:46,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1809492.0, ans=0.2 2023-06-24 18:37:10,898 INFO [train.py:996] (0/4) Epoch 10, batch 27150, loss[loss=0.2265, simple_loss=0.323, pruned_loss=0.06502, over 21674.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3163, pruned_loss=0.07838, over 4286589.04 frames. ], batch size: 263, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:37:24,570 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. 
limit=15.0 2023-06-24 18:37:37,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1809672.0, ans=0.0 2023-06-24 18:38:29,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1809792.0, ans=0.125 2023-06-24 18:38:37,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1809852.0, ans=0.125 2023-06-24 18:38:43,361 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:38:49,498 INFO [train.py:996] (0/4) Epoch 10, batch 27200, loss[loss=0.2309, simple_loss=0.3099, pruned_loss=0.076, over 21592.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3234, pruned_loss=0.08085, over 4285201.04 frames. ], batch size: 230, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:39:02,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1809912.0, ans=0.0 2023-06-24 18:39:17,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1809972.0, ans=0.125 2023-06-24 18:39:20,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1809972.0, ans=0.125 2023-06-24 18:39:56,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.592e+02 7.616e+02 1.186e+03 1.800e+03 4.357e+03, threshold=2.372e+03, percent-clipped=30.0 2023-06-24 18:40:07,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1810092.0, ans=0.125 2023-06-24 18:40:27,559 INFO [train.py:996] (0/4) Epoch 10, batch 27250, loss[loss=0.2309, simple_loss=0.3073, pruned_loss=0.07729, over 20610.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3264, pruned_loss=0.08469, over 4283839.12 frames. ], batch size: 607, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:41:11,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1810272.0, ans=0.125 2023-06-24 18:41:20,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1810332.0, ans=0.125 2023-06-24 18:41:55,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1810452.0, ans=0.2 2023-06-24 18:42:12,515 INFO [train.py:996] (0/4) Epoch 10, batch 27300, loss[loss=0.2784, simple_loss=0.3709, pruned_loss=0.0929, over 21534.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.328, pruned_loss=0.08539, over 4278152.65 frames. ], batch size: 471, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:42:15,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.85 vs. 
limit=15.0 2023-06-24 18:43:18,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.724e+02 6.828e+02 8.478e+02 1.184e+03 2.294e+03, threshold=1.696e+03, percent-clipped=0.0 2023-06-24 18:43:45,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1810752.0, ans=0.125 2023-06-24 18:43:46,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1810752.0, ans=0.125 2023-06-24 18:43:55,772 INFO [train.py:996] (0/4) Epoch 10, batch 27350, loss[loss=0.2453, simple_loss=0.3253, pruned_loss=0.08264, over 21885.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3316, pruned_loss=0.08716, over 4279488.93 frames. ], batch size: 118, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:44:22,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1810872.0, ans=0.125 2023-06-24 18:44:37,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1810872.0, ans=0.07 2023-06-24 18:44:42,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1810872.0, ans=0.125 2023-06-24 18:45:01,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1810992.0, ans=0.125 2023-06-24 18:45:13,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1810992.0, ans=0.0 2023-06-24 18:45:31,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1811052.0, ans=0.125 2023-06-24 18:45:45,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1811052.0, ans=0.125 2023-06-24 18:45:46,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2023-06-24 18:45:48,118 INFO [train.py:996] (0/4) Epoch 10, batch 27400, loss[loss=0.233, simple_loss=0.2895, pruned_loss=0.08823, over 21372.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3251, pruned_loss=0.08579, over 4287086.36 frames. ], batch size: 177, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:45:49,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.71 vs. 
limit=15.0 2023-06-24 18:46:19,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1811172.0, ans=0.0 2023-06-24 18:46:36,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1811232.0, ans=0.125 2023-06-24 18:46:44,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1811232.0, ans=0.0 2023-06-24 18:46:46,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1811232.0, ans=0.0 2023-06-24 18:46:49,512 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 6.744e+02 9.251e+02 1.244e+03 3.904e+03, threshold=1.850e+03, percent-clipped=13.0 2023-06-24 18:47:09,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1811352.0, ans=0.2 2023-06-24 18:47:39,112 INFO [train.py:996] (0/4) Epoch 10, batch 27450, loss[loss=0.2333, simple_loss=0.3187, pruned_loss=0.07392, over 21300.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3184, pruned_loss=0.08342, over 4281250.54 frames. ], batch size: 548, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:47:42,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1811412.0, ans=0.0 2023-06-24 18:47:56,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1811412.0, ans=0.09899494936611666 2023-06-24 18:48:06,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1811472.0, ans=0.0 2023-06-24 18:48:45,482 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.71 vs. limit=15.0 2023-06-24 18:49:19,538 INFO [train.py:996] (0/4) Epoch 10, batch 27500, loss[loss=0.2434, simple_loss=0.311, pruned_loss=0.08794, over 21859.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3179, pruned_loss=0.08436, over 4279736.33 frames. ], batch size: 371, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:49:43,238 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.57 vs. limit=10.0 2023-06-24 18:50:01,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1811772.0, ans=0.5 2023-06-24 18:50:26,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.516e+02 9.018e+02 1.642e+03 4.000e+03, threshold=1.804e+03, percent-clipped=22.0 2023-06-24 18:50:49,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1811952.0, ans=0.125 2023-06-24 18:51:04,432 INFO [train.py:996] (0/4) Epoch 10, batch 27550, loss[loss=0.2079, simple_loss=0.2756, pruned_loss=0.07007, over 21245.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3136, pruned_loss=0.08117, over 4288114.35 frames. 
], batch size: 548, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 18:51:05,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1812012.0, ans=0.1 2023-06-24 18:51:19,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1812012.0, ans=0.1 2023-06-24 18:51:33,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1812072.0, ans=0.1 2023-06-24 18:51:35,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1812072.0, ans=0.125 2023-06-24 18:52:52,107 INFO [train.py:996] (0/4) Epoch 10, batch 27600, loss[loss=0.2514, simple_loss=0.3316, pruned_loss=0.08561, over 19960.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.306, pruned_loss=0.07915, over 4284939.29 frames. ], batch size: 702, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:53:48,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1812492.0, ans=0.125 2023-06-24 18:53:51,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.487e+02 7.022e+02 8.920e+02 1.196e+03 2.930e+03, threshold=1.784e+03, percent-clipped=9.0 2023-06-24 18:54:03,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1812492.0, ans=0.125 2023-06-24 18:54:07,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1812492.0, ans=0.0 2023-06-24 18:54:11,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1812552.0, ans=0.125 2023-06-24 18:54:15,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.87 vs. limit=12.0 2023-06-24 18:54:33,770 INFO [train.py:996] (0/4) Epoch 10, batch 27650, loss[loss=0.22, simple_loss=0.3063, pruned_loss=0.06684, over 21730.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3002, pruned_loss=0.0784, over 4277605.45 frames. ], batch size: 298, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:54:49,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1812612.0, ans=0.2 2023-06-24 18:54:52,081 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-24 18:54:59,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1812672.0, ans=0.2 2023-06-24 18:55:21,641 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-24 18:56:02,672 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2023-06-24 18:56:23,903 INFO [train.py:996] (0/4) Epoch 10, batch 27700, loss[loss=0.2293, simple_loss=0.3034, pruned_loss=0.07759, over 21466.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3015, pruned_loss=0.07768, over 4284767.38 frames. 
], batch size: 211, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:56:31,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1812912.0, ans=0.0 2023-06-24 18:56:38,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1812912.0, ans=0.125 2023-06-24 18:56:46,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1812972.0, ans=0.0 2023-06-24 18:57:15,198 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-24 18:57:20,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.319e+02 7.274e+02 1.072e+03 1.414e+03 3.667e+03, threshold=2.145e+03, percent-clipped=20.0 2023-06-24 18:57:33,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1813092.0, ans=0.2 2023-06-24 18:58:08,681 INFO [train.py:996] (0/4) Epoch 10, batch 27750, loss[loss=0.2364, simple_loss=0.3216, pruned_loss=0.07564, over 21393.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3036, pruned_loss=0.07713, over 4282199.24 frames. ], batch size: 548, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:59:12,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1813392.0, ans=0.2 2023-06-24 18:59:28,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1813452.0, ans=0.125 2023-06-24 18:59:36,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1813452.0, ans=0.0 2023-06-24 18:59:43,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2023-06-24 18:59:45,892 INFO [train.py:996] (0/4) Epoch 10, batch 27800, loss[loss=0.2553, simple_loss=0.3171, pruned_loss=0.09679, over 21832.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3048, pruned_loss=0.07818, over 4287802.62 frames. ], batch size: 441, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:00:09,464 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=12.0 2023-06-24 19:00:35,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1813632.0, ans=0.125 2023-06-24 19:00:49,657 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.608e+02 7.266e+02 8.786e+02 1.229e+03 3.001e+03, threshold=1.757e+03, percent-clipped=8.0 2023-06-24 19:00:50,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1813692.0, ans=0.125 2023-06-24 19:01:11,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1813692.0, ans=15.0 2023-06-24 19:01:40,627 INFO [train.py:996] (0/4) Epoch 10, batch 27850, loss[loss=0.2107, simple_loss=0.3039, pruned_loss=0.05874, over 21752.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3042, pruned_loss=0.07894, over 4295230.52 frames. 
], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:02:18,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1813932.0, ans=0.95 2023-06-24 19:03:33,938 INFO [train.py:996] (0/4) Epoch 10, batch 27900, loss[loss=0.2463, simple_loss=0.3407, pruned_loss=0.07593, over 21784.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3147, pruned_loss=0.08124, over 4294997.07 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:04:01,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1814172.0, ans=0.0 2023-06-24 19:04:08,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1814172.0, ans=0.2 2023-06-24 19:04:13,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1814232.0, ans=0.125 2023-06-24 19:04:37,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.993e+02 8.012e+02 1.137e+03 1.760e+03 3.581e+03, threshold=2.273e+03, percent-clipped=25.0 2023-06-24 19:05:04,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1814352.0, ans=0.125 2023-06-24 19:05:21,090 INFO [train.py:996] (0/4) Epoch 10, batch 27950, loss[loss=0.2273, simple_loss=0.3466, pruned_loss=0.05398, over 20804.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.313, pruned_loss=0.07723, over 4292897.38 frames. ], batch size: 607, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:05:25,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1814412.0, ans=0.0 2023-06-24 19:05:43,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1814472.0, ans=0.1 2023-06-24 19:06:02,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1814472.0, ans=0.125 2023-06-24 19:06:05,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1814532.0, ans=0.125 2023-06-24 19:06:09,099 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-06-24 19:06:23,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1814592.0, ans=0.125 2023-06-24 19:06:49,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1814652.0, ans=0.0 2023-06-24 19:07:07,523 INFO [train.py:996] (0/4) Epoch 10, batch 28000, loss[loss=0.2279, simple_loss=0.3015, pruned_loss=0.07718, over 21894.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3106, pruned_loss=0.07465, over 4292408.32 frames. 
], batch size: 118, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:07:33,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1814772.0, ans=0.125 2023-06-24 19:07:53,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1814832.0, ans=0.05 2023-06-24 19:08:02,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1814832.0, ans=0.2 2023-06-24 19:08:13,046 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.030e+02 6.703e+02 9.141e+02 1.311e+03 2.491e+03, threshold=1.828e+03, percent-clipped=2.0 2023-06-24 19:08:51,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1814952.0, ans=0.1 2023-06-24 19:08:56,506 INFO [train.py:996] (0/4) Epoch 10, batch 28050, loss[loss=0.1484, simple_loss=0.1958, pruned_loss=0.05052, over 16594.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3078, pruned_loss=0.07655, over 4289955.72 frames. ], batch size: 62, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:09:07,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1815012.0, ans=0.125 2023-06-24 19:09:24,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1815072.0, ans=0.2 2023-06-24 19:09:32,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-24 19:09:41,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1815132.0, ans=0.0 2023-06-24 19:10:11,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1815192.0, ans=0.125 2023-06-24 19:10:37,080 INFO [train.py:996] (0/4) Epoch 10, batch 28100, loss[loss=0.2112, simple_loss=0.2802, pruned_loss=0.07113, over 22002.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3058, pruned_loss=0.07662, over 4281133.81 frames. ], batch size: 103, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:11:53,215 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.563e+02 7.316e+02 9.478e+02 1.492e+03 2.732e+03, threshold=1.896e+03, percent-clipped=11.0 2023-06-24 19:12:03,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1815492.0, ans=0.1 2023-06-24 19:12:21,135 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.50 vs. limit=15.0 2023-06-24 19:12:28,325 INFO [train.py:996] (0/4) Epoch 10, batch 28150, loss[loss=0.2586, simple_loss=0.3562, pruned_loss=0.08055, over 19762.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2993, pruned_loss=0.07711, over 4267554.84 frames. ], batch size: 702, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:13:05,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.08 vs. limit=15.0 2023-06-24 19:14:02,374 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.35 vs. 
limit=6.0 2023-06-24 19:14:10,710 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.65 vs. limit=15.0 2023-06-24 19:14:22,727 INFO [train.py:996] (0/4) Epoch 10, batch 28200, loss[loss=0.2233, simple_loss=0.2961, pruned_loss=0.0753, over 21732.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.298, pruned_loss=0.07841, over 4276823.46 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:14:55,505 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-06-24 19:15:28,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1816092.0, ans=0.2 2023-06-24 19:15:29,925 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.315e+02 7.583e+02 1.086e+03 1.711e+03 4.107e+03, threshold=2.171e+03, percent-clipped=18.0 2023-06-24 19:15:34,150 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.04 vs. limit=15.0 2023-06-24 19:15:50,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1816152.0, ans=0.0 2023-06-24 19:16:07,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1816152.0, ans=0.0 2023-06-24 19:16:09,716 INFO [train.py:996] (0/4) Epoch 10, batch 28250, loss[loss=0.2618, simple_loss=0.3122, pruned_loss=0.1057, over 21539.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3009, pruned_loss=0.08054, over 4280190.14 frames. ], batch size: 391, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:16:56,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1816332.0, ans=0.125 2023-06-24 19:17:03,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=15.0 2023-06-24 19:17:32,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-24 19:17:49,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1816452.0, ans=0.2 2023-06-24 19:17:54,866 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=12.0 2023-06-24 19:17:57,274 INFO [train.py:996] (0/4) Epoch 10, batch 28300, loss[loss=0.1766, simple_loss=0.2961, pruned_loss=0.02858, over 20754.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2993, pruned_loss=0.07742, over 4277264.58 frames. 
], batch size: 608, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:18:01,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1816512.0, ans=0.2 2023-06-24 19:18:30,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1816572.0, ans=0.0 2023-06-24 19:18:37,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1816572.0, ans=0.1 2023-06-24 19:19:00,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1816692.0, ans=0.125 2023-06-24 19:19:02,675 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.412e+02 7.987e+02 1.217e+03 2.083e+03 3.970e+03, threshold=2.435e+03, percent-clipped=20.0 2023-06-24 19:19:43,780 INFO [train.py:996] (0/4) Epoch 10, batch 28350, loss[loss=0.1847, simple_loss=0.2513, pruned_loss=0.05899, over 21841.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2953, pruned_loss=0.07161, over 4268567.75 frames. ], batch size: 98, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:21:12,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1817052.0, ans=0.125 2023-06-24 19:21:30,149 INFO [train.py:996] (0/4) Epoch 10, batch 28400, loss[loss=0.2005, simple_loss=0.2715, pruned_loss=0.06472, over 21739.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2913, pruned_loss=0.0717, over 4263853.63 frames. ], batch size: 371, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:21:58,844 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-24 19:22:01,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1817172.0, ans=0.125 2023-06-24 19:22:05,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1817172.0, ans=0.1 2023-06-24 19:22:33,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1817232.0, ans=0.125 2023-06-24 19:22:48,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.250e+02 7.623e+02 1.000e+03 1.515e+03 3.438e+03, threshold=2.000e+03, percent-clipped=5.0 2023-06-24 19:22:52,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817292.0, ans=0.1 2023-06-24 19:23:04,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1817352.0, ans=0.2 2023-06-24 19:23:22,177 INFO [train.py:996] (0/4) Epoch 10, batch 28450, loss[loss=0.2149, simple_loss=0.2874, pruned_loss=0.0712, over 20083.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2974, pruned_loss=0.0754, over 4252708.78 frames. 
], batch size: 703, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:24:28,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1817532.0, ans=0.0 2023-06-24 19:25:10,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1817652.0, ans=0.125 2023-06-24 19:25:19,422 INFO [train.py:996] (0/4) Epoch 10, batch 28500, loss[loss=0.2264, simple_loss=0.3084, pruned_loss=0.07222, over 21773.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2998, pruned_loss=0.07766, over 4264461.17 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:25:25,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1817712.0, ans=0.125 2023-06-24 19:25:33,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1817712.0, ans=0.125 2023-06-24 19:25:33,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1817712.0, ans=0.2 2023-06-24 19:26:27,795 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.659e+02 7.035e+02 9.248e+02 1.262e+03 2.146e+03, threshold=1.850e+03, percent-clipped=2.0 2023-06-24 19:26:29,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817892.0, ans=0.1 2023-06-24 19:27:07,863 INFO [train.py:996] (0/4) Epoch 10, batch 28550, loss[loss=0.2697, simple_loss=0.3656, pruned_loss=0.08693, over 21648.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3079, pruned_loss=0.08045, over 4270964.42 frames. ], batch size: 263, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:27:20,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1818012.0, ans=0.2 2023-06-24 19:27:26,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1818072.0, ans=0.0 2023-06-24 19:27:34,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1818072.0, ans=0.0 2023-06-24 19:27:44,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.62 vs. limit=10.0 2023-06-24 19:28:35,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1818192.0, ans=0.035 2023-06-24 19:28:52,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1818252.0, ans=0.2 2023-06-24 19:28:54,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.08 vs. limit=10.0 2023-06-24 19:28:55,069 INFO [train.py:996] (0/4) Epoch 10, batch 28600, loss[loss=0.2443, simple_loss=0.3263, pruned_loss=0.08122, over 21346.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3162, pruned_loss=0.08359, over 4274550.00 frames. 
], batch size: 131, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:30:08,442 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.757e+02 7.121e+02 1.044e+03 1.505e+03 3.342e+03, threshold=2.089e+03, percent-clipped=18.0 2023-06-24 19:30:19,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1818492.0, ans=0.125 2023-06-24 19:30:19,646 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-24 19:30:24,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1818552.0, ans=0.125 2023-06-24 19:30:42,110 INFO [train.py:996] (0/4) Epoch 10, batch 28650, loss[loss=0.2037, simple_loss=0.2586, pruned_loss=0.07436, over 21267.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3107, pruned_loss=0.08295, over 4272373.84 frames. ], batch size: 549, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:31:12,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1818672.0, ans=0.0 2023-06-24 19:31:59,728 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-24 19:32:14,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1818852.0, ans=0.0 2023-06-24 19:32:34,047 INFO [train.py:996] (0/4) Epoch 10, batch 28700, loss[loss=0.2371, simple_loss=0.3083, pruned_loss=0.08295, over 21299.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3099, pruned_loss=0.08375, over 4277467.17 frames. ], batch size: 143, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:33:13,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1819032.0, ans=0.0 2023-06-24 19:33:52,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.396e+02 6.260e+02 7.915e+02 1.085e+03 2.283e+03, threshold=1.583e+03, percent-clipped=3.0 2023-06-24 19:34:23,863 INFO [train.py:996] (0/4) Epoch 10, batch 28750, loss[loss=0.2181, simple_loss=0.3039, pruned_loss=0.0661, over 21683.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3083, pruned_loss=0.08394, over 4280897.73 frames. ], batch size: 263, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:34:39,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-24 19:34:50,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1819272.0, ans=0.125 2023-06-24 19:36:21,347 INFO [train.py:996] (0/4) Epoch 10, batch 28800, loss[loss=0.2634, simple_loss=0.3308, pruned_loss=0.09798, over 16993.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3124, pruned_loss=0.08367, over 4279035.00 frames. 
], batch size: 60, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:36:38,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1819572.0, ans=0.125 2023-06-24 19:37:11,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1819632.0, ans=0.125 2023-06-24 19:37:29,866 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.249e+02 6.492e+02 9.041e+02 1.378e+03 2.887e+03, threshold=1.808e+03, percent-clipped=17.0 2023-06-24 19:37:56,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1819752.0, ans=0.1 2023-06-24 19:38:07,897 INFO [train.py:996] (0/4) Epoch 10, batch 28850, loss[loss=0.2352, simple_loss=0.3058, pruned_loss=0.08228, over 21885.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3146, pruned_loss=0.08569, over 4286734.77 frames. ], batch size: 371, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:38:28,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1819872.0, ans=0.04949747468305833 2023-06-24 19:38:57,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1819932.0, ans=0.025 2023-06-24 19:39:30,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1819992.0, ans=0.1 2023-06-24 19:39:43,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1820052.0, ans=0.2 2023-06-24 19:39:58,252 INFO [train.py:996] (0/4) Epoch 10, batch 28900, loss[loss=0.316, simple_loss=0.4339, pruned_loss=0.09908, over 19745.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.318, pruned_loss=0.08811, over 4282803.38 frames. ], batch size: 702, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:40:22,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1820172.0, ans=0.0 2023-06-24 19:40:37,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1820172.0, ans=0.05 2023-06-24 19:40:44,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1820232.0, ans=0.1 2023-06-24 19:41:22,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.789e+02 7.717e+02 1.074e+03 1.480e+03 3.570e+03, threshold=2.148e+03, percent-clipped=12.0 2023-06-24 19:41:40,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1820352.0, ans=0.125 2023-06-24 19:41:49,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1820352.0, ans=0.125 2023-06-24 19:41:56,150 INFO [train.py:996] (0/4) Epoch 10, batch 28950, loss[loss=0.2877, simple_loss=0.3722, pruned_loss=0.1016, over 21555.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3185, pruned_loss=0.08674, over 4276567.74 frames. 
], batch size: 471, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:42:08,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1820412.0, ans=0.0 2023-06-24 19:42:33,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1820472.0, ans=0.0 2023-06-24 19:42:47,675 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:43:32,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1820652.0, ans=0.125 2023-06-24 19:43:53,429 INFO [train.py:996] (0/4) Epoch 10, batch 29000, loss[loss=0.2376, simple_loss=0.3203, pruned_loss=0.07749, over 21776.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.322, pruned_loss=0.08547, over 4277543.78 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:44:04,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1820712.0, ans=0.0 2023-06-24 19:44:51,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1820832.0, ans=0.0 2023-06-24 19:45:02,233 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.875e+02 6.963e+02 8.880e+02 1.390e+03 4.828e+03, threshold=1.776e+03, percent-clipped=11.0 2023-06-24 19:45:02,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1820892.0, ans=0.125 2023-06-24 19:45:19,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1820952.0, ans=0.1 2023-06-24 19:45:41,337 INFO [train.py:996] (0/4) Epoch 10, batch 29050, loss[loss=0.2261, simple_loss=0.2961, pruned_loss=0.07807, over 21286.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3194, pruned_loss=0.08544, over 4284012.26 frames. ], batch size: 176, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:45:53,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1821012.0, ans=0.0 2023-06-24 19:46:24,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1821132.0, ans=0.1 2023-06-24 19:46:56,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=22.5 2023-06-24 19:47:27,573 INFO [train.py:996] (0/4) Epoch 10, batch 29100, loss[loss=0.1712, simple_loss=0.2354, pruned_loss=0.05344, over 21488.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3105, pruned_loss=0.08319, over 4283264.40 frames. 
], batch size: 212, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:47:48,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1821372.0, ans=0.0 2023-06-24 19:48:39,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1821492.0, ans=0.125 2023-06-24 19:48:40,596 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.869e+02 7.571e+02 1.020e+03 1.542e+03 3.510e+03, threshold=2.040e+03, percent-clipped=14.0 2023-06-24 19:49:10,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1821552.0, ans=0.0 2023-06-24 19:49:13,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1821552.0, ans=0.125 2023-06-24 19:49:15,525 INFO [train.py:996] (0/4) Epoch 10, batch 29150, loss[loss=0.2466, simple_loss=0.3392, pruned_loss=0.07695, over 21796.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3089, pruned_loss=0.08147, over 4281918.44 frames. ], batch size: 282, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:49:17,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1821612.0, ans=0.125 2023-06-24 19:50:00,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1821732.0, ans=0.1 2023-06-24 19:50:59,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1821852.0, ans=0.0 2023-06-24 19:51:04,099 INFO [train.py:996] (0/4) Epoch 10, batch 29200, loss[loss=0.1934, simple_loss=0.2639, pruned_loss=0.06145, over 21328.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3056, pruned_loss=0.08125, over 4275828.82 frames. ], batch size: 131, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:51:48,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1822032.0, ans=0.0 2023-06-24 19:52:10,409 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=22.5 2023-06-24 19:52:17,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.365e+02 7.371e+02 1.140e+03 1.562e+03 2.800e+03, threshold=2.281e+03, percent-clipped=10.0 2023-06-24 19:52:53,173 INFO [train.py:996] (0/4) Epoch 10, batch 29250, loss[loss=0.2381, simple_loss=0.329, pruned_loss=0.07359, over 21721.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3034, pruned_loss=0.07856, over 4270424.52 frames. ], batch size: 352, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:52:56,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=12.0 2023-06-24 19:53:12,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1822212.0, ans=0.125 2023-06-24 19:53:50,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=12.0 2023-06-24 19:54:41,036 INFO [train.py:996] (0/4) Epoch 10, batch 29300, loss[loss=0.2341, simple_loss=0.3017, pruned_loss=0.08327, over 21309.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.307, pruned_loss=0.07837, over 4278678.04 frames. 
], batch size: 471, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:55:55,077 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=22.5 2023-06-24 19:56:02,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.304e+02 6.696e+02 9.053e+02 1.473e+03 3.162e+03, threshold=1.811e+03, percent-clipped=5.0 2023-06-24 19:56:30,779 INFO [train.py:996] (0/4) Epoch 10, batch 29350, loss[loss=0.2068, simple_loss=0.2905, pruned_loss=0.06151, over 21682.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3029, pruned_loss=0.07754, over 4269040.22 frames. ], batch size: 282, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:57:05,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1822872.0, ans=0.07 2023-06-24 19:57:52,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1822992.0, ans=0.0 2023-06-24 19:58:11,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1823052.0, ans=0.1 2023-06-24 19:58:31,672 INFO [train.py:996] (0/4) Epoch 10, batch 29400, loss[loss=0.2259, simple_loss=0.3204, pruned_loss=0.06566, over 21195.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3018, pruned_loss=0.07547, over 4256745.44 frames. ], batch size: 548, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:58:34,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=12.0 2023-06-24 19:59:39,420 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.735e+02 7.217e+02 1.208e+03 1.784e+03 4.520e+03, threshold=2.416e+03, percent-clipped=24.0 2023-06-24 20:00:11,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1823352.0, ans=0.125 2023-06-24 20:00:18,175 INFO [train.py:996] (0/4) Epoch 10, batch 29450, loss[loss=0.2696, simple_loss=0.3453, pruned_loss=0.09692, over 21608.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3004, pruned_loss=0.07425, over 4266849.28 frames. ], batch size: 389, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 20:00:32,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1823412.0, ans=0.125 2023-06-24 20:01:57,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1823712.0, ans=0.1 2023-06-24 20:01:58,247 INFO [train.py:996] (0/4) Epoch 10, batch 29500, loss[loss=0.1932, simple_loss=0.2625, pruned_loss=0.06192, over 20225.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3039, pruned_loss=0.07687, over 4268343.45 frames. 
], batch size: 703, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 20:02:18,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1823772.0, ans=0.1 2023-06-24 20:03:06,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.744e+02 7.201e+02 9.951e+02 1.281e+03 3.079e+03, threshold=1.990e+03, percent-clipped=2.0 2023-06-24 20:03:37,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1823952.0, ans=0.125 2023-06-24 20:03:38,685 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-304000.pt 2023-06-24 20:03:45,402 INFO [train.py:996] (0/4) Epoch 10, batch 29550, loss[loss=0.275, simple_loss=0.3316, pruned_loss=0.1092, over 21818.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3031, pruned_loss=0.07859, over 4279449.81 frames. ], batch size: 441, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 20:03:50,003 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.00 vs. limit=6.0 2023-06-24 20:04:25,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1824072.0, ans=0.0 2023-06-24 20:04:42,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1824132.0, ans=0.125 2023-06-24 20:05:38,103 INFO [train.py:996] (0/4) Epoch 10, batch 29600, loss[loss=0.2353, simple_loss=0.3153, pruned_loss=0.07767, over 21353.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3093, pruned_loss=0.08065, over 4283749.89 frames. ], batch size: 176, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 20:06:14,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1824372.0, ans=0.09899494936611666 2023-06-24 20:06:21,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1824432.0, ans=0.125 2023-06-24 20:06:54,890 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.014e+02 8.080e+02 1.124e+03 1.418e+03 4.047e+03, threshold=2.247e+03, percent-clipped=9.0 2023-06-24 20:07:12,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1824552.0, ans=0.125 2023-06-24 20:07:22,982 INFO [train.py:996] (0/4) Epoch 10, batch 29650, loss[loss=0.276, simple_loss=0.3413, pruned_loss=0.1054, over 21711.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.306, pruned_loss=0.07759, over 4288848.73 frames. ], batch size: 441, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:07:33,914 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:07:39,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1824672.0, ans=0.0 2023-06-24 20:08:23,049 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. 
limit=6.0 2023-06-24 20:08:30,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1824792.0, ans=0.125 2023-06-24 20:08:48,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1824852.0, ans=0.0 2023-06-24 20:09:01,091 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:09:10,418 INFO [train.py:996] (0/4) Epoch 10, batch 29700, loss[loss=0.2571, simple_loss=0.3604, pruned_loss=0.07687, over 21800.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3066, pruned_loss=0.07695, over 4295320.00 frames. ], batch size: 282, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:09:18,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-24 20:09:23,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.50 vs. limit=10.0 2023-06-24 20:09:39,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1824972.0, ans=0.04949747468305833 2023-06-24 20:09:59,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-24 20:10:28,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1825092.0, ans=0.125 2023-06-24 20:10:29,383 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.654e+02 7.202e+02 1.096e+03 1.682e+03 4.155e+03, threshold=2.193e+03, percent-clipped=13.0 2023-06-24 20:10:37,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1825152.0, ans=0.125 2023-06-24 20:10:52,403 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-24 20:10:58,068 INFO [train.py:996] (0/4) Epoch 10, batch 29750, loss[loss=0.2178, simple_loss=0.2901, pruned_loss=0.07272, over 21331.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3123, pruned_loss=0.0776, over 4292973.27 frames. ], batch size: 176, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:11:03,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1825212.0, ans=0.125 2023-06-24 20:11:22,714 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-24 20:12:21,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1825452.0, ans=0.0 2023-06-24 20:12:21,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1825452.0, ans=0.125 2023-06-24 20:12:44,358 INFO [train.py:996] (0/4) Epoch 10, batch 29800, loss[loss=0.2251, simple_loss=0.3093, pruned_loss=0.07049, over 21532.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3152, pruned_loss=0.07839, over 4289291.10 frames. 
], batch size: 211, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:13:04,594 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=22.5 2023-06-24 20:13:19,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1825572.0, ans=0.05 2023-06-24 20:13:59,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1825692.0, ans=0.125 2023-06-24 20:14:03,889 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.147e+02 7.156e+02 9.428e+02 1.298e+03 2.212e+03, threshold=1.886e+03, percent-clipped=2.0 2023-06-24 20:14:19,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1825752.0, ans=0.125 2023-06-24 20:14:32,501 INFO [train.py:996] (0/4) Epoch 10, batch 29850, loss[loss=0.2217, simple_loss=0.2927, pruned_loss=0.07536, over 21772.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3119, pruned_loss=0.07675, over 4282229.46 frames. ], batch size: 247, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:14:32,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1825812.0, ans=0.1 2023-06-24 20:15:05,839 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-24 20:15:07,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-24 20:15:16,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1825932.0, ans=0.0 2023-06-24 20:15:42,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1825992.0, ans=0.1 2023-06-24 20:16:13,645 INFO [train.py:996] (0/4) Epoch 10, batch 29900, loss[loss=0.1751, simple_loss=0.2235, pruned_loss=0.06336, over 20198.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3098, pruned_loss=0.07824, over 4284722.40 frames. ], batch size: 704, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:16:25,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1826112.0, ans=0.125 2023-06-24 20:16:54,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826232.0, ans=0.1 2023-06-24 20:17:23,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826292.0, ans=0.1 2023-06-24 20:17:37,204 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.094e+02 6.347e+02 7.852e+02 1.102e+03 2.302e+03, threshold=1.570e+03, percent-clipped=5.0 2023-06-24 20:18:07,373 INFO [train.py:996] (0/4) Epoch 10, batch 29950, loss[loss=0.1976, simple_loss=0.2543, pruned_loss=0.07045, over 20185.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3135, pruned_loss=0.08261, over 4278087.19 frames. 
], batch size: 707, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:18:23,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-24 20:18:46,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826532.0, ans=0.1 2023-06-24 20:19:38,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826652.0, ans=0.1 2023-06-24 20:19:39,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=22.5 2023-06-24 20:19:41,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1826652.0, ans=0.1 2023-06-24 20:19:54,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1826712.0, ans=0.0 2023-06-24 20:19:55,939 INFO [train.py:996] (0/4) Epoch 10, batch 30000, loss[loss=0.1939, simple_loss=0.2897, pruned_loss=0.049, over 21639.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.316, pruned_loss=0.08243, over 4271363.16 frames. ], batch size: 230, lr: 2.86e-03, grad_scale: 32.0 2023-06-24 20:19:55,940 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 20:20:12,957 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([0.8854, 1.7411, 1.6481, 2.4997], device='cuda:0') 2023-06-24 20:20:14,320 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2483, simple_loss=0.3443, pruned_loss=0.07614, over 1796401.00 frames. 2023-06-24 20:20:14,321 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 20:20:26,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1826712.0, ans=0.2 2023-06-24 20:20:28,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1826712.0, ans=0.07 2023-06-24 20:20:30,215 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:21:30,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1826892.0, ans=0.1 2023-06-24 20:21:40,537 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.194e+02 6.464e+02 9.828e+02 1.489e+03 3.469e+03, threshold=1.966e+03, percent-clipped=22.0 2023-06-24 20:22:02,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1826952.0, ans=0.125 2023-06-24 20:22:16,942 INFO [train.py:996] (0/4) Epoch 10, batch 30050, loss[loss=0.2369, simple_loss=0.3489, pruned_loss=0.0624, over 20801.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3169, pruned_loss=0.07838, over 4273715.28 frames. 
], batch size: 607, lr: 2.86e-03, grad_scale: 32.0 2023-06-24 20:23:06,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1827132.0, ans=0.0 2023-06-24 20:23:52,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1827252.0, ans=0.125 2023-06-24 20:23:59,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1827252.0, ans=0.05 2023-06-24 20:24:02,316 INFO [train.py:996] (0/4) Epoch 10, batch 30100, loss[loss=0.2553, simple_loss=0.3072, pruned_loss=0.1017, over 21209.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3162, pruned_loss=0.07867, over 4269020.69 frames. ], batch size: 176, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:24:17,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1827312.0, ans=0.125 2023-06-24 20:24:57,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1827432.0, ans=0.125 2023-06-24 20:25:17,443 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-24 20:25:20,864 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.518e+02 7.739e+02 1.138e+03 1.850e+03 3.841e+03, threshold=2.275e+03, percent-clipped=20.0 2023-06-24 20:25:32,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1827552.0, ans=0.125 2023-06-24 20:25:41,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1827552.0, ans=0.125 2023-06-24 20:25:41,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1827552.0, ans=0.09899494936611666 2023-06-24 20:25:46,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1827552.0, ans=0.0 2023-06-24 20:25:54,251 INFO [train.py:996] (0/4) Epoch 10, batch 30150, loss[loss=0.2862, simple_loss=0.3458, pruned_loss=0.1134, over 21281.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3137, pruned_loss=0.08129, over 4272018.92 frames. 
], batch size: 143, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:26:03,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1827612.0, ans=15.0 2023-06-24 20:26:08,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1827612.0, ans=0.125 2023-06-24 20:26:14,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1827672.0, ans=0.125 2023-06-24 20:26:37,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1827732.0, ans=0.0 2023-06-24 20:27:09,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1827792.0, ans=0.125 2023-06-24 20:27:37,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1827852.0, ans=0.125 2023-06-24 20:27:43,930 INFO [train.py:996] (0/4) Epoch 10, batch 30200, loss[loss=0.2289, simple_loss=0.307, pruned_loss=0.07544, over 21412.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3159, pruned_loss=0.08042, over 4269055.04 frames. ], batch size: 211, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:28:11,523 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.08 vs. limit=22.5 2023-06-24 20:28:45,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1828032.0, ans=0.1 2023-06-24 20:28:45,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1828032.0, ans=0.125 2023-06-24 20:29:09,514 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.062e+02 7.107e+02 1.003e+03 1.591e+03 3.654e+03, threshold=2.006e+03, percent-clipped=8.0 2023-06-24 20:29:37,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1828212.0, ans=0.0 2023-06-24 20:29:38,223 INFO [train.py:996] (0/4) Epoch 10, batch 30250, loss[loss=0.2487, simple_loss=0.3078, pruned_loss=0.09482, over 21937.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3217, pruned_loss=0.08146, over 4273307.19 frames. ], batch size: 98, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:30:46,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1828392.0, ans=0.0 2023-06-24 20:30:57,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-24 20:31:24,958 INFO [train.py:996] (0/4) Epoch 10, batch 30300, loss[loss=0.2116, simple_loss=0.2752, pruned_loss=0.07399, over 21647.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3187, pruned_loss=0.08173, over 4270682.66 frames. ], batch size: 282, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:31:27,567 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.89 vs. 
limit=12.0 2023-06-24 20:31:49,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1828572.0, ans=0.0 2023-06-24 20:32:47,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.072e+02 7.345e+02 9.964e+02 1.414e+03 3.448e+03, threshold=1.993e+03, percent-clipped=9.0 2023-06-24 20:32:49,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1828692.0, ans=0.0 2023-06-24 20:33:21,549 INFO [train.py:996] (0/4) Epoch 10, batch 30350, loss[loss=0.2241, simple_loss=0.3028, pruned_loss=0.07267, over 21646.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3179, pruned_loss=0.08273, over 4263339.94 frames. ], batch size: 247, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:33:45,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. limit=10.0 2023-06-24 20:33:58,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1828932.0, ans=0.125 2023-06-24 20:34:04,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5 2023-06-24 20:34:13,623 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:34:50,315 INFO [train.py:996] (0/4) Epoch 10, batch 30400, loss[loss=0.2065, simple_loss=0.2551, pruned_loss=0.07896, over 20220.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3141, pruned_loss=0.08138, over 4257905.63 frames. ], batch size: 703, lr: 2.86e-03, grad_scale: 32.0 2023-06-24 20:35:32,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1829232.0, ans=0.125 2023-06-24 20:35:32,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1829232.0, ans=0.125 2023-06-24 20:35:33,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1829232.0, ans=0.125 2023-06-24 20:35:57,900 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.475e+02 9.226e+02 1.502e+03 2.385e+03 9.827e+03, threshold=3.004e+03, percent-clipped=34.0 2023-06-24 20:36:18,164 INFO [train.py:996] (0/4) Epoch 10, batch 30450, loss[loss=0.2767, simple_loss=0.4031, pruned_loss=0.07515, over 19850.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3157, pruned_loss=0.08059, over 4199467.44 frames. ], batch size: 702, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:37:28,647 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/epoch-10.pt 2023-06-24 20:39:22,105 INFO [train.py:996] (0/4) Epoch 11, batch 0, loss[loss=0.242, simple_loss=0.3008, pruned_loss=0.09157, over 21498.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3008, pruned_loss=0.09157, over 21498.00 frames. ], batch size: 195, lr: 2.72e-03, grad_scale: 32.0 2023-06-24 20:39:22,107 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 20:39:38,859 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2455, simple_loss=0.3504, pruned_loss=0.07029, over 1796401.00 frames. 
2023-06-24 20:39:38,860 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 20:40:07,364 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-24 20:40:52,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1829916.0, ans=0.5 2023-06-24 20:41:04,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.679e+02 1.420e+03 2.190e+03 4.363e+03 1.061e+04, threshold=4.380e+03, percent-clipped=34.0 2023-06-24 20:41:19,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1829976.0, ans=0.0 2023-06-24 20:41:20,426 INFO [train.py:996] (0/4) Epoch 11, batch 50, loss[loss=0.2864, simple_loss=0.366, pruned_loss=0.1034, over 21485.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.322, pruned_loss=0.08265, over 961736.41 frames. ], batch size: 471, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:41:56,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1830096.0, ans=0.125 2023-06-24 20:42:35,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1830216.0, ans=0.0 2023-06-24 20:42:56,723 INFO [train.py:996] (0/4) Epoch 11, batch 100, loss[loss=0.2083, simple_loss=0.2814, pruned_loss=0.06763, over 21847.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3329, pruned_loss=0.08485, over 1688681.54 frames. ], batch size: 118, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:43:08,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1830276.0, ans=0.125 2023-06-24 20:43:45,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1830396.0, ans=0.02 2023-06-24 20:43:45,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1830396.0, ans=0.125 2023-06-24 20:44:14,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-24 20:44:30,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.668e+02 7.799e+02 1.011e+03 1.345e+03 2.704e+03, threshold=2.023e+03, percent-clipped=0.0 2023-06-24 20:44:46,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1830516.0, ans=0.0 2023-06-24 20:44:50,621 INFO [train.py:996] (0/4) Epoch 11, batch 150, loss[loss=0.2037, simple_loss=0.2551, pruned_loss=0.0761, over 16366.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3368, pruned_loss=0.08494, over 2263693.75 frames. ], batch size: 64, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:44:58,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. 
limit=15.0 2023-06-24 20:45:21,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1830636.0, ans=0.0 2023-06-24 20:45:59,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1830756.0, ans=0.0 2023-06-24 20:46:31,239 INFO [train.py:996] (0/4) Epoch 11, batch 200, loss[loss=0.1991, simple_loss=0.2755, pruned_loss=0.06141, over 21868.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.333, pruned_loss=0.08303, over 2706876.45 frames. ], batch size: 283, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:47:10,403 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-24 20:47:41,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1831056.0, ans=0.125 2023-06-24 20:47:55,403 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.519e+02 7.230e+02 1.005e+03 1.517e+03 6.245e+03, threshold=2.009e+03, percent-clipped=15.0 2023-06-24 20:48:06,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1831116.0, ans=0.2 2023-06-24 20:48:15,620 INFO [train.py:996] (0/4) Epoch 11, batch 250, loss[loss=0.2729, simple_loss=0.3826, pruned_loss=0.08157, over 19781.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3294, pruned_loss=0.08383, over 3048692.27 frames. ], batch size: 703, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:48:17,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1831176.0, ans=0.0 2023-06-24 20:48:53,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1831296.0, ans=0.125 2023-06-24 20:50:00,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1831476.0, ans=0.1 2023-06-24 20:50:01,007 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-24 20:50:01,299 INFO [train.py:996] (0/4) Epoch 11, batch 300, loss[loss=0.1946, simple_loss=0.2615, pruned_loss=0.06384, over 21092.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3234, pruned_loss=0.08312, over 3325667.98 frames. 
], batch size: 607, lr: 2.72e-03, grad_scale: 8.0 2023-06-24 20:50:15,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1831476.0, ans=0.125 2023-06-24 20:50:18,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1831536.0, ans=0.125 2023-06-24 20:50:34,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1831536.0, ans=0.2 2023-06-24 20:51:06,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1831656.0, ans=0.125 2023-06-24 20:51:21,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1831716.0, ans=0.0 2023-06-24 20:51:30,482 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.045e+02 7.499e+02 1.164e+03 1.692e+03 3.059e+03, threshold=2.329e+03, percent-clipped=16.0 2023-06-24 20:51:50,459 INFO [train.py:996] (0/4) Epoch 11, batch 350, loss[loss=0.2115, simple_loss=0.2752, pruned_loss=0.07393, over 21479.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3156, pruned_loss=0.08097, over 3519599.30 frames. ], batch size: 195, lr: 2.72e-03, grad_scale: 8.0 2023-06-24 20:52:02,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1831776.0, ans=0.125 2023-06-24 20:52:07,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1831836.0, ans=0.125 2023-06-24 20:52:24,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1831836.0, ans=0.1 2023-06-24 20:52:44,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1831896.0, ans=10.0 2023-06-24 20:53:14,079 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=12.0 2023-06-24 20:53:25,874 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-06-24 20:53:30,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1832076.0, ans=0.0 2023-06-24 20:53:31,122 INFO [train.py:996] (0/4) Epoch 11, batch 400, loss[loss=0.2612, simple_loss=0.363, pruned_loss=0.07974, over 21847.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3105, pruned_loss=0.07852, over 3684911.47 frames. 
], batch size: 316, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:53:39,583 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:53:42,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1832076.0, ans=0.0 2023-06-24 20:53:44,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1832076.0, ans=0.125 2023-06-24 20:53:49,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1832076.0, ans=0.0 2023-06-24 20:53:57,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1832136.0, ans=0.125 2023-06-24 20:54:02,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1832136.0, ans=0.0 2023-06-24 20:54:59,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1832316.0, ans=0.0 2023-06-24 20:55:08,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1832316.0, ans=0.125 2023-06-24 20:55:11,330 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.125e+02 8.397e+02 1.523e+03 1.983e+03 4.862e+03, threshold=3.046e+03, percent-clipped=16.0 2023-06-24 20:55:18,139 INFO [train.py:996] (0/4) Epoch 11, batch 450, loss[loss=0.1895, simple_loss=0.2562, pruned_loss=0.06139, over 21586.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3073, pruned_loss=0.0779, over 3817409.03 frames. ], batch size: 263, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:55:28,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1832376.0, ans=0.125 2023-06-24 20:55:33,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1832376.0, ans=0.0 2023-06-24 20:55:56,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1832436.0, ans=0.0 2023-06-24 20:56:38,752 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:57:08,352 INFO [train.py:996] (0/4) Epoch 11, batch 500, loss[loss=0.2407, simple_loss=0.3162, pruned_loss=0.08257, over 21294.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3055, pruned_loss=0.07692, over 3916689.52 frames. 
], batch size: 159, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:57:10,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1832676.0, ans=0.0 2023-06-24 20:57:45,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1832796.0, ans=0.125 2023-06-24 20:58:04,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1832796.0, ans=0.125 2023-06-24 20:58:19,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1832856.0, ans=0.125 2023-06-24 20:58:40,222 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 9.876e+02 1.724e+03 2.578e+03 4.436e+03, threshold=3.448e+03, percent-clipped=13.0 2023-06-24 20:58:53,044 INFO [train.py:996] (0/4) Epoch 11, batch 550, loss[loss=0.2495, simple_loss=0.3219, pruned_loss=0.08856, over 21746.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3099, pruned_loss=0.07728, over 3998471.93 frames. ], batch size: 441, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:59:00,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1832976.0, ans=0.2 2023-06-24 21:00:38,860 INFO [train.py:996] (0/4) Epoch 11, batch 600, loss[loss=0.2167, simple_loss=0.2958, pruned_loss=0.06876, over 21675.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3152, pruned_loss=0.07738, over 4066134.54 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:01:33,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1833396.0, ans=0.2 2023-06-24 21:02:06,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1833516.0, ans=0.125 2023-06-24 21:02:13,832 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.680e+02 7.057e+02 1.048e+03 1.641e+03 3.624e+03, threshold=2.096e+03, percent-clipped=2.0 2023-06-24 21:02:26,675 INFO [train.py:996] (0/4) Epoch 11, batch 650, loss[loss=0.2031, simple_loss=0.279, pruned_loss=0.06365, over 21685.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3167, pruned_loss=0.07792, over 4119146.90 frames. ], batch size: 316, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:02:33,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1833576.0, ans=0.125 2023-06-24 21:03:26,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1833756.0, ans=0.125 2023-06-24 21:03:52,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1833816.0, ans=0.125 2023-06-24 21:04:04,603 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-24 21:04:04,939 INFO [train.py:996] (0/4) Epoch 11, batch 700, loss[loss=0.2419, simple_loss=0.321, pruned_loss=0.08139, over 21841.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.317, pruned_loss=0.07915, over 4155892.57 frames. 
], batch size: 124, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:05:03,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1833996.0, ans=0.2 2023-06-24 21:05:16,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1834056.0, ans=0.125 2023-06-24 21:05:37,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1834116.0, ans=0.125 2023-06-24 21:05:44,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.745e+02 9.389e+02 1.418e+03 2.158e+03 4.228e+03, threshold=2.836e+03, percent-clipped=28.0 2023-06-24 21:05:50,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1834176.0, ans=0.0 2023-06-24 21:05:51,304 INFO [train.py:996] (0/4) Epoch 11, batch 750, loss[loss=0.2551, simple_loss=0.3679, pruned_loss=0.07117, over 21728.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3146, pruned_loss=0.07931, over 4190056.32 frames. ], batch size: 414, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:06:04,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1834176.0, ans=0.1 2023-06-24 21:06:09,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1834176.0, ans=0.025 2023-06-24 21:06:11,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1834176.0, ans=0.1 2023-06-24 21:06:26,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1834236.0, ans=0.0 2023-06-24 21:06:39,292 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:07:03,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1834356.0, ans=0.1 2023-06-24 21:07:40,984 INFO [train.py:996] (0/4) Epoch 11, batch 800, loss[loss=0.2071, simple_loss=0.2778, pruned_loss=0.06821, over 21367.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.313, pruned_loss=0.07938, over 4213703.90 frames. ], batch size: 211, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:07:50,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1834476.0, ans=0.0 2023-06-24 21:08:27,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1834596.0, ans=0.0 2023-06-24 21:08:56,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1834656.0, ans=0.125 2023-06-24 21:09:01,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1834656.0, ans=0.2 2023-06-24 21:09:21,224 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.816e+02 7.606e+02 1.363e+03 2.031e+03 4.976e+03, threshold=2.727e+03, percent-clipped=7.0 2023-06-24 21:09:32,195 INFO [train.py:996] (0/4) Epoch 11, batch 850, loss[loss=0.2057, simple_loss=0.278, pruned_loss=0.06675, over 21889.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3106, pruned_loss=0.0791, over 4230825.18 frames. 
], batch size: 332, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:09:55,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1834836.0, ans=0.125 2023-06-24 21:10:29,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1834896.0, ans=0.125 2023-06-24 21:11:20,176 INFO [train.py:996] (0/4) Epoch 11, batch 900, loss[loss=0.2264, simple_loss=0.2961, pruned_loss=0.07839, over 21758.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3075, pruned_loss=0.07868, over 4233369.65 frames. ], batch size: 391, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:12:52,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1835316.0, ans=0.0 2023-06-24 21:12:53,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-06-24 21:13:05,008 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.933e+02 7.581e+02 9.805e+02 1.489e+03 3.191e+03, threshold=1.961e+03, percent-clipped=4.0 2023-06-24 21:13:08,503 INFO [train.py:996] (0/4) Epoch 11, batch 950, loss[loss=0.2207, simple_loss=0.2878, pruned_loss=0.07678, over 21252.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3051, pruned_loss=0.07855, over 4248749.64 frames. ], batch size: 159, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:13:11,779 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.63 vs. limit=15.0 2023-06-24 21:13:32,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1835436.0, ans=0.0 2023-06-24 21:13:47,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1835436.0, ans=0.125 2023-06-24 21:13:59,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1835496.0, ans=0.125 2023-06-24 21:14:20,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1835556.0, ans=0.125 2023-06-24 21:14:57,549 INFO [train.py:996] (0/4) Epoch 11, batch 1000, loss[loss=0.2316, simple_loss=0.3063, pruned_loss=0.07848, over 21865.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3041, pruned_loss=0.07796, over 4262949.07 frames. 
], batch size: 414, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:15:07,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1835676.0, ans=0.1 2023-06-24 21:15:37,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1835736.0, ans=0.2 2023-06-24 21:16:07,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1835856.0, ans=0.2 2023-06-24 21:16:38,012 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:16:49,681 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.933e+02 6.726e+02 9.371e+02 1.402e+03 3.411e+03, threshold=1.874e+03, percent-clipped=8.0 2023-06-24 21:16:51,020 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-24 21:16:53,262 INFO [train.py:996] (0/4) Epoch 11, batch 1050, loss[loss=0.2122, simple_loss=0.284, pruned_loss=0.07027, over 21418.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3044, pruned_loss=0.07778, over 4266879.75 frames. ], batch size: 194, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:16:53,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1835976.0, ans=0.0 2023-06-24 21:17:26,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1836036.0, ans=0.0 2023-06-24 21:17:31,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1836036.0, ans=0.2 2023-06-24 21:17:50,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-06-24 21:17:53,364 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=22.5 2023-06-24 21:18:27,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1836216.0, ans=0.125 2023-06-24 21:18:39,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1836216.0, ans=0.0 2023-06-24 21:18:43,424 INFO [train.py:996] (0/4) Epoch 11, batch 1100, loss[loss=0.2476, simple_loss=0.3219, pruned_loss=0.08667, over 21442.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3067, pruned_loss=0.0789, over 4274196.72 frames. ], batch size: 548, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:19:16,270 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.44 vs. 
limit=15.0 2023-06-24 21:19:18,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1836336.0, ans=0.1 2023-06-24 21:19:23,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1836396.0, ans=0.0 2023-06-24 21:19:25,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1836396.0, ans=0.0 2023-06-24 21:19:57,309 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:20:15,187 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=15.0 2023-06-24 21:20:20,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1836516.0, ans=0.125 2023-06-24 21:20:26,811 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.856e+02 8.203e+02 1.251e+03 2.125e+03 4.416e+03, threshold=2.502e+03, percent-clipped=31.0 2023-06-24 21:20:36,481 INFO [train.py:996] (0/4) Epoch 11, batch 1150, loss[loss=0.2006, simple_loss=0.2748, pruned_loss=0.06326, over 21362.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3065, pruned_loss=0.07811, over 4279918.63 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:21:11,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1836636.0, ans=0.1 2023-06-24 21:21:25,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-24 21:22:25,724 INFO [train.py:996] (0/4) Epoch 11, batch 1200, loss[loss=0.2831, simple_loss=0.364, pruned_loss=0.1011, over 21583.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3091, pruned_loss=0.07839, over 4283076.98 frames. ], batch size: 441, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:22:30,403 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.77 vs. limit=6.0 2023-06-24 21:22:43,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1836936.0, ans=0.125 2023-06-24 21:22:56,652 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. limit=6.0 2023-06-24 21:24:00,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1837116.0, ans=0.1 2023-06-24 21:24:05,090 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.099e+02 7.723e+02 1.059e+03 1.468e+03 2.676e+03, threshold=2.118e+03, percent-clipped=4.0 2023-06-24 21:24:14,381 INFO [train.py:996] (0/4) Epoch 11, batch 1250, loss[loss=0.2528, simple_loss=0.3196, pruned_loss=0.09297, over 21758.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3123, pruned_loss=0.07956, over 4283867.04 frames. 
], batch size: 112, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:26:01,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1837416.0, ans=0.0 2023-06-24 21:26:04,442 INFO [train.py:996] (0/4) Epoch 11, batch 1300, loss[loss=0.2924, simple_loss=0.3566, pruned_loss=0.1141, over 21678.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3144, pruned_loss=0.08058, over 4282646.27 frames. ], batch size: 507, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:26:27,778 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.40 vs. limit=6.0 2023-06-24 21:27:10,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1837656.0, ans=0.04949747468305833 2023-06-24 21:27:31,374 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5 2023-06-24 21:27:52,139 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.439e+02 7.719e+02 9.846e+02 1.503e+03 2.792e+03, threshold=1.969e+03, percent-clipped=4.0 2023-06-24 21:27:53,895 INFO [train.py:996] (0/4) Epoch 11, batch 1350, loss[loss=0.2508, simple_loss=0.3291, pruned_loss=0.08621, over 21595.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.315, pruned_loss=0.08089, over 4291571.07 frames. ], batch size: 415, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:28:25,046 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2023-06-24 21:28:28,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-24 21:29:38,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1838016.0, ans=0.125 2023-06-24 21:29:43,487 INFO [train.py:996] (0/4) Epoch 11, batch 1400, loss[loss=0.3019, simple_loss=0.3615, pruned_loss=0.1212, over 21649.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3134, pruned_loss=0.08132, over 4290220.70 frames. ], batch size: 507, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:30:28,755 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.66 vs. limit=15.0 2023-06-24 21:30:53,087 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-24 21:31:02,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1838256.0, ans=22.5 2023-06-24 21:31:17,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1838316.0, ans=0.2 2023-06-24 21:31:25,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1838316.0, ans=0.125 2023-06-24 21:31:31,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.111e+02 8.921e+02 1.291e+03 1.879e+03 3.355e+03, threshold=2.582e+03, percent-clipped=19.0 2023-06-24 21:31:33,641 INFO [train.py:996] (0/4) Epoch 11, batch 1450, loss[loss=0.2331, simple_loss=0.3157, pruned_loss=0.07521, over 21458.00 frames. 
], tot_loss[loss=0.2384, simple_loss=0.3134, pruned_loss=0.08172, over 4291248.67 frames. ], batch size: 211, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:32:26,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1838496.0, ans=0.125 2023-06-24 21:32:30,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-24 21:33:21,342 INFO [train.py:996] (0/4) Epoch 11, batch 1500, loss[loss=0.205, simple_loss=0.3139, pruned_loss=0.04805, over 20941.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3157, pruned_loss=0.08326, over 4296224.86 frames. ], batch size: 608, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:33:30,894 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=15.0 2023-06-24 21:34:21,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1838796.0, ans=0.125 2023-06-24 21:34:21,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.90 vs. limit=12.0 2023-06-24 21:35:08,573 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.229e+02 8.037e+02 1.041e+03 1.486e+03 3.371e+03, threshold=2.081e+03, percent-clipped=9.0 2023-06-24 21:35:10,393 INFO [train.py:996] (0/4) Epoch 11, batch 1550, loss[loss=0.2503, simple_loss=0.3181, pruned_loss=0.09124, over 21301.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3124, pruned_loss=0.08243, over 4301774.96 frames. ], batch size: 143, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:35:59,189 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-24 21:36:03,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1839096.0, ans=0.0 2023-06-24 21:36:44,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-06-24 21:37:01,955 INFO [train.py:996] (0/4) Epoch 11, batch 1600, loss[loss=0.2964, simple_loss=0.3612, pruned_loss=0.1158, over 21766.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3117, pruned_loss=0.08205, over 4289810.40 frames. ], batch size: 441, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:37:19,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1839276.0, ans=0.0 2023-06-24 21:37:22,253 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-24 21:37:46,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. 
limit=10.0 2023-06-24 21:38:23,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1839456.0, ans=0.125 2023-06-24 21:38:35,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1839456.0, ans=0.05 2023-06-24 21:38:59,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.877e+02 8.023e+02 1.190e+03 1.791e+03 3.601e+03, threshold=2.379e+03, percent-clipped=18.0 2023-06-24 21:39:01,037 INFO [train.py:996] (0/4) Epoch 11, batch 1650, loss[loss=0.2382, simple_loss=0.3087, pruned_loss=0.0838, over 21813.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3088, pruned_loss=0.08014, over 4290444.10 frames. ], batch size: 118, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:39:23,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1839636.0, ans=0.05 2023-06-24 21:40:18,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1839756.0, ans=0.2 2023-06-24 21:40:24,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1839756.0, ans=0.125 2023-06-24 21:40:50,796 INFO [train.py:996] (0/4) Epoch 11, batch 1700, loss[loss=0.2918, simple_loss=0.351, pruned_loss=0.1163, over 21449.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3116, pruned_loss=0.08061, over 4294089.98 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:41:27,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1839936.0, ans=0.0 2023-06-24 21:42:00,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0 2023-06-24 21:42:17,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1840056.0, ans=0.1 2023-06-24 21:42:47,745 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.610e+02 1.287e+03 1.983e+03 3.488e+03, threshold=2.574e+03, percent-clipped=18.0 2023-06-24 21:42:49,520 INFO [train.py:996] (0/4) Epoch 11, batch 1750, loss[loss=0.1973, simple_loss=0.2887, pruned_loss=0.05291, over 21716.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3118, pruned_loss=0.07936, over 4291585.13 frames. 
], batch size: 298, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:42:49,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1840176.0, ans=0.04949747468305833 2023-06-24 21:43:09,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1840176.0, ans=0.07 2023-06-24 21:43:40,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1840296.0, ans=0.125 2023-06-24 21:43:45,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1840296.0, ans=0.015 2023-06-24 21:43:47,035 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:44:30,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1840416.0, ans=0.125 2023-06-24 21:44:39,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1840416.0, ans=0.125 2023-06-24 21:44:41,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1840416.0, ans=0.1 2023-06-24 21:44:41,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1840416.0, ans=0.125 2023-06-24 21:44:50,790 INFO [train.py:996] (0/4) Epoch 11, batch 1800, loss[loss=0.1671, simple_loss=0.2396, pruned_loss=0.04734, over 21280.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3095, pruned_loss=0.07693, over 4293983.40 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:45:53,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1840656.0, ans=0.0 2023-06-24 21:45:58,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1840656.0, ans=0.125 2023-06-24 21:45:59,180 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5 2023-06-24 21:46:40,513 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 6.892e+02 1.019e+03 1.831e+03 4.064e+03, threshold=2.037e+03, percent-clipped=9.0 2023-06-24 21:46:48,440 INFO [train.py:996] (0/4) Epoch 11, batch 1850, loss[loss=0.2269, simple_loss=0.314, pruned_loss=0.06991, over 19987.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.309, pruned_loss=0.07434, over 4289114.11 frames. ], batch size: 702, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:47:48,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1840956.0, ans=0.0 2023-06-24 21:47:48,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1840956.0, ans=0.125 2023-06-24 21:47:50,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1840956.0, ans=0.1 2023-06-24 21:48:00,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.12 vs. 
limit=15.0 2023-06-24 21:48:32,727 INFO [train.py:996] (0/4) Epoch 11, batch 1900, loss[loss=0.2068, simple_loss=0.2711, pruned_loss=0.07129, over 21224.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3087, pruned_loss=0.07405, over 4284041.91 frames. ], batch size: 608, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:48:38,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1841076.0, ans=0.1 2023-06-24 21:48:44,763 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:49:01,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1841136.0, ans=0.0 2023-06-24 21:50:20,929 INFO [train.py:996] (0/4) Epoch 11, batch 1950, loss[loss=0.2752, simple_loss=0.3791, pruned_loss=0.08568, over 21192.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3082, pruned_loss=0.07522, over 4280773.69 frames. ], batch size: 548, lr: 2.71e-03, grad_scale: 4.0 2023-06-24 21:50:22,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 9.605e+02 1.769e+03 2.616e+03 5.034e+03, threshold=3.539e+03, percent-clipped=42.0 2023-06-24 21:51:18,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1841556.0, ans=0.5 2023-06-24 21:51:28,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1841556.0, ans=0.2 2023-06-24 21:51:37,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.94 vs. limit=10.0 2023-06-24 21:51:43,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1841616.0, ans=0.0 2023-06-24 21:52:01,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1841616.0, ans=0.2 2023-06-24 21:52:05,569 INFO [train.py:996] (0/4) Epoch 11, batch 2000, loss[loss=0.1749, simple_loss=0.245, pruned_loss=0.05242, over 21827.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3033, pruned_loss=0.07405, over 4271010.76 frames. ], batch size: 102, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:52:16,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1841676.0, ans=0.04949747468305833 2023-06-24 21:52:43,237 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:52:51,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1841796.0, ans=0.2 2023-06-24 21:53:55,942 INFO [train.py:996] (0/4) Epoch 11, batch 2050, loss[loss=0.2411, simple_loss=0.3083, pruned_loss=0.08694, over 21869.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3027, pruned_loss=0.07396, over 4274786.58 frames. 
], batch size: 118, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:53:57,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.961e+02 9.295e+02 1.430e+03 2.343e+03 5.111e+03, threshold=2.860e+03, percent-clipped=7.0 2023-06-24 21:54:06,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1841976.0, ans=0.125 2023-06-24 21:54:09,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1841976.0, ans=0.0 2023-06-24 21:54:11,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1842036.0, ans=0.2 2023-06-24 21:54:15,659 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-24 21:55:47,691 INFO [train.py:996] (0/4) Epoch 11, batch 2100, loss[loss=0.2482, simple_loss=0.3338, pruned_loss=0.08132, over 21896.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.305, pruned_loss=0.07544, over 4280237.09 frames. ], batch size: 371, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:55:55,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1842276.0, ans=0.0 2023-06-24 21:56:17,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1842336.0, ans=0.125 2023-06-24 21:56:26,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1842396.0, ans=0.125 2023-06-24 21:56:27,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.83 vs. limit=8.0 2023-06-24 21:57:01,107 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-24 21:57:38,459 INFO [train.py:996] (0/4) Epoch 11, batch 2150, loss[loss=0.2329, simple_loss=0.3089, pruned_loss=0.07844, over 15770.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3101, pruned_loss=0.07739, over 4264701.88 frames. ], batch size: 60, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:57:39,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.865e+02 8.663e+02 1.127e+03 1.659e+03 3.855e+03, threshold=2.253e+03, percent-clipped=2.0 2023-06-24 21:57:42,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-24 21:58:30,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1842696.0, ans=0.0 2023-06-24 21:58:52,195 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-24 21:58:53,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1842756.0, ans=0.125 2023-06-24 21:59:26,330 INFO [train.py:996] (0/4) Epoch 11, batch 2200, loss[loss=0.2774, simple_loss=0.3374, pruned_loss=0.1087, over 21432.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3102, pruned_loss=0.07765, over 4262181.32 frames. 
], batch size: 471, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:59:31,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1842876.0, ans=0.125 2023-06-24 22:00:03,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1842936.0, ans=0.2 2023-06-24 22:00:53,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1843056.0, ans=0.0 2023-06-24 22:01:16,626 INFO [train.py:996] (0/4) Epoch 11, batch 2250, loss[loss=0.2119, simple_loss=0.2879, pruned_loss=0.06794, over 21799.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3105, pruned_loss=0.0767, over 4259151.68 frames. ], batch size: 351, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 22:01:18,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.153e+02 9.027e+02 1.396e+03 1.956e+03 3.592e+03, threshold=2.793e+03, percent-clipped=17.0 2023-06-24 22:01:49,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-24 22:01:51,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1843236.0, ans=0.0 2023-06-24 22:03:05,563 INFO [train.py:996] (0/4) Epoch 11, batch 2300, loss[loss=0.1915, simple_loss=0.2644, pruned_loss=0.05935, over 21674.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3052, pruned_loss=0.07518, over 4252088.16 frames. ], batch size: 316, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 22:03:39,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1843536.0, ans=0.0 2023-06-24 22:04:57,300 INFO [train.py:996] (0/4) Epoch 11, batch 2350, loss[loss=0.2401, simple_loss=0.3034, pruned_loss=0.08837, over 21954.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.299, pruned_loss=0.07507, over 4252868.90 frames. ], batch size: 103, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 22:04:58,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.967e+02 8.353e+02 1.301e+03 1.765e+03 5.491e+03, threshold=2.603e+03, percent-clipped=6.0 2023-06-24 22:05:05,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1843776.0, ans=0.0 2023-06-24 22:05:20,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1843836.0, ans=0.0 2023-06-24 22:05:20,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=22.5 2023-06-24 22:05:22,299 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-24 22:06:10,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1843896.0, ans=0.09899494936611666 2023-06-24 22:06:47,716 INFO [train.py:996] (0/4) Epoch 11, batch 2400, loss[loss=0.2756, simple_loss=0.3731, pruned_loss=0.08906, over 16829.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.301, pruned_loss=0.07666, over 4258926.42 frames. 
], batch size: 60, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:07:39,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1844196.0, ans=0.07 2023-06-24 22:08:17,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1844256.0, ans=0.1 2023-06-24 22:08:27,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1844316.0, ans=0.0 2023-06-24 22:08:29,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1844316.0, ans=0.125 2023-06-24 22:08:44,097 INFO [train.py:996] (0/4) Epoch 11, batch 2450, loss[loss=0.2067, simple_loss=0.2718, pruned_loss=0.07077, over 21547.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3055, pruned_loss=0.08012, over 4264358.33 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:08:45,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.999e+02 9.036e+02 1.390e+03 1.907e+03 3.347e+03, threshold=2.779e+03, percent-clipped=7.0 2023-06-24 22:09:07,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1844436.0, ans=0.0 2023-06-24 22:09:12,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1844436.0, ans=0.0 2023-06-24 22:09:16,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1844436.0, ans=0.125 2023-06-24 22:09:16,707 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-24 22:10:07,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1844616.0, ans=0.125 2023-06-24 22:10:24,546 INFO [train.py:996] (0/4) Epoch 11, batch 2500, loss[loss=0.2277, simple_loss=0.3024, pruned_loss=0.07653, over 21783.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3043, pruned_loss=0.08, over 4274628.29 frames. ], batch size: 112, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:11:22,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1844796.0, ans=0.0 2023-06-24 22:12:02,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1844916.0, ans=0.1 2023-06-24 22:12:09,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1844916.0, ans=0.125 2023-06-24 22:12:18,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1844916.0, ans=0.04949747468305833 2023-06-24 22:12:20,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-24 22:12:21,390 INFO [train.py:996] (0/4) Epoch 11, batch 2550, loss[loss=0.2795, simple_loss=0.3988, pruned_loss=0.08015, over 19702.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3043, pruned_loss=0.07897, over 4268211.58 frames. 
], batch size: 702, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:12:22,887 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.374e+02 8.761e+02 1.237e+03 1.691e+03 3.223e+03, threshold=2.475e+03, percent-clipped=6.0 2023-06-24 22:12:24,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0 2023-06-24 22:12:29,182 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-24 22:13:37,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=1845156.0, ans=12.0 2023-06-24 22:13:39,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-24 22:13:52,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-06-24 22:14:00,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1845216.0, ans=0.125 2023-06-24 22:14:11,179 INFO [train.py:996] (0/4) Epoch 11, batch 2600, loss[loss=0.2642, simple_loss=0.3387, pruned_loss=0.09487, over 21486.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3052, pruned_loss=0.07995, over 4269504.23 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:15:11,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1845396.0, ans=0.125 2023-06-24 22:15:25,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1845456.0, ans=0.125 2023-06-24 22:15:45,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.56 vs. limit=15.0 2023-06-24 22:15:55,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1845516.0, ans=0.035 2023-06-24 22:16:00,066 INFO [train.py:996] (0/4) Epoch 11, batch 2650, loss[loss=0.2659, simple_loss=0.3275, pruned_loss=0.1021, over 21614.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3067, pruned_loss=0.08177, over 4274380.79 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:16:01,632 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.364e+02 1.066e+03 1.667e+03 2.223e+03 5.089e+03, threshold=3.334e+03, percent-clipped=18.0 2023-06-24 22:16:04,636 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=12.0 2023-06-24 22:16:28,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1845636.0, ans=0.125 2023-06-24 22:16:59,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1845696.0, ans=0.1 2023-06-24 22:17:46,131 INFO [train.py:996] (0/4) Epoch 11, batch 2700, loss[loss=0.2065, simple_loss=0.2851, pruned_loss=0.06397, over 21787.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3047, pruned_loss=0.08124, over 4280752.51 frames. 
], batch size: 316, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:18:43,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1845996.0, ans=0.015 2023-06-24 22:19:06,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1846056.0, ans=0.1 2023-06-24 22:19:36,537 INFO [train.py:996] (0/4) Epoch 11, batch 2750, loss[loss=0.2678, simple_loss=0.3341, pruned_loss=0.1008, over 21849.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3051, pruned_loss=0.08137, over 4281230.82 frames. ], batch size: 112, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:19:38,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.314e+02 7.487e+02 1.146e+03 1.660e+03 3.901e+03, threshold=2.292e+03, percent-clipped=2.0 2023-06-24 22:19:45,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1846176.0, ans=0.125 2023-06-24 22:20:10,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1846236.0, ans=0.125 2023-06-24 22:20:21,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1846296.0, ans=0.125 2023-06-24 22:20:23,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=1846296.0, ans=12.0 2023-06-24 22:20:24,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1846296.0, ans=0.2 2023-06-24 22:20:30,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1846296.0, ans=0.2 2023-06-24 22:21:19,871 INFO [train.py:996] (0/4) Epoch 11, batch 2800, loss[loss=0.2312, simple_loss=0.361, pruned_loss=0.05069, over 19725.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3103, pruned_loss=0.08261, over 4278463.10 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:21:46,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1846476.0, ans=0.125 2023-06-24 22:22:39,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1846656.0, ans=0.1 2023-06-24 22:22:39,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1846656.0, ans=0.1 2023-06-24 22:22:40,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1846656.0, ans=0.125 2023-06-24 22:22:42,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1846656.0, ans=0.125 2023-06-24 22:23:10,964 INFO [train.py:996] (0/4) Epoch 11, batch 2850, loss[loss=0.2066, simple_loss=0.2893, pruned_loss=0.06196, over 21852.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3101, pruned_loss=0.08281, over 4282365.28 frames. 
], batch size: 372, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:23:19,723 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.356e+02 9.385e+02 1.588e+03 2.448e+03 5.122e+03, threshold=3.175e+03, percent-clipped=28.0 2023-06-24 22:23:32,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-24 22:23:33,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1846776.0, ans=0.125 2023-06-24 22:23:58,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1846896.0, ans=0.0 2023-06-24 22:24:03,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1846896.0, ans=0.2 2023-06-24 22:24:59,832 INFO [train.py:996] (0/4) Epoch 11, batch 2900, loss[loss=0.249, simple_loss=0.3242, pruned_loss=0.08687, over 21895.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3104, pruned_loss=0.08254, over 4289208.71 frames. ], batch size: 124, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:25:05,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1847076.0, ans=0.125 2023-06-24 22:25:41,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1847136.0, ans=0.2 2023-06-24 22:26:03,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1847256.0, ans=0.125 2023-06-24 22:26:48,173 INFO [train.py:996] (0/4) Epoch 11, batch 2950, loss[loss=0.2298, simple_loss=0.3071, pruned_loss=0.0763, over 21148.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3123, pruned_loss=0.08334, over 4292587.91 frames. ], batch size: 143, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:26:51,477 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.380e+02 7.849e+02 1.003e+03 1.596e+03 3.041e+03, threshold=2.006e+03, percent-clipped=1.0 2023-06-24 22:27:25,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1847436.0, ans=0.0 2023-06-24 22:27:34,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-24 22:27:43,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1847496.0, ans=0.0 2023-06-24 22:28:15,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1847616.0, ans=0.1 2023-06-24 22:28:39,736 INFO [train.py:996] (0/4) Epoch 11, batch 3000, loss[loss=0.2752, simple_loss=0.3497, pruned_loss=0.1003, over 21505.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3155, pruned_loss=0.08222, over 4293188.98 frames. ], batch size: 131, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:28:39,737 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 22:29:02,933 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2533, simple_loss=0.3467, pruned_loss=0.07995, over 1796401.00 frames. 
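The loss, simple_loss and pruned_loss fields in the batch and validation entries are the components of the pruned-transducer objective, and the tot_loss figures appear to be averaged over a window of recent batches. A minimal sketch of how the combined figure could be formed is below, assuming a fixed weighting between the two terms; the weighting used by train.py is scheduled over the course of training, so the 0.5 value and the function name here are illustrative assumptions only.

def combined_loss(simple_loss: float, pruned_loss: float, simple_scale: float = 0.5) -> float:
    # Illustrative sketch: weight the simple transducer loss against the pruned
    # transducer loss. The 0.5 weight is an assumed value, not the exact schedule
    # the training script uses at this point in training.
    return simple_scale * simple_loss + (1.0 - simple_scale) * pruned_loss

# Example with arbitrary values in the same range as the entries above:
print(combined_loss(simple_loss=0.34, pruned_loss=0.08))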
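The numerous ScheduledFloat entries (name=..., batch_count=..., ans=...) report the current value of a hyperparameter that is scheduled as a function of batch_count. A hypothetical piecewise-linear schedule with that behaviour is sketched below; the real ScheduledFloat in scaling.py is more general, so treat this purely as an illustration of how batch_count maps to the logged ans value.

def scheduled_value(batch_count: float, points) -> float:
    # points: [(batch_count, value), ...] sorted by batch_count.
    # Linear interpolation between breakpoints, flat outside the range.
    # Hypothetical re-implementation for illustration; not scaling.py itself.
    if batch_count <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if batch_count <= x1:
            frac = (batch_count - x0) / (x1 - x0)
            return y0 + frac * (y1 - y0)
    return points[-1][1]

# Example: a dropout probability decaying from 0.3 at batch 0 to 0.1 by batch 20000;
# far past the last breakpoint (as in the entries above) the value is simply 0.1.
print(scheduled_value(1847076, [(0, 0.3), (20000, 0.1)]))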
2023-06-24 22:29:02,933 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-24 22:29:03,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1847676.0, ans=0.0 2023-06-24 22:29:06,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-24 22:29:40,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.45 vs. limit=6.0 2023-06-24 22:30:50,715 INFO [train.py:996] (0/4) Epoch 11, batch 3050, loss[loss=0.2066, simple_loss=0.2867, pruned_loss=0.06322, over 21770.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3161, pruned_loss=0.08075, over 4290762.74 frames. ], batch size: 247, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:30:56,009 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 9.249e+02 1.451e+03 2.091e+03 4.098e+03, threshold=2.902e+03, percent-clipped=32.0 2023-06-24 22:30:56,307 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-308000.pt 2023-06-24 22:31:12,323 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=12.0 2023-06-24 22:32:35,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1848216.0, ans=0.0 2023-06-24 22:32:38,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1848276.0, ans=0.125 2023-06-24 22:32:40,009 INFO [train.py:996] (0/4) Epoch 11, batch 3100, loss[loss=0.2178, simple_loss=0.3077, pruned_loss=0.06397, over 21809.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3165, pruned_loss=0.08007, over 4295198.45 frames. ], batch size: 298, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:32:44,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1848276.0, ans=0.125 2023-06-24 22:32:49,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1848276.0, ans=0.1 2023-06-24 22:33:09,733 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.30 vs. limit=12.0 2023-06-24 22:33:21,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1848396.0, ans=0.2 2023-06-24 22:34:30,843 INFO [train.py:996] (0/4) Epoch 11, batch 3150, loss[loss=0.245, simple_loss=0.3264, pruned_loss=0.08182, over 21757.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3171, pruned_loss=0.07998, over 4299177.50 frames. 
], batch size: 332, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:34:31,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1848576.0, ans=0.0 2023-06-24 22:34:41,497 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.651e+02 8.004e+02 1.417e+03 1.894e+03 2.816e+03, threshold=2.834e+03, percent-clipped=0.0 2023-06-24 22:35:00,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1848636.0, ans=0.125 2023-06-24 22:35:02,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1848636.0, ans=0.2 2023-06-24 22:35:32,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1848696.0, ans=0.1 2023-06-24 22:35:34,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1848696.0, ans=0.125 2023-06-24 22:35:55,081 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=12.0 2023-06-24 22:36:21,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1848816.0, ans=0.125 2023-06-24 22:36:27,240 INFO [train.py:996] (0/4) Epoch 11, batch 3200, loss[loss=0.2017, simple_loss=0.2895, pruned_loss=0.05698, over 21453.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3191, pruned_loss=0.08026, over 4295050.92 frames. ], batch size: 194, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:36:31,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1848876.0, ans=0.1 2023-06-24 22:37:44,385 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-24 22:37:48,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1849056.0, ans=0.125 2023-06-24 22:37:57,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1849116.0, ans=0.125 2023-06-24 22:38:15,097 INFO [train.py:996] (0/4) Epoch 11, batch 3250, loss[loss=0.2381, simple_loss=0.3138, pruned_loss=0.08122, over 21622.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3203, pruned_loss=0.08144, over 4290876.62 frames. 
], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:38:20,076 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.217e+02 9.282e+02 1.304e+03 1.953e+03 5.530e+03, threshold=2.608e+03, percent-clipped=11.0 2023-06-24 22:38:34,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1849176.0, ans=0.0 2023-06-24 22:38:50,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1849236.0, ans=0.2 2023-06-24 22:38:50,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1849236.0, ans=0.2 2023-06-24 22:39:22,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1849296.0, ans=0.025 2023-06-24 22:39:57,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1849416.0, ans=0.125 2023-06-24 22:40:04,431 INFO [train.py:996] (0/4) Epoch 11, batch 3300, loss[loss=0.3002, simple_loss=0.357, pruned_loss=0.1216, over 21378.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3159, pruned_loss=0.08171, over 4282493.24 frames. ], batch size: 507, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:40:57,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1849596.0, ans=0.0 2023-06-24 22:41:09,710 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-24 22:41:54,791 INFO [train.py:996] (0/4) Epoch 11, batch 3350, loss[loss=0.3327, simple_loss=0.3951, pruned_loss=0.1352, over 21424.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.318, pruned_loss=0.08163, over 4279828.41 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:42:01,411 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.009e+02 8.020e+02 1.184e+03 1.979e+03 5.260e+03, threshold=2.368e+03, percent-clipped=15.0 2023-06-24 22:43:07,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1849896.0, ans=0.0 2023-06-24 22:43:23,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1849956.0, ans=0.1 2023-06-24 22:43:27,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=15.0 2023-06-24 22:43:50,819 INFO [train.py:996] (0/4) Epoch 11, batch 3400, loss[loss=0.2131, simple_loss=0.2938, pruned_loss=0.06618, over 21862.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3174, pruned_loss=0.08198, over 4281147.33 frames. ], batch size: 372, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:44:59,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1850256.0, ans=0.0 2023-06-24 22:45:40,247 INFO [train.py:996] (0/4) Epoch 11, batch 3450, loss[loss=0.2457, simple_loss=0.2924, pruned_loss=0.09948, over 21488.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3141, pruned_loss=0.08201, over 4281562.66 frames. 
], batch size: 510, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:45:52,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.862e+02 7.621e+02 1.155e+03 1.643e+03 3.444e+03, threshold=2.310e+03, percent-clipped=7.0 2023-06-24 22:45:58,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1850376.0, ans=0.125 2023-06-24 22:46:06,453 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:46:15,724 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5 2023-06-24 22:46:38,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.58 vs. limit=10.0 2023-06-24 22:46:42,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1850496.0, ans=0.125 2023-06-24 22:47:35,899 INFO [train.py:996] (0/4) Epoch 11, batch 3500, loss[loss=0.2095, simple_loss=0.2853, pruned_loss=0.06682, over 21798.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3203, pruned_loss=0.08475, over 4278858.88 frames. ], batch size: 107, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:47:59,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1850736.0, ans=0.5 2023-06-24 22:48:19,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1850736.0, ans=0.125 2023-06-24 22:48:25,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1850796.0, ans=0.125 2023-06-24 22:48:28,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1850796.0, ans=0.0 2023-06-24 22:48:41,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1850856.0, ans=0.1 2023-06-24 22:49:01,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1850916.0, ans=0.0 2023-06-24 22:49:32,550 INFO [train.py:996] (0/4) Epoch 11, batch 3550, loss[loss=0.2303, simple_loss=0.3008, pruned_loss=0.07984, over 21450.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3217, pruned_loss=0.08559, over 4281091.46 frames. ], batch size: 389, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:49:39,363 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.955e+02 9.872e+02 1.548e+03 2.414e+03 6.693e+03, threshold=3.097e+03, percent-clipped=26.0 2023-06-24 22:50:01,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1851036.0, ans=0.125 2023-06-24 22:50:19,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1851096.0, ans=0.125 2023-06-24 22:50:23,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.24 vs. 
limit=15.0 2023-06-24 22:50:35,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1851156.0, ans=0.1 2023-06-24 22:50:51,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1851216.0, ans=0.0 2023-06-24 22:51:22,534 INFO [train.py:996] (0/4) Epoch 11, batch 3600, loss[loss=0.1905, simple_loss=0.2556, pruned_loss=0.06275, over 21672.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3161, pruned_loss=0.08499, over 4283689.75 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:51:42,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=22.5 2023-06-24 22:51:44,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1851336.0, ans=0.125 2023-06-24 22:53:14,653 INFO [train.py:996] (0/4) Epoch 11, batch 3650, loss[loss=0.2256, simple_loss=0.3156, pruned_loss=0.06786, over 20803.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3168, pruned_loss=0.0853, over 4279403.91 frames. ], batch size: 608, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:53:21,488 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.826e+02 7.978e+02 1.076e+03 1.568e+03 3.181e+03, threshold=2.152e+03, percent-clipped=1.0 2023-06-24 22:54:20,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1851756.0, ans=0.1 2023-06-24 22:54:53,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1851816.0, ans=0.125 2023-06-24 22:55:01,501 INFO [train.py:996] (0/4) Epoch 11, batch 3700, loss[loss=0.2202, simple_loss=0.3024, pruned_loss=0.06899, over 21856.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3149, pruned_loss=0.08437, over 4277543.62 frames. ], batch size: 298, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:55:54,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1851996.0, ans=0.0 2023-06-24 22:55:58,885 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5 2023-06-24 22:56:40,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1852116.0, ans=0.125 2023-06-24 22:56:55,550 INFO [train.py:996] (0/4) Epoch 11, batch 3750, loss[loss=0.1901, simple_loss=0.2659, pruned_loss=0.05715, over 21636.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3132, pruned_loss=0.0838, over 4287163.55 frames. 
], batch size: 230, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:57:02,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.268e+02 7.373e+02 1.096e+03 1.771e+03 3.259e+03, threshold=2.192e+03, percent-clipped=16.0 2023-06-24 22:57:29,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1852296.0, ans=0.0 2023-06-24 22:57:39,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1852296.0, ans=0.0 2023-06-24 22:58:17,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1852356.0, ans=0.125 2023-06-24 22:58:42,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.47 vs. limit=12.0 2023-06-24 22:58:44,915 INFO [train.py:996] (0/4) Epoch 11, batch 3800, loss[loss=0.2715, simple_loss=0.3453, pruned_loss=0.09889, over 21816.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.31, pruned_loss=0.08186, over 4289971.86 frames. ], batch size: 441, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:58:57,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1852476.0, ans=0.125 2023-06-24 22:58:59,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-24 22:59:04,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.42 vs. limit=15.0 2023-06-24 22:59:31,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1852596.0, ans=0.0 2023-06-24 23:00:25,198 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-24 23:00:32,138 INFO [train.py:996] (0/4) Epoch 11, batch 3850, loss[loss=0.1992, simple_loss=0.2869, pruned_loss=0.05574, over 20103.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3078, pruned_loss=0.08179, over 4278084.35 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:00:39,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.734e+02 8.310e+02 1.331e+03 1.906e+03 3.711e+03, threshold=2.662e+03, percent-clipped=19.0 2023-06-24 23:00:40,422 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=22.5 2023-06-24 23:01:25,250 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-24 23:01:26,491 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-24 23:01:37,328 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. 
limit=15.0 2023-06-24 23:01:45,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1852956.0, ans=0.0 2023-06-24 23:02:19,614 INFO [train.py:996] (0/4) Epoch 11, batch 3900, loss[loss=0.2173, simple_loss=0.2793, pruned_loss=0.07768, over 21283.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3026, pruned_loss=0.08117, over 4279306.26 frames. ], batch size: 176, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:02:42,529 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-24 23:02:49,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1853136.0, ans=0.2 2023-06-24 23:03:30,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1853256.0, ans=0.125 2023-06-24 23:03:48,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1853316.0, ans=0.125 2023-06-24 23:04:11,762 INFO [train.py:996] (0/4) Epoch 11, batch 3950, loss[loss=0.2206, simple_loss=0.3137, pruned_loss=0.06372, over 21679.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3063, pruned_loss=0.08056, over 4278678.94 frames. ], batch size: 414, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:04:18,256 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.637e+02 6.470e+02 9.111e+02 1.353e+03 4.725e+03, threshold=1.822e+03, percent-clipped=4.0 2023-06-24 23:04:47,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1853436.0, ans=0.125 2023-06-24 23:04:56,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1853496.0, ans=0.2 2023-06-24 23:05:53,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1853616.0, ans=0.0 2023-06-24 23:06:01,592 INFO [train.py:996] (0/4) Epoch 11, batch 4000, loss[loss=0.1938, simple_loss=0.2578, pruned_loss=0.06496, over 21199.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3004, pruned_loss=0.07717, over 4273880.04 frames. 
], batch size: 144, lr: 2.70e-03, grad_scale: 32.0 2023-06-24 23:06:05,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1853676.0, ans=0.1 2023-06-24 23:06:14,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1853676.0, ans=0.125 2023-06-24 23:06:35,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1853736.0, ans=0.125 2023-06-24 23:06:41,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1853796.0, ans=0.0 2023-06-24 23:06:59,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1853796.0, ans=0.125 2023-06-24 23:07:05,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1853856.0, ans=0.1 2023-06-24 23:07:26,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-24 23:07:49,291 INFO [train.py:996] (0/4) Epoch 11, batch 4050, loss[loss=0.233, simple_loss=0.3296, pruned_loss=0.06825, over 21620.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3013, pruned_loss=0.07607, over 4270177.77 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:07:56,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1853976.0, ans=0.0 2023-06-24 23:07:57,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.864e+02 8.238e+02 1.474e+03 2.566e+03 6.233e+03, threshold=2.948e+03, percent-clipped=38.0 2023-06-24 23:07:59,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1853976.0, ans=0.0 2023-06-24 23:08:15,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1854036.0, ans=0.125 2023-06-24 23:08:47,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=15.0 2023-06-24 23:09:15,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1854156.0, ans=0.05 2023-06-24 23:09:37,376 INFO [train.py:996] (0/4) Epoch 11, batch 4100, loss[loss=0.2635, simple_loss=0.3254, pruned_loss=0.1008, over 21828.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3022, pruned_loss=0.07702, over 4274785.96 frames. ], batch size: 441, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:09:52,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1854276.0, ans=0.0 2023-06-24 23:10:54,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1854456.0, ans=0.0 2023-06-24 23:11:25,694 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-24 23:11:27,846 INFO [train.py:996] (0/4) Epoch 11, batch 4150, loss[loss=0.2583, simple_loss=0.3234, pruned_loss=0.09663, over 20004.00 frames. 
], tot_loss[loss=0.2262, simple_loss=0.3033, pruned_loss=0.07457, over 4271293.36 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:11:41,361 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:11:44,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 6.391e+02 9.658e+02 1.367e+03 3.515e+03, threshold=1.932e+03, percent-clipped=2.0 2023-06-24 23:12:13,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1854636.0, ans=0.07 2023-06-24 23:12:15,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1854696.0, ans=0.125 2023-06-24 23:12:41,128 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:12:47,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1854756.0, ans=0.0 2023-06-24 23:13:27,399 INFO [train.py:996] (0/4) Epoch 11, batch 4200, loss[loss=0.2088, simple_loss=0.3091, pruned_loss=0.05425, over 19800.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3025, pruned_loss=0.07423, over 4268402.99 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:14:30,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1854996.0, ans=0.125 2023-06-24 23:15:24,080 INFO [train.py:996] (0/4) Epoch 11, batch 4250, loss[loss=0.3083, simple_loss=0.3786, pruned_loss=0.1191, over 21763.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.308, pruned_loss=0.07566, over 4266772.59 frames. ], batch size: 118, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:15:32,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.310e+02 8.743e+02 1.334e+03 2.102e+03 4.812e+03, threshold=2.669e+03, percent-clipped=26.0 2023-06-24 23:17:15,307 INFO [train.py:996] (0/4) Epoch 11, batch 4300, loss[loss=0.2475, simple_loss=0.3377, pruned_loss=0.07866, over 21651.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3128, pruned_loss=0.07739, over 4266704.14 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:17:52,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-24 23:18:27,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1855656.0, ans=0.125 2023-06-24 23:18:32,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1855656.0, ans=0.025 2023-06-24 23:19:05,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1855716.0, ans=0.2 2023-06-24 23:19:05,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1855716.0, ans=0.05 2023-06-24 23:19:09,589 INFO [train.py:996] (0/4) Epoch 11, batch 4350, loss[loss=0.2227, simple_loss=0.3366, pruned_loss=0.05442, over 21659.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.313, pruned_loss=0.07712, over 4269626.42 frames. 
], batch size: 414, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 23:19:25,424 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.637e+02 8.181e+02 1.022e+03 1.627e+03 5.028e+03, threshold=2.045e+03, percent-clipped=6.0 2023-06-24 23:19:47,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1855896.0, ans=0.1 2023-06-24 23:20:07,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1855896.0, ans=0.0 2023-06-24 23:20:49,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1856016.0, ans=0.0 2023-06-24 23:20:50,273 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.54 vs. limit=15.0 2023-06-24 23:21:06,800 INFO [train.py:996] (0/4) Epoch 11, batch 4400, loss[loss=0.2246, simple_loss=0.2876, pruned_loss=0.08074, over 21201.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3081, pruned_loss=0.07706, over 4267029.01 frames. ], batch size: 608, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:21:20,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1856076.0, ans=0.1 2023-06-24 23:21:21,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1856076.0, ans=0.2 2023-06-24 23:21:58,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=15.0 2023-06-24 23:22:13,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.61 vs. limit=10.0 2023-06-24 23:22:57,262 INFO [train.py:996] (0/4) Epoch 11, batch 4450, loss[loss=0.3116, simple_loss=0.397, pruned_loss=0.1131, over 21678.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3158, pruned_loss=0.07829, over 4268502.03 frames. ], batch size: 389, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:23:07,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.276e+02 9.671e+02 1.476e+03 2.549e+03 6.148e+03, threshold=2.952e+03, percent-clipped=35.0 2023-06-24 23:23:12,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-24 23:23:17,997 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.13 vs. limit=15.0 2023-06-24 23:24:47,361 INFO [train.py:996] (0/4) Epoch 11, batch 4500, loss[loss=0.2201, simple_loss=0.294, pruned_loss=0.07308, over 21764.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3191, pruned_loss=0.07993, over 4275923.95 frames. ], batch size: 247, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:25:35,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1856796.0, ans=0.035 2023-06-24 23:25:37,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1856796.0, ans=0.125 2023-06-24 23:25:42,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.23 vs. 
limit=15.0 2023-06-24 23:26:13,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-24 23:26:34,496 INFO [train.py:996] (0/4) Epoch 11, batch 4550, loss[loss=0.2877, simple_loss=0.3607, pruned_loss=0.1073, over 21216.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3219, pruned_loss=0.08024, over 4280906.42 frames. ], batch size: 159, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:26:44,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.220e+02 1.037e+03 1.526e+03 2.248e+03 5.276e+03, threshold=3.053e+03, percent-clipped=11.0 2023-06-24 23:27:43,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1857156.0, ans=0.1 2023-06-24 23:28:23,427 INFO [train.py:996] (0/4) Epoch 11, batch 4600, loss[loss=0.2109, simple_loss=0.2957, pruned_loss=0.06301, over 21868.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3251, pruned_loss=0.08274, over 4281861.65 frames. ], batch size: 371, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:28:51,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1857336.0, ans=0.1 2023-06-24 23:29:07,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-24 23:29:39,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1857456.0, ans=0.0 2023-06-24 23:30:04,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1857516.0, ans=0.125 2023-06-24 23:30:12,166 INFO [train.py:996] (0/4) Epoch 11, batch 4650, loss[loss=0.19, simple_loss=0.2688, pruned_loss=0.05559, over 21810.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3191, pruned_loss=0.08107, over 4281913.60 frames. ], batch size: 351, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:30:29,046 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 7.698e+02 1.029e+03 1.673e+03 3.855e+03, threshold=2.058e+03, percent-clipped=3.0 2023-06-24 23:30:38,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1857636.0, ans=0.05 2023-06-24 23:31:47,690 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-24 23:32:07,109 INFO [train.py:996] (0/4) Epoch 11, batch 4700, loss[loss=0.1981, simple_loss=0.2682, pruned_loss=0.064, over 21769.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.311, pruned_loss=0.07934, over 4282958.15 frames. 
], batch size: 351, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:33:01,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1857996.0, ans=0.0 2023-06-24 23:33:04,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1857996.0, ans=0.02 2023-06-24 23:33:32,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1858116.0, ans=0.125 2023-06-24 23:33:33,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1858116.0, ans=0.125 2023-06-24 23:33:44,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1858116.0, ans=0.04949747468305833 2023-06-24 23:33:47,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1858176.0, ans=0.2 2023-06-24 23:33:48,531 INFO [train.py:996] (0/4) Epoch 11, batch 4750, loss[loss=0.2718, simple_loss=0.3262, pruned_loss=0.1087, over 21499.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3095, pruned_loss=0.08009, over 4277096.32 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:33:49,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1858176.0, ans=0.125 2023-06-24 23:34:05,961 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.601e+02 8.320e+02 1.239e+03 2.079e+03 4.364e+03, threshold=2.479e+03, percent-clipped=25.0 2023-06-24 23:34:57,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1858296.0, ans=0.125 2023-06-24 23:35:04,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1858356.0, ans=0.035 2023-06-24 23:35:42,521 INFO [train.py:996] (0/4) Epoch 11, batch 4800, loss[loss=0.2325, simple_loss=0.3031, pruned_loss=0.08091, over 21843.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3093, pruned_loss=0.08041, over 4284879.63 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 32.0 2023-06-24 23:36:46,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1858596.0, ans=0.125 2023-06-24 23:36:48,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1858596.0, ans=0.1 2023-06-24 23:36:49,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1858596.0, ans=0.5 2023-06-24 23:37:23,169 INFO [train.py:996] (0/4) Epoch 11, batch 4850, loss[loss=0.196, simple_loss=0.2686, pruned_loss=0.06169, over 21724.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3072, pruned_loss=0.08001, over 4282644.78 frames. 
], batch size: 247, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:37:35,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1858776.0, ans=0.125 2023-06-24 23:37:41,870 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.322e+02 1.130e+03 1.666e+03 2.337e+03 4.462e+03, threshold=3.333e+03, percent-clipped=23.0 2023-06-24 23:38:06,123 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.90 vs. limit=22.5 2023-06-24 23:39:00,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1859016.0, ans=0.1 2023-06-24 23:39:15,327 INFO [train.py:996] (0/4) Epoch 11, batch 4900, loss[loss=0.2506, simple_loss=0.3189, pruned_loss=0.09113, over 21447.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3069, pruned_loss=0.07989, over 4283085.43 frames. ], batch size: 211, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:39:32,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1859076.0, ans=0.125 2023-06-24 23:39:37,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1859136.0, ans=0.125 2023-06-24 23:39:42,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1859136.0, ans=0.1 2023-06-24 23:41:05,228 INFO [train.py:996] (0/4) Epoch 11, batch 4950, loss[loss=0.1958, simple_loss=0.289, pruned_loss=0.05124, over 21407.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3107, pruned_loss=0.07876, over 4279984.39 frames. ], batch size: 211, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:41:07,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1859376.0, ans=0.0 2023-06-24 23:41:23,342 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.980e+02 1.108e+03 1.676e+03 3.345e+03, threshold=2.216e+03, percent-clipped=1.0 2023-06-24 23:42:31,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1859616.0, ans=0.2 2023-06-24 23:42:53,681 INFO [train.py:996] (0/4) Epoch 11, batch 5000, loss[loss=0.2698, simple_loss=0.3378, pruned_loss=0.1009, over 21799.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3091, pruned_loss=0.07574, over 4280689.57 frames. ], batch size: 112, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:43:08,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1859676.0, ans=0.1 2023-06-24 23:44:13,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1859856.0, ans=0.2 2023-06-24 23:44:39,758 INFO [train.py:996] (0/4) Epoch 11, batch 5050, loss[loss=0.208, simple_loss=0.281, pruned_loss=0.06755, over 21637.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3095, pruned_loss=0.07729, over 4278754.12 frames. 
], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:44:47,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1859976.0, ans=10.0 2023-06-24 23:44:48,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1859976.0, ans=0.125 2023-06-24 23:44:57,994 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 7.469e+02 1.066e+03 1.616e+03 3.471e+03, threshold=2.133e+03, percent-clipped=8.0 2023-06-24 23:46:14,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1860216.0, ans=0.125 2023-06-24 23:46:14,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1860216.0, ans=0.0 2023-06-24 23:46:26,276 INFO [train.py:996] (0/4) Epoch 11, batch 5100, loss[loss=0.2186, simple_loss=0.2887, pruned_loss=0.07424, over 21793.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3079, pruned_loss=0.07768, over 4289448.50 frames. ], batch size: 247, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:47:04,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1860336.0, ans=0.2 2023-06-24 23:47:05,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1860336.0, ans=0.1 2023-06-24 23:47:19,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1860396.0, ans=0.035 2023-06-24 23:47:28,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. limit=10.0 2023-06-24 23:47:36,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1860456.0, ans=0.0 2023-06-24 23:47:58,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1860516.0, ans=0.05 2023-06-24 23:48:21,829 INFO [train.py:996] (0/4) Epoch 11, batch 5150, loss[loss=0.2398, simple_loss=0.2994, pruned_loss=0.09011, over 21345.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3044, pruned_loss=0.07748, over 4290622.44 frames. ], batch size: 144, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:48:24,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1860576.0, ans=0.125 2023-06-24 23:48:24,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1860576.0, ans=0.125 2023-06-24 23:48:25,942 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:48:26,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.81 vs. 
limit=15.0 2023-06-24 23:48:34,360 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.000e+02 7.764e+02 1.031e+03 1.609e+03 3.475e+03, threshold=2.061e+03, percent-clipped=12.0 2023-06-24 23:50:02,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1860816.0, ans=0.125 2023-06-24 23:50:11,786 INFO [train.py:996] (0/4) Epoch 11, batch 5200, loss[loss=0.2238, simple_loss=0.2996, pruned_loss=0.07403, over 21027.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3055, pruned_loss=0.07785, over 4287060.41 frames. ], batch size: 607, lr: 2.69e-03, grad_scale: 32.0 2023-06-24 23:50:53,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1860936.0, ans=0.1 2023-06-24 23:51:15,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1861056.0, ans=0.025 2023-06-24 23:51:19,076 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=15.0 2023-06-24 23:51:41,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1861116.0, ans=0.125 2023-06-24 23:51:58,034 INFO [train.py:996] (0/4) Epoch 11, batch 5250, loss[loss=0.2098, simple_loss=0.3034, pruned_loss=0.05812, over 21634.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3095, pruned_loss=0.0766, over 4290402.32 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 16.0 2023-06-24 23:52:01,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1861176.0, ans=0.125 2023-06-24 23:52:18,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.878e+02 9.518e+02 1.553e+03 2.129e+03 4.596e+03, threshold=3.106e+03, percent-clipped=26.0 2023-06-24 23:52:56,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1861296.0, ans=0.1 2023-06-24 23:53:09,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1861356.0, ans=0.125 2023-06-24 23:53:38,144 INFO [train.py:996] (0/4) Epoch 11, batch 5300, loss[loss=0.2279, simple_loss=0.2966, pruned_loss=0.07965, over 21336.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3073, pruned_loss=0.07673, over 4287826.37 frames. ], batch size: 144, lr: 2.69e-03, grad_scale: 16.0 2023-06-24 23:54:50,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1861656.0, ans=0.125 2023-06-24 23:55:22,270 INFO [train.py:996] (0/4) Epoch 11, batch 5350, loss[loss=0.2203, simple_loss=0.2953, pruned_loss=0.07267, over 21952.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3077, pruned_loss=0.07871, over 4287250.87 frames. 
], batch size: 113, lr: 2.69e-03, grad_scale: 16.0 2023-06-24 23:55:31,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1861776.0, ans=0.1 2023-06-24 23:55:35,085 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.646e+02 7.558e+02 1.125e+03 1.569e+03 2.899e+03, threshold=2.250e+03, percent-clipped=0.0 2023-06-24 23:56:35,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.49 vs. limit=15.0 2023-06-24 23:57:01,687 INFO [train.py:996] (0/4) Epoch 11, batch 5400, loss[loss=0.2386, simple_loss=0.3028, pruned_loss=0.08715, over 21580.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3065, pruned_loss=0.07979, over 4290801.46 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 8.0 2023-06-24 23:57:27,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1862076.0, ans=0.125 2023-06-24 23:57:46,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1862196.0, ans=0.5 2023-06-24 23:58:02,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1862256.0, ans=0.0 2023-06-24 23:58:21,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1862256.0, ans=0.1 2023-06-24 23:58:31,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1862316.0, ans=0.2 2023-06-24 23:58:48,766 INFO [train.py:996] (0/4) Epoch 11, batch 5450, loss[loss=0.2749, simple_loss=0.3685, pruned_loss=0.09067, over 21666.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3076, pruned_loss=0.07818, over 4295507.39 frames. ], batch size: 414, lr: 2.69e-03, grad_scale: 8.0 2023-06-24 23:59:10,607 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.863e+02 8.635e+02 1.460e+03 2.379e+03 5.903e+03, threshold=2.920e+03, percent-clipped=27.0 2023-06-25 00:00:45,473 INFO [train.py:996] (0/4) Epoch 11, batch 5500, loss[loss=0.2478, simple_loss=0.349, pruned_loss=0.07331, over 21288.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3121, pruned_loss=0.07479, over 4290040.28 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:01:38,528 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2023-06-25 00:01:47,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-25 00:02:33,302 INFO [train.py:996] (0/4) Epoch 11, batch 5550, loss[loss=0.2252, simple_loss=0.3214, pruned_loss=0.06448, over 21629.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3123, pruned_loss=0.07203, over 4288689.82 frames. 
], batch size: 414, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:02:48,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.684e+02 8.321e+02 1.311e+03 1.956e+03 3.720e+03, threshold=2.623e+03, percent-clipped=7.0 2023-06-25 00:03:00,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1863036.0, ans=0.125 2023-06-25 00:03:14,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1863036.0, ans=0.125 2023-06-25 00:03:18,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1863096.0, ans=0.0 2023-06-25 00:03:20,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1863096.0, ans=0.125 2023-06-25 00:04:20,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1863276.0, ans=0.125 2023-06-25 00:04:21,212 INFO [train.py:996] (0/4) Epoch 11, batch 5600, loss[loss=0.2171, simple_loss=0.3229, pruned_loss=0.05565, over 21169.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3112, pruned_loss=0.06922, over 4280580.18 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:04:56,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1863336.0, ans=0.125 2023-06-25 00:05:16,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1863396.0, ans=0.125 2023-06-25 00:05:38,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1863456.0, ans=0.125 2023-06-25 00:05:51,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1863516.0, ans=0.125 2023-06-25 00:06:05,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1863576.0, ans=0.125 2023-06-25 00:06:06,176 INFO [train.py:996] (0/4) Epoch 11, batch 5650, loss[loss=0.242, simple_loss=0.3157, pruned_loss=0.0842, over 21872.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3132, pruned_loss=0.07075, over 4281523.57 frames. ], batch size: 351, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:06:29,800 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.83 vs. limit=10.0 2023-06-25 00:06:32,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.004e+02 8.541e+02 1.292e+03 2.009e+03 3.827e+03, threshold=2.583e+03, percent-clipped=13.0 2023-06-25 00:06:53,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-25 00:07:23,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1863756.0, ans=0.125 2023-06-25 00:07:25,723 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.49 vs. 
limit=22.5 2023-06-25 00:07:32,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1863756.0, ans=10.0 2023-06-25 00:07:41,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1863816.0, ans=0.2 2023-06-25 00:07:46,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1863816.0, ans=0.125 2023-06-25 00:07:57,690 INFO [train.py:996] (0/4) Epoch 11, batch 5700, loss[loss=0.2089, simple_loss=0.2818, pruned_loss=0.06794, over 21550.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3135, pruned_loss=0.07379, over 4281845.05 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:09:09,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-06-25 00:09:53,495 INFO [train.py:996] (0/4) Epoch 11, batch 5750, loss[loss=0.1754, simple_loss=0.2657, pruned_loss=0.04252, over 21687.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3086, pruned_loss=0.07026, over 4280627.43 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:10:08,443 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.527e+02 8.365e+02 1.283e+03 1.865e+03 4.523e+03, threshold=2.566e+03, percent-clipped=10.0 2023-06-25 00:10:31,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1864236.0, ans=0.0 2023-06-25 00:10:47,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1864296.0, ans=0.125 2023-06-25 00:11:09,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1864356.0, ans=0.2 2023-06-25 00:11:34,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1864416.0, ans=0.125 2023-06-25 00:11:39,496 INFO [train.py:996] (0/4) Epoch 11, batch 5800, loss[loss=0.2167, simple_loss=0.3122, pruned_loss=0.06061, over 21731.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3097, pruned_loss=0.06899, over 4271659.24 frames. ], batch size: 298, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:11:46,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1864476.0, ans=0.125 2023-06-25 00:12:00,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=22.5 2023-06-25 00:12:51,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1864656.0, ans=0.125 2023-06-25 00:13:02,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1864656.0, ans=0.2 2023-06-25 00:13:32,922 INFO [train.py:996] (0/4) Epoch 11, batch 5850, loss[loss=0.2138, simple_loss=0.3196, pruned_loss=0.05399, over 21625.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3082, pruned_loss=0.06552, over 4278252.64 frames. 
], batch size: 441, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:13:53,416 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.480e+02 6.927e+02 1.116e+03 1.995e+03 4.965e+03, threshold=2.231e+03, percent-clipped=19.0 2023-06-25 00:14:08,988 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-25 00:14:40,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1864956.0, ans=0.125 2023-06-25 00:15:17,043 INFO [train.py:996] (0/4) Epoch 11, batch 5900, loss[loss=0.2536, simple_loss=0.3228, pruned_loss=0.09218, over 20046.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2994, pruned_loss=0.06016, over 4282599.64 frames. ], batch size: 702, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:15:36,395 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-25 00:16:16,918 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.46 vs. limit=5.0 2023-06-25 00:17:06,597 INFO [train.py:996] (0/4) Epoch 11, batch 5950, loss[loss=0.2339, simple_loss=0.2988, pruned_loss=0.08453, over 21358.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2987, pruned_loss=0.06423, over 4282236.98 frames. ], batch size: 131, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:17:21,632 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.131e+02 6.600e+02 8.461e+02 1.275e+03 2.602e+03, threshold=1.692e+03, percent-clipped=3.0 2023-06-25 00:17:41,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1865496.0, ans=0.125 2023-06-25 00:18:19,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1865556.0, ans=0.125 2023-06-25 00:18:34,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1865616.0, ans=0.2 2023-06-25 00:18:51,547 INFO [train.py:996] (0/4) Epoch 11, batch 6000, loss[loss=0.2394, simple_loss=0.2953, pruned_loss=0.09179, over 21517.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2943, pruned_loss=0.06716, over 4271925.90 frames. ], batch size: 391, lr: 2.69e-03, grad_scale: 32.0 2023-06-25 00:18:51,548 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 00:19:04,438 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.9549, 1.8753, 1.6134, 2.0727, 1.7129, 1.9546, 1.8733, 1.8239], device='cuda:0') 2023-06-25 00:19:08,581 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2642, simple_loss=0.3568, pruned_loss=0.08578, over 1796401.00 frames. 2023-06-25 00:19:08,582 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 00:19:55,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1865796.0, ans=0.125 2023-06-25 00:20:53,372 INFO [train.py:996] (0/4) Epoch 11, batch 6050, loss[loss=0.2295, simple_loss=0.2844, pruned_loss=0.08723, over 21374.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2904, pruned_loss=0.06929, over 4263286.40 frames. 
], batch size: 160, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:21:01,205 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-25 00:21:18,401 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.947e+02 8.062e+02 1.043e+03 1.359e+03 2.248e+03, threshold=2.086e+03, percent-clipped=5.0 2023-06-25 00:22:20,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1866156.0, ans=0.0 2023-06-25 00:22:39,250 INFO [train.py:996] (0/4) Epoch 11, batch 6100, loss[loss=0.2031, simple_loss=0.2893, pruned_loss=0.05843, over 21591.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2902, pruned_loss=0.06831, over 4268419.62 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:23:12,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1866336.0, ans=0.0 2023-06-25 00:23:48,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1866456.0, ans=0.125 2023-06-25 00:24:04,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1866456.0, ans=0.125 2023-06-25 00:24:21,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1866516.0, ans=0.0 2023-06-25 00:24:27,323 INFO [train.py:996] (0/4) Epoch 11, batch 6150, loss[loss=0.212, simple_loss=0.2825, pruned_loss=0.07069, over 21844.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.294, pruned_loss=0.07075, over 4265426.18 frames. ], batch size: 98, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:24:42,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1866576.0, ans=0.0 2023-06-25 00:24:58,603 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.664e+02 7.679e+02 1.290e+03 1.928e+03 3.741e+03, threshold=2.581e+03, percent-clipped=18.0 2023-06-25 00:25:29,219 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0 2023-06-25 00:25:31,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1866696.0, ans=0.0 2023-06-25 00:25:47,972 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-06-25 00:25:57,540 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5 2023-06-25 00:26:19,968 INFO [train.py:996] (0/4) Epoch 11, batch 6200, loss[loss=0.2311, simple_loss=0.3193, pruned_loss=0.07149, over 21798.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.297, pruned_loss=0.07185, over 4273403.77 frames. 
], batch size: 298, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:26:41,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1866936.0, ans=0.125 2023-06-25 00:26:49,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1866936.0, ans=0.1 2023-06-25 00:27:06,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-25 00:27:10,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-06-25 00:28:06,351 INFO [train.py:996] (0/4) Epoch 11, batch 6250, loss[loss=0.2319, simple_loss=0.3424, pruned_loss=0.0607, over 21716.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3026, pruned_loss=0.07119, over 4282340.43 frames. ], batch size: 351, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:28:29,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1867236.0, ans=0.1 2023-06-25 00:28:31,525 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.849e+02 8.847e+02 1.490e+03 2.226e+03 5.467e+03, threshold=2.981e+03, percent-clipped=18.0 2023-06-25 00:28:39,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=12.0 2023-06-25 00:28:57,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.82 vs. limit=6.0 2023-06-25 00:29:52,772 INFO [train.py:996] (0/4) Epoch 11, batch 6300, loss[loss=0.284, simple_loss=0.3332, pruned_loss=0.1174, over 21766.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3044, pruned_loss=0.07053, over 4281899.93 frames. ], batch size: 507, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:30:51,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1867596.0, ans=0.1 2023-06-25 00:31:44,189 INFO [train.py:996] (0/4) Epoch 11, batch 6350, loss[loss=0.2957, simple_loss=0.3565, pruned_loss=0.1174, over 21589.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3082, pruned_loss=0.07547, over 4286024.63 frames. ], batch size: 389, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:32:06,112 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.04 vs. limit=22.5 2023-06-25 00:32:08,044 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.095e+02 6.705e+02 8.360e+02 1.250e+03 2.332e+03, threshold=1.672e+03, percent-clipped=0.0 2023-06-25 00:32:49,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1867956.0, ans=0.125 2023-06-25 00:32:52,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1867956.0, ans=0.125 2023-06-25 00:33:37,568 INFO [train.py:996] (0/4) Epoch 11, batch 6400, loss[loss=0.2569, simple_loss=0.3261, pruned_loss=0.09387, over 21406.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.313, pruned_loss=0.07918, over 4289399.99 frames. 
], batch size: 549, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:33:42,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.62 vs. limit=10.0 2023-06-25 00:34:10,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1868136.0, ans=0.95 2023-06-25 00:34:26,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1868196.0, ans=0.125 2023-06-25 00:35:26,311 INFO [train.py:996] (0/4) Epoch 11, batch 6450, loss[loss=0.2147, simple_loss=0.2984, pruned_loss=0.06548, over 21686.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3159, pruned_loss=0.07834, over 4285368.52 frames. ], batch size: 282, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:35:51,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.519e+02 9.176e+02 1.134e+03 1.706e+03 4.418e+03, threshold=2.268e+03, percent-clipped=27.0 2023-06-25 00:36:09,523 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:36:29,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0 2023-06-25 00:36:49,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1868556.0, ans=0.1 2023-06-25 00:37:13,882 INFO [train.py:996] (0/4) Epoch 11, batch 6500, loss[loss=0.2358, simple_loss=0.2965, pruned_loss=0.08762, over 21831.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3104, pruned_loss=0.07713, over 4287216.16 frames. ], batch size: 372, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:37:26,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1868676.0, ans=0.2 2023-06-25 00:38:14,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1868856.0, ans=0.0 2023-06-25 00:38:56,941 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:38:59,862 INFO [train.py:996] (0/4) Epoch 11, batch 6550, loss[loss=0.2675, simple_loss=0.3399, pruned_loss=0.09761, over 21746.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3096, pruned_loss=0.07539, over 4282110.46 frames. ], batch size: 414, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:39:24,193 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.737e+02 9.229e+02 1.425e+03 2.181e+03 3.625e+03, threshold=2.850e+03, percent-clipped=21.0 2023-06-25 00:39:25,464 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.71 vs. limit=22.5 2023-06-25 00:39:51,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1869096.0, ans=0.125 2023-06-25 00:40:47,116 INFO [train.py:996] (0/4) Epoch 11, batch 6600, loss[loss=0.1943, simple_loss=0.2594, pruned_loss=0.06463, over 21773.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3038, pruned_loss=0.07518, over 4278509.77 frames. 
], batch size: 317, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:41:24,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1869336.0, ans=0.125 2023-06-25 00:42:27,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1869516.0, ans=0.025 2023-06-25 00:42:36,121 INFO [train.py:996] (0/4) Epoch 11, batch 6650, loss[loss=0.2072, simple_loss=0.2726, pruned_loss=0.07093, over 21302.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2965, pruned_loss=0.07285, over 4270609.73 frames. ], batch size: 160, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:43:06,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 5.753e+02 7.174e+02 1.040e+03 2.181e+03, threshold=1.435e+03, percent-clipped=0.0 2023-06-25 00:43:39,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1869696.0, ans=0.125 2023-06-25 00:44:22,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1869816.0, ans=0.1 2023-06-25 00:44:32,444 INFO [train.py:996] (0/4) Epoch 11, batch 6700, loss[loss=0.1907, simple_loss=0.2666, pruned_loss=0.05746, over 21602.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2902, pruned_loss=0.07162, over 4273710.01 frames. ], batch size: 247, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:44:36,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1869876.0, ans=0.2 2023-06-25 00:44:52,343 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:45:04,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1869936.0, ans=0.0 2023-06-25 00:45:46,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1870056.0, ans=0.0 2023-06-25 00:45:59,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1870116.0, ans=0.2 2023-06-25 00:46:14,657 INFO [train.py:996] (0/4) Epoch 11, batch 6750, loss[loss=0.2599, simple_loss=0.3211, pruned_loss=0.09938, over 21838.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2888, pruned_loss=0.07185, over 4275492.83 frames. 
], batch size: 371, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:46:36,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1870236.0, ans=0.1 2023-06-25 00:46:46,805 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.712e+02 8.208e+02 1.148e+03 1.600e+03 3.333e+03, threshold=2.296e+03, percent-clipped=33.0 2023-06-25 00:47:16,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1870296.0, ans=0.125 2023-06-25 00:47:42,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1870416.0, ans=0.125 2023-06-25 00:47:56,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1870416.0, ans=0.125 2023-06-25 00:47:59,236 INFO [train.py:996] (0/4) Epoch 11, batch 6800, loss[loss=0.2166, simple_loss=0.2826, pruned_loss=0.07532, over 21646.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2913, pruned_loss=0.0735, over 4275498.71 frames. ], batch size: 298, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:48:46,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1870596.0, ans=0.125 2023-06-25 00:49:03,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1870596.0, ans=0.1 2023-06-25 00:49:19,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1870656.0, ans=15.0 2023-06-25 00:49:42,007 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:49:44,496 INFO [train.py:996] (0/4) Epoch 11, batch 6850, loss[loss=0.2485, simple_loss=0.3047, pruned_loss=0.09614, over 21579.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2921, pruned_loss=0.0757, over 4276488.41 frames. ], batch size: 473, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:50:02,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1870776.0, ans=0.125 2023-06-25 00:50:16,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 8.303e+02 1.235e+03 2.153e+03 3.729e+03, threshold=2.471e+03, percent-clipped=22.0 2023-06-25 00:51:23,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1871016.0, ans=0.125 2023-06-25 00:51:31,199 INFO [train.py:996] (0/4) Epoch 11, batch 6900, loss[loss=0.2264, simple_loss=0.3268, pruned_loss=0.06298, over 21689.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2924, pruned_loss=0.0756, over 4283923.66 frames. 
], batch size: 414, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:52:00,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1871136.0, ans=0.125 2023-06-25 00:52:07,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1871136.0, ans=0.125 2023-06-25 00:52:57,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1871256.0, ans=0.125 2023-06-25 00:53:27,066 INFO [train.py:996] (0/4) Epoch 11, batch 6950, loss[loss=0.1997, simple_loss=0.3077, pruned_loss=0.0458, over 21693.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2952, pruned_loss=0.07364, over 4283844.20 frames. ], batch size: 441, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:53:38,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1871376.0, ans=0.0 2023-06-25 00:53:40,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1871376.0, ans=0.07 2023-06-25 00:53:51,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1871436.0, ans=0.0 2023-06-25 00:53:53,765 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.630e+02 7.235e+02 1.015e+03 1.522e+03 6.325e+03, threshold=2.030e+03, percent-clipped=9.0 2023-06-25 00:53:54,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1871436.0, ans=0.125 2023-06-25 00:54:44,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1871556.0, ans=0.125 2023-06-25 00:55:08,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.81 vs. limit=10.0 2023-06-25 00:55:15,825 INFO [train.py:996] (0/4) Epoch 11, batch 7000, loss[loss=0.2237, simple_loss=0.2894, pruned_loss=0.07904, over 21773.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2977, pruned_loss=0.075, over 4263211.51 frames. ], batch size: 371, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:57:10,150 INFO [train.py:996] (0/4) Epoch 11, batch 7050, loss[loss=0.2089, simple_loss=0.29, pruned_loss=0.06383, over 21609.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2956, pruned_loss=0.07393, over 4264415.17 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:57:15,144 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-312000.pt 2023-06-25 00:57:37,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.311e+02 8.822e+02 1.310e+03 1.745e+03 4.662e+03, threshold=2.619e+03, percent-clipped=19.0 2023-06-25 00:58:07,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1872096.0, ans=0.0 2023-06-25 00:58:15,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.04 vs. 
limit=15.0 2023-06-25 00:58:41,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1872216.0, ans=0.125 2023-06-25 00:59:02,561 INFO [train.py:996] (0/4) Epoch 11, batch 7100, loss[loss=0.2719, simple_loss=0.3381, pruned_loss=0.1029, over 21207.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3014, pruned_loss=0.07626, over 4267564.61 frames. ], batch size: 143, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:59:15,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-25 01:00:20,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1872456.0, ans=0.125 2023-06-25 01:00:41,020 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5 2023-06-25 01:00:53,327 INFO [train.py:996] (0/4) Epoch 11, batch 7150, loss[loss=0.2855, simple_loss=0.3512, pruned_loss=0.1099, over 21417.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2974, pruned_loss=0.0733, over 4271004.26 frames. ], batch size: 471, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:01:14,503 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.65 vs. limit=12.0 2023-06-25 01:01:19,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-25 01:01:25,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.968e+02 7.662e+02 1.147e+03 1.671e+03 2.803e+03, threshold=2.294e+03, percent-clipped=2.0 2023-06-25 01:01:37,799 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-25 01:02:49,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=22.5 2023-06-25 01:02:51,325 INFO [train.py:996] (0/4) Epoch 11, batch 7200, loss[loss=0.217, simple_loss=0.2808, pruned_loss=0.0766, over 21837.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3006, pruned_loss=0.0755, over 4268293.57 frames. ], batch size: 373, lr: 2.69e-03, grad_scale: 32.0 2023-06-25 01:02:57,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-25 01:03:05,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1872876.0, ans=0.125 2023-06-25 01:03:15,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.40 vs. 
limit=12.0 2023-06-25 01:03:25,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1872936.0, ans=0.2 2023-06-25 01:03:30,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1872996.0, ans=0.0 2023-06-25 01:03:50,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1872996.0, ans=0.2 2023-06-25 01:04:40,400 INFO [train.py:996] (0/4) Epoch 11, batch 7250, loss[loss=0.1997, simple_loss=0.2623, pruned_loss=0.06859, over 21156.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2977, pruned_loss=0.07572, over 4270167.49 frames. ], batch size: 143, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:04:41,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1873176.0, ans=0.125 2023-06-25 01:04:47,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1873176.0, ans=0.2 2023-06-25 01:05:01,389 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.15 vs. limit=15.0 2023-06-25 01:05:02,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1873236.0, ans=0.0 2023-06-25 01:05:06,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.525e+02 1.021e+03 1.447e+03 2.035e+03 4.041e+03, threshold=2.893e+03, percent-clipped=18.0 2023-06-25 01:05:12,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1873236.0, ans=0.125 2023-06-25 01:06:24,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1873416.0, ans=0.1 2023-06-25 01:06:27,166 INFO [train.py:996] (0/4) Epoch 11, batch 7300, loss[loss=0.2051, simple_loss=0.269, pruned_loss=0.0706, over 21655.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2921, pruned_loss=0.07523, over 4269132.41 frames. ], batch size: 333, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:06:46,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1873476.0, ans=0.125 2023-06-25 01:07:44,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1873656.0, ans=0.0 2023-06-25 01:07:56,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1873716.0, ans=0.125 2023-06-25 01:08:04,965 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-25 01:08:16,369 INFO [train.py:996] (0/4) Epoch 11, batch 7350, loss[loss=0.2704, simple_loss=0.3313, pruned_loss=0.1047, over 21731.00 frames. ], tot_loss[loss=0.222, simple_loss=0.291, pruned_loss=0.07652, over 4267133.00 frames. 
], batch size: 441, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:08:43,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.949e+02 8.143e+02 1.181e+03 1.694e+03 4.027e+03, threshold=2.361e+03, percent-clipped=4.0 2023-06-25 01:09:27,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1873956.0, ans=0.125 2023-06-25 01:09:30,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1873956.0, ans=0.125 2023-06-25 01:10:04,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.93 vs. limit=10.0 2023-06-25 01:10:05,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1874076.0, ans=0.125 2023-06-25 01:10:11,703 INFO [train.py:996] (0/4) Epoch 11, batch 7400, loss[loss=0.1959, simple_loss=0.2678, pruned_loss=0.06201, over 21297.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2963, pruned_loss=0.07828, over 4273253.11 frames. ], batch size: 176, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:10:27,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1874136.0, ans=0.1 2023-06-25 01:10:47,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1874136.0, ans=0.125 2023-06-25 01:12:03,188 INFO [train.py:996] (0/4) Epoch 11, batch 7450, loss[loss=0.2163, simple_loss=0.2793, pruned_loss=0.07671, over 21588.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2947, pruned_loss=0.07712, over 4268778.98 frames. ], batch size: 231, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:12:33,137 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.846e+02 7.768e+02 1.010e+03 1.629e+03 4.953e+03, threshold=2.020e+03, percent-clipped=6.0 2023-06-25 01:12:52,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1874496.0, ans=0.1 2023-06-25 01:13:17,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-25 01:13:26,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1874556.0, ans=0.0 2023-06-25 01:13:53,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1874676.0, ans=0.0 2023-06-25 01:13:54,441 INFO [train.py:996] (0/4) Epoch 11, batch 7500, loss[loss=0.2262, simple_loss=0.333, pruned_loss=0.05971, over 21678.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3008, pruned_loss=0.07849, over 4267852.02 frames. 
], batch size: 247, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:14:08,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1874676.0, ans=0.2 2023-06-25 01:14:10,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1874736.0, ans=0.0 2023-06-25 01:14:32,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1874736.0, ans=0.0 2023-06-25 01:14:48,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1874796.0, ans=0.125 2023-06-25 01:14:53,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1874796.0, ans=0.2 2023-06-25 01:15:43,444 INFO [train.py:996] (0/4) Epoch 11, batch 7550, loss[loss=0.2053, simple_loss=0.2716, pruned_loss=0.0695, over 21226.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3077, pruned_loss=0.0782, over 4273352.60 frames. ], batch size: 608, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:15:54,156 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.63 vs. limit=6.0 2023-06-25 01:16:17,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 9.851e+02 1.650e+03 2.404e+03 5.031e+03, threshold=3.301e+03, percent-clipped=35.0 2023-06-25 01:16:32,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1875096.0, ans=0.05 2023-06-25 01:16:48,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1875156.0, ans=0.125 2023-06-25 01:17:12,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1875216.0, ans=0.125 2023-06-25 01:17:15,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=22.5 2023-06-25 01:17:29,926 INFO [train.py:996] (0/4) Epoch 11, batch 7600, loss[loss=0.2016, simple_loss=0.3181, pruned_loss=0.0426, over 20818.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3062, pruned_loss=0.07589, over 4276597.61 frames. ], batch size: 608, lr: 2.68e-03, grad_scale: 32.0 2023-06-25 01:18:03,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1875336.0, ans=0.2 2023-06-25 01:18:03,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1875336.0, ans=0.125 2023-06-25 01:18:18,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1875396.0, ans=0.2 2023-06-25 01:18:26,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1875396.0, ans=0.07 2023-06-25 01:18:37,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1875456.0, ans=0.0 2023-06-25 01:19:14,344 INFO [train.py:996] (0/4) Epoch 11, batch 7650, loss[loss=0.2289, simple_loss=0.2898, pruned_loss=0.08401, over 21906.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3062, pruned_loss=0.07768, over 4279929.26 frames. 
], batch size: 283, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:19:44,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.184e+02 7.567e+02 1.161e+03 1.543e+03 3.222e+03, threshold=2.322e+03, percent-clipped=0.0 2023-06-25 01:19:46,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1875636.0, ans=10.0 2023-06-25 01:19:50,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1875636.0, ans=0.2 2023-06-25 01:20:13,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1875756.0, ans=0.125 2023-06-25 01:20:15,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1875756.0, ans=0.2 2023-06-25 01:20:34,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875756.0, ans=0.1 2023-06-25 01:20:36,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1875756.0, ans=0.125 2023-06-25 01:20:56,027 INFO [train.py:996] (0/4) Epoch 11, batch 7700, loss[loss=0.2709, simple_loss=0.3445, pruned_loss=0.09861, over 21330.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3109, pruned_loss=0.08137, over 4283769.92 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:20:58,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1875876.0, ans=0.125 2023-06-25 01:21:49,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1875996.0, ans=0.2 2023-06-25 01:22:04,533 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-25 01:22:13,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1876056.0, ans=0.2 2023-06-25 01:22:14,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1876056.0, ans=0.0 2023-06-25 01:22:39,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1876116.0, ans=0.0 2023-06-25 01:22:42,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1876116.0, ans=0.125 2023-06-25 01:22:45,825 INFO [train.py:996] (0/4) Epoch 11, batch 7750, loss[loss=0.2658, simple_loss=0.3717, pruned_loss=0.07989, over 21867.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.318, pruned_loss=0.08075, over 4274892.06 frames. 
], batch size: 372, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:22:47,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1876176.0, ans=0.125 2023-06-25 01:23:10,567 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 7.928e+02 1.247e+03 1.821e+03 3.792e+03, threshold=2.494e+03, percent-clipped=12.0 2023-06-25 01:24:02,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1876356.0, ans=0.125 2023-06-25 01:24:31,897 INFO [train.py:996] (0/4) Epoch 11, batch 7800, loss[loss=0.2109, simple_loss=0.2758, pruned_loss=0.07302, over 21514.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.318, pruned_loss=0.0812, over 4267711.06 frames. ], batch size: 195, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:25:09,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1876536.0, ans=0.1 2023-06-25 01:25:29,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1876656.0, ans=0.125 2023-06-25 01:26:02,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1876716.0, ans=0.125 2023-06-25 01:26:04,694 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-06-25 01:26:15,686 INFO [train.py:996] (0/4) Epoch 11, batch 7850, loss[loss=0.2193, simple_loss=0.2745, pruned_loss=0.08206, over 21208.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3087, pruned_loss=0.08028, over 4260097.45 frames. ], batch size: 144, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:26:20,414 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.06 vs. limit=22.5 2023-06-25 01:26:23,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1876776.0, ans=0.125 2023-06-25 01:26:46,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.852e+02 8.105e+02 1.212e+03 1.898e+03 4.667e+03, threshold=2.425e+03, percent-clipped=9.0 2023-06-25 01:27:54,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1877016.0, ans=0.125 2023-06-25 01:28:06,258 INFO [train.py:996] (0/4) Epoch 11, batch 7900, loss[loss=0.2082, simple_loss=0.2667, pruned_loss=0.07487, over 21338.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3057, pruned_loss=0.07883, over 4256302.63 frames. 
], batch size: 177, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:28:39,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1877136.0, ans=0.125 2023-06-25 01:28:39,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1877136.0, ans=0.1 2023-06-25 01:29:05,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1877196.0, ans=0.125 2023-06-25 01:29:18,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1877256.0, ans=10.0 2023-06-25 01:29:57,773 INFO [train.py:996] (0/4) Epoch 11, batch 7950, loss[loss=0.2281, simple_loss=0.3132, pruned_loss=0.07152, over 20789.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3106, pruned_loss=0.07812, over 4266425.76 frames. ], batch size: 609, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:30:35,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.494e+02 9.486e+02 1.599e+03 2.410e+03 5.026e+03, threshold=3.197e+03, percent-clipped=23.0 2023-06-25 01:31:19,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1877556.0, ans=0.2 2023-06-25 01:31:20,414 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0 2023-06-25 01:31:43,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-25 01:32:03,952 INFO [train.py:996] (0/4) Epoch 11, batch 8000, loss[loss=0.3525, simple_loss=0.4108, pruned_loss=0.1471, over 21366.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3151, pruned_loss=0.08056, over 4273500.13 frames. ], batch size: 507, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:32:11,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1877676.0, ans=0.1 2023-06-25 01:32:17,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1877676.0, ans=0.125 2023-06-25 01:33:13,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1877856.0, ans=0.125 2023-06-25 01:33:40,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1877916.0, ans=0.0 2023-06-25 01:33:57,005 INFO [train.py:996] (0/4) Epoch 11, batch 8050, loss[loss=0.2653, simple_loss=0.3539, pruned_loss=0.08842, over 21664.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3172, pruned_loss=0.08034, over 4269178.73 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:34:34,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.082e+02 8.572e+02 1.267e+03 1.861e+03 4.173e+03, threshold=2.534e+03, percent-clipped=4.0 2023-06-25 01:34:51,881 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.87 vs. 
limit=22.5 2023-06-25 01:35:25,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1878156.0, ans=0.125 2023-06-25 01:35:26,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1878216.0, ans=10.0 2023-06-25 01:35:45,787 INFO [train.py:996] (0/4) Epoch 11, batch 8100, loss[loss=0.2242, simple_loss=0.293, pruned_loss=0.07773, over 21553.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3158, pruned_loss=0.08043, over 4271227.75 frames. ], batch size: 212, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:35:46,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1878276.0, ans=0.1 2023-06-25 01:36:10,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1878276.0, ans=0.2 2023-06-25 01:36:47,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1878396.0, ans=0.0 2023-06-25 01:37:48,345 INFO [train.py:996] (0/4) Epoch 11, batch 8150, loss[loss=0.2532, simple_loss=0.3479, pruned_loss=0.07926, over 21678.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3219, pruned_loss=0.08179, over 4275458.47 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:38:17,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1878636.0, ans=0.0 2023-06-25 01:38:17,987 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.023e+02 7.751e+02 1.218e+03 2.122e+03 5.445e+03, threshold=2.437e+03, percent-clipped=16.0 2023-06-25 01:38:38,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1878696.0, ans=0.125 2023-06-25 01:39:02,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1878756.0, ans=0.1 2023-06-25 01:39:30,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=12.0 2023-06-25 01:39:39,824 INFO [train.py:996] (0/4) Epoch 11, batch 8200, loss[loss=0.2236, simple_loss=0.2871, pruned_loss=0.08011, over 21878.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.314, pruned_loss=0.0793, over 4273042.59 frames. ], batch size: 373, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:39:42,437 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.99 vs. limit=10.0 2023-06-25 01:40:36,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1878996.0, ans=0.0 2023-06-25 01:40:44,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1879056.0, ans=0.125 2023-06-25 01:41:24,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1879116.0, ans=0.0 2023-06-25 01:41:28,965 INFO [train.py:996] (0/4) Epoch 11, batch 8250, loss[loss=0.2233, simple_loss=0.2996, pruned_loss=0.07351, over 21270.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3108, pruned_loss=0.07911, over 4266341.88 frames. 
], batch size: 176, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:41:36,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1879176.0, ans=0.2 2023-06-25 01:41:59,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1879236.0, ans=0.0 2023-06-25 01:42:00,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 7.306e+02 1.035e+03 1.633e+03 3.565e+03, threshold=2.069e+03, percent-clipped=11.0 2023-06-25 01:42:19,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1879296.0, ans=0.125 2023-06-25 01:42:41,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1879356.0, ans=0.1 2023-06-25 01:43:01,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1879416.0, ans=0.125 2023-06-25 01:43:17,173 INFO [train.py:996] (0/4) Epoch 11, batch 8300, loss[loss=0.2093, simple_loss=0.3055, pruned_loss=0.05655, over 21683.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3103, pruned_loss=0.07687, over 4271144.75 frames. ], batch size: 351, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:44:02,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1879596.0, ans=0.125 2023-06-25 01:44:24,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1879656.0, ans=0.125 2023-06-25 01:45:04,806 INFO [train.py:996] (0/4) Epoch 11, batch 8350, loss[loss=0.2168, simple_loss=0.301, pruned_loss=0.06628, over 21657.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3105, pruned_loss=0.07611, over 4266894.61 frames. ], batch size: 332, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:45:29,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-25 01:45:30,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1879836.0, ans=0.1 2023-06-25 01:45:44,603 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.048e+02 7.785e+02 1.165e+03 1.706e+03 3.630e+03, threshold=2.331e+03, percent-clipped=15.0 2023-06-25 01:45:55,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1879896.0, ans=0.125 2023-06-25 01:46:31,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1879956.0, ans=0.1 2023-06-25 01:46:32,590 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-25 01:46:53,170 INFO [train.py:996] (0/4) Epoch 11, batch 8400, loss[loss=0.1676, simple_loss=0.2626, pruned_loss=0.03629, over 21736.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3045, pruned_loss=0.07211, over 4260572.46 frames. ], batch size: 282, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:47:17,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.39 vs. 
limit=22.5 2023-06-25 01:47:40,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1880196.0, ans=0.1 2023-06-25 01:48:41,844 INFO [train.py:996] (0/4) Epoch 11, batch 8450, loss[loss=0.238, simple_loss=0.3055, pruned_loss=0.08525, over 21904.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3064, pruned_loss=0.0729, over 4263347.43 frames. ], batch size: 124, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:48:48,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1880376.0, ans=0.125 2023-06-25 01:49:02,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1880436.0, ans=0.125 2023-06-25 01:49:13,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1880436.0, ans=0.125 2023-06-25 01:49:19,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-25 01:49:20,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.263e+02 6.433e+02 1.170e+03 1.916e+03 4.574e+03, threshold=2.341e+03, percent-clipped=17.0 2023-06-25 01:49:21,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1880436.0, ans=0.0 2023-06-25 01:49:21,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-25 01:49:32,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1880496.0, ans=0.2 2023-06-25 01:49:50,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1880556.0, ans=0.125 2023-06-25 01:50:30,063 INFO [train.py:996] (0/4) Epoch 11, batch 8500, loss[loss=0.2192, simple_loss=0.2825, pruned_loss=0.07793, over 21725.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3024, pruned_loss=0.07414, over 4265236.59 frames. ], batch size: 316, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:50:30,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1880676.0, ans=0.2 2023-06-25 01:51:04,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1880736.0, ans=0.1 2023-06-25 01:51:06,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1880736.0, ans=0.1 2023-06-25 01:52:07,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1880916.0, ans=0.0 2023-06-25 01:52:10,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1880916.0, ans=0.125 2023-06-25 01:52:18,602 INFO [train.py:996] (0/4) Epoch 11, batch 8550, loss[loss=0.2077, simple_loss=0.2852, pruned_loss=0.06508, over 21262.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3045, pruned_loss=0.07636, over 4270613.55 frames. 
], batch size: 144, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:52:29,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1880976.0, ans=0.0 2023-06-25 01:52:36,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1880976.0, ans=0.125 2023-06-25 01:52:46,976 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:52:56,693 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.844e+02 6.860e+02 9.469e+02 1.395e+03 3.551e+03, threshold=1.894e+03, percent-clipped=10.0 2023-06-25 01:53:10,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1881096.0, ans=0.125 2023-06-25 01:53:17,823 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=15.0 2023-06-25 01:53:39,970 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.70 vs. limit=8.0 2023-06-25 01:53:44,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1881156.0, ans=0.125 2023-06-25 01:53:57,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=22.5 2023-06-25 01:54:20,740 INFO [train.py:996] (0/4) Epoch 11, batch 8600, loss[loss=0.2371, simple_loss=0.3131, pruned_loss=0.08056, over 21726.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3113, pruned_loss=0.07825, over 4265211.43 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:54:34,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1881276.0, ans=0.07 2023-06-25 01:55:02,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1881396.0, ans=0.0 2023-06-25 01:55:23,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1881456.0, ans=0.125 2023-06-25 01:56:09,648 INFO [train.py:996] (0/4) Epoch 11, batch 8650, loss[loss=0.1857, simple_loss=0.2786, pruned_loss=0.0464, over 21760.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3178, pruned_loss=0.07895, over 4263798.79 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:56:43,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.081e+02 8.530e+02 1.308e+03 2.199e+03 5.345e+03, threshold=2.615e+03, percent-clipped=30.0 2023-06-25 01:57:09,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1881696.0, ans=0.0 2023-06-25 01:57:17,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1881756.0, ans=0.2 2023-06-25 01:57:31,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1881816.0, ans=0.1 2023-06-25 01:57:52,035 INFO [train.py:996] (0/4) Epoch 11, batch 8700, loss[loss=0.1972, simple_loss=0.2623, pruned_loss=0.06599, over 21767.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3113, pruned_loss=0.07649, over 4260345.51 frames. 
], batch size: 371, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:58:22,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1881936.0, ans=0.125 2023-06-25 01:59:07,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1882056.0, ans=0.2 2023-06-25 01:59:38,929 INFO [train.py:996] (0/4) Epoch 11, batch 8750, loss[loss=0.2613, simple_loss=0.3083, pruned_loss=0.1072, over 21704.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3089, pruned_loss=0.07708, over 4267511.66 frames. ], batch size: 508, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:59:40,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1882176.0, ans=0.125 2023-06-25 02:00:11,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1882236.0, ans=0.1 2023-06-25 02:00:25,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.601e+02 8.552e+02 1.572e+03 2.395e+03 4.841e+03, threshold=3.145e+03, percent-clipped=19.0 2023-06-25 02:01:12,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1882416.0, ans=0.125 2023-06-25 02:01:14,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1882416.0, ans=0.125 2023-06-25 02:01:17,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1882416.0, ans=0.125 2023-06-25 02:01:32,860 INFO [train.py:996] (0/4) Epoch 11, batch 8800, loss[loss=0.2507, simple_loss=0.3382, pruned_loss=0.08154, over 21775.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.317, pruned_loss=0.07974, over 4272577.49 frames. ], batch size: 332, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:01:44,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1882476.0, ans=0.125 2023-06-25 02:03:27,980 INFO [train.py:996] (0/4) Epoch 11, batch 8850, loss[loss=0.2318, simple_loss=0.3237, pruned_loss=0.0699, over 21767.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3239, pruned_loss=0.08265, over 4277939.82 frames. 
], batch size: 282, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:03:49,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1882836.0, ans=0.125 2023-06-25 02:03:58,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1882836.0, ans=0.09899494936611666 2023-06-25 02:04:04,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.975e+02 8.533e+02 1.157e+03 2.147e+03 4.267e+03, threshold=2.313e+03, percent-clipped=8.0 2023-06-25 02:04:08,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1882896.0, ans=0.2 2023-06-25 02:04:12,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1882896.0, ans=0.1 2023-06-25 02:05:00,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1883016.0, ans=0.125 2023-06-25 02:05:17,246 INFO [train.py:996] (0/4) Epoch 11, batch 8900, loss[loss=0.2134, simple_loss=0.2942, pruned_loss=0.06628, over 21702.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3168, pruned_loss=0.08145, over 4279837.58 frames. ], batch size: 282, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:05:40,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1883136.0, ans=0.05 2023-06-25 02:05:42,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1883136.0, ans=0.5 2023-06-25 02:05:48,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-25 02:06:18,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1883196.0, ans=0.1 2023-06-25 02:07:08,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1883316.0, ans=0.0 2023-06-25 02:07:13,528 INFO [train.py:996] (0/4) Epoch 11, batch 8950, loss[loss=0.2135, simple_loss=0.2815, pruned_loss=0.07274, over 21348.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3182, pruned_loss=0.08095, over 4273850.94 frames. ], batch size: 194, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:07:18,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1883376.0, ans=0.1 2023-06-25 02:07:42,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-06-25 02:07:48,289 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.752e+02 7.956e+02 1.198e+03 2.154e+03 4.592e+03, threshold=2.397e+03, percent-clipped=22.0 2023-06-25 02:08:21,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1883556.0, ans=0.0 2023-06-25 02:08:31,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1883616.0, ans=0.125 2023-06-25 02:08:35,312 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.43 vs. 
limit=15.0 2023-06-25 02:08:38,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.54 vs. limit=15.0 2023-06-25 02:08:52,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1883616.0, ans=0.125 2023-06-25 02:08:55,075 INFO [train.py:996] (0/4) Epoch 11, batch 9000, loss[loss=0.2256, simple_loss=0.2976, pruned_loss=0.07681, over 21714.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.311, pruned_loss=0.08055, over 4273700.60 frames. ], batch size: 282, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:08:55,077 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 02:09:12,570 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2589, simple_loss=0.3526, pruned_loss=0.08262, over 1796401.00 frames. 2023-06-25 02:09:12,571 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 02:09:15,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-25 02:09:22,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1883676.0, ans=0.125 2023-06-25 02:11:00,337 INFO [train.py:996] (0/4) Epoch 11, batch 9050, loss[loss=0.1844, simple_loss=0.2676, pruned_loss=0.0506, over 21729.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3065, pruned_loss=0.07687, over 4280771.71 frames. ], batch size: 247, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:11:00,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1883976.0, ans=0.0 2023-06-25 02:11:44,401 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.917e+02 7.004e+02 1.025e+03 1.804e+03 4.936e+03, threshold=2.049e+03, percent-clipped=10.0 2023-06-25 02:12:17,898 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-25 02:12:49,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1884276.0, ans=0.0 2023-06-25 02:12:50,838 INFO [train.py:996] (0/4) Epoch 11, batch 9100, loss[loss=0.2254, simple_loss=0.321, pruned_loss=0.06491, over 21721.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3125, pruned_loss=0.07878, over 4282078.93 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:12:56,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1884276.0, ans=0.0 2023-06-25 02:12:59,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-25 02:13:41,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2023-06-25 02:14:40,245 INFO [train.py:996] (0/4) Epoch 11, batch 9150, loss[loss=0.247, simple_loss=0.3737, pruned_loss=0.06013, over 19809.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3149, pruned_loss=0.07569, over 4272684.08 frames. 
], batch size: 702, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:14:42,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1884576.0, ans=0.0 2023-06-25 02:15:21,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.226e+02 1.034e+03 1.434e+03 2.123e+03 3.847e+03, threshold=2.868e+03, percent-clipped=26.0 2023-06-25 02:16:33,383 INFO [train.py:996] (0/4) Epoch 11, batch 9200, loss[loss=0.241, simple_loss=0.3209, pruned_loss=0.08058, over 21314.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3182, pruned_loss=0.07557, over 4273515.42 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:16:35,942 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-25 02:16:48,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1884876.0, ans=0.125 2023-06-25 02:17:38,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1885056.0, ans=0.125 2023-06-25 02:17:46,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1885056.0, ans=0.2 2023-06-25 02:18:20,294 INFO [train.py:996] (0/4) Epoch 11, batch 9250, loss[loss=0.2495, simple_loss=0.3167, pruned_loss=0.09109, over 21792.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3216, pruned_loss=0.07964, over 4274443.04 frames. ], batch size: 124, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:18:44,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-25 02:18:56,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.130e+02 8.312e+02 1.043e+03 1.613e+03 4.110e+03, threshold=2.085e+03, percent-clipped=7.0 2023-06-25 02:19:14,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1885296.0, ans=0.0 2023-06-25 02:19:27,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-25 02:19:59,325 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.06 vs. limit=15.0 2023-06-25 02:20:14,375 INFO [train.py:996] (0/4) Epoch 11, batch 9300, loss[loss=0.2636, simple_loss=0.3517, pruned_loss=0.08771, over 21862.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3146, pruned_loss=0.07908, over 4273551.87 frames. ], batch size: 372, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:21:14,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1885596.0, ans=0.2 2023-06-25 02:21:53,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1885716.0, ans=0.125 2023-06-25 02:21:55,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1885716.0, ans=0.0 2023-06-25 02:22:02,531 INFO [train.py:996] (0/4) Epoch 11, batch 9350, loss[loss=0.2517, simple_loss=0.3341, pruned_loss=0.08468, over 21603.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3195, pruned_loss=0.07928, over 4274216.84 frames. 
], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:22:10,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0 2023-06-25 02:22:41,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.377e+02 8.449e+02 1.377e+03 2.044e+03 3.190e+03, threshold=2.753e+03, percent-clipped=23.0 2023-06-25 02:23:39,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1886016.0, ans=0.04949747468305833 2023-06-25 02:23:52,748 INFO [train.py:996] (0/4) Epoch 11, batch 9400, loss[loss=0.1969, simple_loss=0.2732, pruned_loss=0.06029, over 21769.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3214, pruned_loss=0.07977, over 4272816.28 frames. ], batch size: 351, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:24:03,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1886076.0, ans=0.0 2023-06-25 02:24:24,699 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.82 vs. limit=22.5 2023-06-25 02:24:46,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1886196.0, ans=0.2 2023-06-25 02:24:56,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=22.5 2023-06-25 02:25:04,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1886256.0, ans=0.07 2023-06-25 02:25:21,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1886316.0, ans=0.1 2023-06-25 02:25:31,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1886316.0, ans=0.0 2023-06-25 02:25:44,631 INFO [train.py:996] (0/4) Epoch 11, batch 9450, loss[loss=0.2317, simple_loss=0.2877, pruned_loss=0.08788, over 21412.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3126, pruned_loss=0.0785, over 4274777.84 frames. ], batch size: 475, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:25:56,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-25 02:25:57,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1886376.0, ans=0.0 2023-06-25 02:26:20,767 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.121e+02 9.191e+02 1.408e+03 2.175e+03 4.648e+03, threshold=2.816e+03, percent-clipped=10.0 2023-06-25 02:26:26,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1886496.0, ans=0.1 2023-06-25 02:26:52,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1886556.0, ans=0.125 2023-06-25 02:27:05,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1886556.0, ans=0.2 2023-06-25 02:27:33,367 INFO [train.py:996] (0/4) Epoch 11, batch 9500, loss[loss=0.1855, simple_loss=0.2654, pruned_loss=0.05283, over 21622.00 frames. 
], tot_loss[loss=0.2298, simple_loss=0.3052, pruned_loss=0.07717, over 4279850.31 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:27:37,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1886676.0, ans=10.0 2023-06-25 02:27:38,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1886676.0, ans=0.09899494936611666 2023-06-25 02:28:01,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1886736.0, ans=0.0 2023-06-25 02:28:06,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1886736.0, ans=0.0 2023-06-25 02:29:22,397 INFO [train.py:996] (0/4) Epoch 11, batch 9550, loss[loss=0.2737, simple_loss=0.3558, pruned_loss=0.09576, over 21751.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3097, pruned_loss=0.07924, over 4279000.29 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:29:57,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.824e+02 8.918e+02 1.397e+03 2.020e+03 4.656e+03, threshold=2.794e+03, percent-clipped=11.0 2023-06-25 02:29:58,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1887036.0, ans=0.2 2023-06-25 02:30:05,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1887096.0, ans=0.0 2023-06-25 02:30:27,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1887156.0, ans=0.125 2023-06-25 02:30:44,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1887156.0, ans=0.125 2023-06-25 02:31:06,600 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-25 02:31:08,650 INFO [train.py:996] (0/4) Epoch 11, batch 9600, loss[loss=0.2248, simple_loss=0.297, pruned_loss=0.07631, over 21888.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3118, pruned_loss=0.08018, over 4285417.13 frames. ], batch size: 107, lr: 2.68e-03, grad_scale: 32.0 2023-06-25 02:31:10,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-25 02:31:37,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1887336.0, ans=0.1 2023-06-25 02:31:37,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1887336.0, ans=0.125 2023-06-25 02:31:57,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1887396.0, ans=0.125 2023-06-25 02:32:06,509 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. 
limit=15.0 2023-06-25 02:32:43,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1887516.0, ans=0.0 2023-06-25 02:32:45,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1887516.0, ans=0.125 2023-06-25 02:32:47,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1887516.0, ans=0.035 2023-06-25 02:32:53,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1887516.0, ans=0.05 2023-06-25 02:32:56,755 INFO [train.py:996] (0/4) Epoch 11, batch 9650, loss[loss=0.3139, simple_loss=0.3733, pruned_loss=0.1272, over 21774.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3135, pruned_loss=0.08058, over 4288498.88 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:32:59,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1887576.0, ans=0.125 2023-06-25 02:33:05,250 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2023-06-25 02:33:34,594 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 8.589e+02 1.260e+03 1.923e+03 2.986e+03, threshold=2.520e+03, percent-clipped=3.0 2023-06-25 02:34:05,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1887756.0, ans=0.0 2023-06-25 02:34:25,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1887756.0, ans=0.0 2023-06-25 02:34:28,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1887816.0, ans=0.0 2023-06-25 02:34:28,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1887816.0, ans=0.1 2023-06-25 02:34:29,339 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-25 02:34:45,503 INFO [train.py:996] (0/4) Epoch 11, batch 9700, loss[loss=0.2092, simple_loss=0.2845, pruned_loss=0.06688, over 21849.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3183, pruned_loss=0.08184, over 4282007.21 frames. ], batch size: 107, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:35:00,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1887876.0, ans=0.0 2023-06-25 02:35:11,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1887936.0, ans=0.0 2023-06-25 02:35:15,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1887936.0, ans=0.0 2023-06-25 02:35:34,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1887996.0, ans=0.125 2023-06-25 02:35:46,411 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. 
limit=6.0 2023-06-25 02:36:02,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1888056.0, ans=0.0 2023-06-25 02:36:17,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1888116.0, ans=0.125 2023-06-25 02:36:33,882 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.41 vs. limit=15.0 2023-06-25 02:36:34,161 INFO [train.py:996] (0/4) Epoch 11, batch 9750, loss[loss=0.2622, simple_loss=0.3426, pruned_loss=0.09095, over 16777.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3117, pruned_loss=0.08103, over 4263453.27 frames. ], batch size: 68, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:36:39,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-25 02:37:09,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.320e+02 8.182e+02 1.091e+03 1.675e+03 6.818e+03, threshold=2.183e+03, percent-clipped=8.0 2023-06-25 02:37:36,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1888296.0, ans=0.0 2023-06-25 02:37:43,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1888356.0, ans=0.125 2023-06-25 02:38:19,351 INFO [train.py:996] (0/4) Epoch 11, batch 9800, loss[loss=0.2426, simple_loss=0.3123, pruned_loss=0.08642, over 21739.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3118, pruned_loss=0.0807, over 4261762.07 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:38:38,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1888536.0, ans=0.0 2023-06-25 02:38:54,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1888536.0, ans=0.0 2023-06-25 02:39:30,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1888656.0, ans=0.0 2023-06-25 02:39:30,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1888656.0, ans=0.125 2023-06-25 02:39:40,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1888656.0, ans=0.125 2023-06-25 02:40:05,051 INFO [train.py:996] (0/4) Epoch 11, batch 9850, loss[loss=0.2255, simple_loss=0.2914, pruned_loss=0.07976, over 21785.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3088, pruned_loss=0.08089, over 4266235.98 frames. ], batch size: 371, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:40:36,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.05 vs. 
limit=6.0 2023-06-25 02:40:37,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1888836.0, ans=0.0 2023-06-25 02:40:42,035 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 6.727e+02 9.053e+02 1.353e+03 2.861e+03, threshold=1.811e+03, percent-clipped=2.0 2023-06-25 02:41:42,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1889016.0, ans=0.0 2023-06-25 02:41:43,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1889016.0, ans=0.125 2023-06-25 02:41:53,379 INFO [train.py:996] (0/4) Epoch 11, batch 9900, loss[loss=0.2896, simple_loss=0.3438, pruned_loss=0.1177, over 21259.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3041, pruned_loss=0.07995, over 4259096.32 frames. ], batch size: 159, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:42:33,856 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:42:40,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=15.0 2023-06-25 02:42:54,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1889196.0, ans=0.0 2023-06-25 02:43:01,835 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.73 vs. limit=15.0 2023-06-25 02:43:40,143 INFO [train.py:996] (0/4) Epoch 11, batch 9950, loss[loss=0.2432, simple_loss=0.3041, pruned_loss=0.09118, over 21863.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3048, pruned_loss=0.08229, over 4261913.15 frames. ], batch size: 98, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:43:57,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1889376.0, ans=0.09899494936611666 2023-06-25 02:44:20,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1889436.0, ans=0.125 2023-06-25 02:44:23,434 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.842e+02 7.849e+02 1.088e+03 1.572e+03 3.841e+03, threshold=2.175e+03, percent-clipped=17.0 2023-06-25 02:44:42,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1889496.0, ans=0.0 2023-06-25 02:44:47,412 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-25 02:45:33,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1889616.0, ans=0.1 2023-06-25 02:45:36,552 INFO [train.py:996] (0/4) Epoch 11, batch 10000, loss[loss=0.231, simple_loss=0.2968, pruned_loss=0.0826, over 21834.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3005, pruned_loss=0.08094, over 4256004.74 frames. 
], batch size: 118, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 02:46:07,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1889736.0, ans=0.1 2023-06-25 02:46:27,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-25 02:47:25,696 INFO [train.py:996] (0/4) Epoch 11, batch 10050, loss[loss=0.2513, simple_loss=0.327, pruned_loss=0.08782, over 21745.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3032, pruned_loss=0.08145, over 4262628.01 frames. ], batch size: 441, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:47:26,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1889976.0, ans=0.125 2023-06-25 02:47:38,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1889976.0, ans=0.0 2023-06-25 02:47:38,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1889976.0, ans=0.125 2023-06-25 02:47:38,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1889976.0, ans=0.125 2023-06-25 02:48:13,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.178e+02 7.347e+02 1.195e+03 1.566e+03 3.839e+03, threshold=2.391e+03, percent-clipped=10.0 2023-06-25 02:49:14,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1890276.0, ans=0.2 2023-06-25 02:49:16,352 INFO [train.py:996] (0/4) Epoch 11, batch 10100, loss[loss=0.1933, simple_loss=0.2623, pruned_loss=0.06216, over 21315.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3017, pruned_loss=0.07931, over 4263626.89 frames. ], batch size: 194, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:49:16,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1890276.0, ans=0.0 2023-06-25 02:49:48,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1890336.0, ans=0.125 2023-06-25 02:50:10,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1890396.0, ans=0.2 2023-06-25 02:51:00,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1890516.0, ans=0.0 2023-06-25 02:51:12,565 INFO [train.py:996] (0/4) Epoch 11, batch 10150, loss[loss=0.2374, simple_loss=0.3111, pruned_loss=0.08185, over 21726.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3072, pruned_loss=0.08085, over 4250276.17 frames. 
], batch size: 282, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:51:56,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1890636.0, ans=0.125 2023-06-25 02:51:59,206 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.141e+02 7.484e+02 1.008e+03 1.435e+03 3.139e+03, threshold=2.017e+03, percent-clipped=8.0 2023-06-25 02:52:18,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1890756.0, ans=0.125 2023-06-25 02:52:23,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1890756.0, ans=0.125 2023-06-25 02:52:54,899 INFO [train.py:996] (0/4) Epoch 11, batch 10200, loss[loss=0.2248, simple_loss=0.3176, pruned_loss=0.06599, over 21709.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3063, pruned_loss=0.07899, over 4246393.82 frames. ], batch size: 351, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:53:17,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1890876.0, ans=0.1 2023-06-25 02:53:29,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1890936.0, ans=0.0 2023-06-25 02:53:36,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1890936.0, ans=0.1 2023-06-25 02:54:18,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1891056.0, ans=0.1 2023-06-25 02:54:43,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1891116.0, ans=0.125 2023-06-25 02:54:47,861 INFO [train.py:996] (0/4) Epoch 11, batch 10250, loss[loss=0.2636, simple_loss=0.3722, pruned_loss=0.07751, over 19968.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.301, pruned_loss=0.07297, over 4258522.36 frames. ], batch size: 703, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:55:38,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.531e+02 8.201e+02 1.215e+03 1.712e+03 3.588e+03, threshold=2.431e+03, percent-clipped=17.0 2023-06-25 02:56:12,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1891356.0, ans=0.09899494936611666 2023-06-25 02:56:46,817 INFO [train.py:996] (0/4) Epoch 11, batch 10300, loss[loss=0.2326, simple_loss=0.3212, pruned_loss=0.07204, over 21414.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3046, pruned_loss=0.0741, over 4265663.57 frames. 
], batch size: 211, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:57:24,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1891536.0, ans=10.0 2023-06-25 02:57:28,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1891596.0, ans=0.125 2023-06-25 02:57:47,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1891596.0, ans=0.2 2023-06-25 02:58:02,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1891656.0, ans=0.0 2023-06-25 02:58:18,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-25 02:58:18,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-25 02:58:38,076 INFO [train.py:996] (0/4) Epoch 11, batch 10350, loss[loss=0.2846, simple_loss=0.3589, pruned_loss=0.1052, over 21420.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.308, pruned_loss=0.07507, over 4270867.05 frames. ], batch size: 507, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 02:59:01,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1891836.0, ans=0.1 2023-06-25 02:59:01,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1891836.0, ans=0.0 2023-06-25 02:59:25,194 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.545e+02 9.168e+02 1.323e+03 1.995e+03 3.228e+03, threshold=2.646e+03, percent-clipped=12.0 2023-06-25 02:59:37,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1891896.0, ans=0.0 2023-06-25 02:59:57,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1891956.0, ans=0.1 2023-06-25 03:00:14,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1892016.0, ans=0.1 2023-06-25 03:00:32,839 INFO [train.py:996] (0/4) Epoch 11, batch 10400, loss[loss=0.1545, simple_loss=0.1948, pruned_loss=0.05714, over 21716.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3017, pruned_loss=0.07433, over 4263283.93 frames. 
], batch size: 112, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:00:33,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1892076.0, ans=0.05 2023-06-25 03:00:37,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1892076.0, ans=0.0 2023-06-25 03:00:42,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1892076.0, ans=0.125 2023-06-25 03:01:03,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1892136.0, ans=0.125 2023-06-25 03:01:11,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1892196.0, ans=0.1 2023-06-25 03:01:41,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1892256.0, ans=0.125 2023-06-25 03:02:20,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1892316.0, ans=0.125 2023-06-25 03:02:23,367 INFO [train.py:996] (0/4) Epoch 11, batch 10450, loss[loss=0.2167, simple_loss=0.2966, pruned_loss=0.06842, over 21458.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3054, pruned_loss=0.07672, over 4259575.06 frames. ], batch size: 211, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:02:43,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1892376.0, ans=0.2 2023-06-25 03:03:04,290 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.499e+02 9.525e+02 1.455e+03 2.411e+03 5.571e+03, threshold=2.910e+03, percent-clipped=19.0 2023-06-25 03:04:01,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1892616.0, ans=0.125 2023-06-25 03:04:08,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1892616.0, ans=0.1 2023-06-25 03:04:09,219 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-25 03:04:11,719 INFO [train.py:996] (0/4) Epoch 11, batch 10500, loss[loss=0.2026, simple_loss=0.2651, pruned_loss=0.07009, over 21504.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3048, pruned_loss=0.07489, over 4252640.23 frames. ], batch size: 212, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:04:35,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1892736.0, ans=0.0 2023-06-25 03:04:35,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1892736.0, ans=0.125 2023-06-25 03:05:19,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1892856.0, ans=0.0 2023-06-25 03:05:33,125 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:05:57,606 INFO [train.py:996] (0/4) Epoch 11, batch 10550, loss[loss=0.1982, simple_loss=0.2575, pruned_loss=0.06947, over 21619.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.298, pruned_loss=0.07462, over 4244457.18 frames. 
], batch size: 231, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:06:39,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.737e+02 7.309e+02 9.989e+02 1.510e+03 3.276e+03, threshold=1.998e+03, percent-clipped=4.0 2023-06-25 03:07:14,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1893156.0, ans=0.0 2023-06-25 03:07:17,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1893156.0, ans=0.125 2023-06-25 03:07:29,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=15.0 2023-06-25 03:07:30,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1893216.0, ans=0.0 2023-06-25 03:07:51,042 INFO [train.py:996] (0/4) Epoch 11, batch 10600, loss[loss=0.2287, simple_loss=0.3312, pruned_loss=0.06308, over 21475.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2947, pruned_loss=0.07424, over 4239818.74 frames. ], batch size: 471, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:08:22,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1893336.0, ans=0.09899494936611666 2023-06-25 03:08:49,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1893396.0, ans=0.0 2023-06-25 03:08:52,017 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:09:14,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1893456.0, ans=0.125 2023-06-25 03:09:28,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1893516.0, ans=0.0 2023-06-25 03:09:28,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1893516.0, ans=0.0 2023-06-25 03:09:39,378 INFO [train.py:996] (0/4) Epoch 11, batch 10650, loss[loss=0.2309, simple_loss=0.3099, pruned_loss=0.07595, over 21672.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2961, pruned_loss=0.0717, over 4247128.76 frames. ], batch size: 414, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:10:10,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1893636.0, ans=0.125 2023-06-25 03:10:22,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1893696.0, ans=0.0 2023-06-25 03:10:23,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.435e+02 7.785e+02 1.184e+03 1.890e+03 4.480e+03, threshold=2.368e+03, percent-clipped=23.0 2023-06-25 03:11:01,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1893756.0, ans=0.1 2023-06-25 03:11:22,671 INFO [train.py:996] (0/4) Epoch 11, batch 10700, loss[loss=0.1802, simple_loss=0.2547, pruned_loss=0.05282, over 21447.00 frames. ], tot_loss[loss=0.218, simple_loss=0.294, pruned_loss=0.07097, over 4232987.50 frames. 
], batch size: 212, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:11:23,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1893876.0, ans=0.1 2023-06-25 03:11:36,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1893876.0, ans=0.1 2023-06-25 03:12:04,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1893936.0, ans=0.0 2023-06-25 03:13:02,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2023-06-25 03:13:10,003 INFO [train.py:996] (0/4) Epoch 11, batch 10750, loss[loss=0.2987, simple_loss=0.401, pruned_loss=0.09824, over 21340.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3055, pruned_loss=0.07592, over 4239351.82 frames. ], batch size: 548, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:13:25,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1894176.0, ans=0.1 2023-06-25 03:13:38,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1894236.0, ans=0.125 2023-06-25 03:13:41,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1894236.0, ans=0.125 2023-06-25 03:13:55,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1894236.0, ans=0.125 2023-06-25 03:14:06,441 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.572e+02 8.726e+02 1.242e+03 1.937e+03 5.296e+03, threshold=2.484e+03, percent-clipped=18.0 2023-06-25 03:15:11,146 INFO [train.py:996] (0/4) Epoch 11, batch 10800, loss[loss=0.2541, simple_loss=0.3305, pruned_loss=0.08878, over 21723.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3105, pruned_loss=0.07717, over 4244301.54 frames. ], batch size: 298, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:15:39,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1894536.0, ans=0.125 2023-06-25 03:16:20,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1894656.0, ans=0.0 2023-06-25 03:16:59,873 INFO [train.py:996] (0/4) Epoch 11, batch 10850, loss[loss=0.1911, simple_loss=0.2627, pruned_loss=0.05977, over 21384.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3137, pruned_loss=0.07889, over 4255974.14 frames. ], batch size: 194, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:17:28,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1894836.0, ans=0.125 2023-06-25 03:17:40,188 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:17:48,732 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.535e+02 7.569e+02 9.387e+02 1.863e+03 6.222e+03, threshold=1.877e+03, percent-clipped=9.0 2023-06-25 03:18:50,663 INFO [train.py:996] (0/4) Epoch 11, batch 10900, loss[loss=0.184, simple_loss=0.276, pruned_loss=0.04604, over 21569.00 frames. 
], tot_loss[loss=0.2301, simple_loss=0.3066, pruned_loss=0.07686, over 4252942.51 frames. ], batch size: 230, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:19:52,899 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=22.5 2023-06-25 03:20:03,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1895256.0, ans=0.0 2023-06-25 03:20:30,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1895316.0, ans=0.05 2023-06-25 03:20:37,867 INFO [train.py:996] (0/4) Epoch 11, batch 10950, loss[loss=0.2034, simple_loss=0.2725, pruned_loss=0.06716, over 21939.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3008, pruned_loss=0.07419, over 4256122.48 frames. ], batch size: 125, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:21:26,513 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.792e+02 7.087e+02 9.989e+02 1.560e+03 2.958e+03, threshold=1.998e+03, percent-clipped=15.0 2023-06-25 03:21:28,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1895496.0, ans=0.0 2023-06-25 03:21:35,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895496.0, ans=0.1 2023-06-25 03:22:03,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-06-25 03:22:25,934 INFO [train.py:996] (0/4) Epoch 11, batch 11000, loss[loss=0.2248, simple_loss=0.299, pruned_loss=0.07525, over 19987.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2996, pruned_loss=0.07453, over 4256905.53 frames. ], batch size: 703, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:22:31,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1895676.0, ans=0.2 2023-06-25 03:23:21,763 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:23:54,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1895916.0, ans=0.125 2023-06-25 03:24:12,499 INFO [train.py:996] (0/4) Epoch 11, batch 11050, loss[loss=0.2037, simple_loss=0.2648, pruned_loss=0.07128, over 21577.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2968, pruned_loss=0.07584, over 4272365.49 frames. ], batch size: 247, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:24:17,756 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-316000.pt 2023-06-25 03:24:23,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895976.0, ans=0.1 2023-06-25 03:24:24,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1895976.0, ans=0.125 2023-06-25 03:24:34,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. 
limit=15.0 2023-06-25 03:24:46,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1896036.0, ans=0.1 2023-06-25 03:24:57,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.157e+02 7.118e+02 9.875e+02 1.339e+03 2.675e+03, threshold=1.975e+03, percent-clipped=6.0 2023-06-25 03:25:07,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1896096.0, ans=0.125 2023-06-25 03:25:54,935 INFO [train.py:996] (0/4) Epoch 11, batch 11100, loss[loss=0.2669, simple_loss=0.3462, pruned_loss=0.09376, over 21429.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2949, pruned_loss=0.07588, over 4265649.53 frames. ], batch size: 471, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:26:14,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1896276.0, ans=0.125 2023-06-25 03:26:30,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1896336.0, ans=0.0 2023-06-25 03:26:44,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1896396.0, ans=0.125 2023-06-25 03:27:41,694 INFO [train.py:996] (0/4) Epoch 11, batch 11150, loss[loss=0.2425, simple_loss=0.3414, pruned_loss=0.07181, over 21618.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2939, pruned_loss=0.07596, over 4270461.99 frames. ], batch size: 414, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:28:11,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1896636.0, ans=0.04949747468305833 2023-06-25 03:28:31,509 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 7.186e+02 9.135e+02 1.372e+03 3.865e+03, threshold=1.827e+03, percent-clipped=12.0 2023-06-25 03:28:34,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-25 03:29:13,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1896816.0, ans=0.125 2023-06-25 03:29:19,912 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:29:31,024 INFO [train.py:996] (0/4) Epoch 11, batch 11200, loss[loss=0.2153, simple_loss=0.2819, pruned_loss=0.07441, over 21171.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2926, pruned_loss=0.07565, over 4264683.17 frames. ], batch size: 159, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 03:30:11,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1896936.0, ans=0.0 2023-06-25 03:31:19,603 INFO [train.py:996] (0/4) Epoch 11, batch 11250, loss[loss=0.2141, simple_loss=0.2979, pruned_loss=0.06512, over 21734.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2934, pruned_loss=0.07651, over 4268862.55 frames. 
], batch size: 391, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 03:31:53,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1897236.0, ans=0.1 2023-06-25 03:32:07,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.121e+02 7.485e+02 1.049e+03 1.491e+03 3.670e+03, threshold=2.098e+03, percent-clipped=11.0 2023-06-25 03:32:16,181 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-25 03:32:18,994 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:33:07,313 INFO [train.py:996] (0/4) Epoch 11, batch 11300, loss[loss=0.2306, simple_loss=0.3009, pruned_loss=0.08016, over 21637.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2933, pruned_loss=0.07603, over 4265777.54 frames. ], batch size: 263, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 03:33:51,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1897596.0, ans=0.0 2023-06-25 03:33:54,135 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=22.5 2023-06-25 03:34:21,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1897656.0, ans=0.125 2023-06-25 03:34:49,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1897716.0, ans=0.125 2023-06-25 03:34:54,365 INFO [train.py:996] (0/4) Epoch 11, batch 11350, loss[loss=0.1963, simple_loss=0.277, pruned_loss=0.05783, over 21244.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2944, pruned_loss=0.07487, over 4268712.22 frames. ], batch size: 159, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:35:10,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1897776.0, ans=0.1 2023-06-25 03:35:13,306 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.58 vs. limit=15.0 2023-06-25 03:35:47,361 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.585e+02 7.865e+02 1.156e+03 1.769e+03 3.739e+03, threshold=2.312e+03, percent-clipped=14.0 2023-06-25 03:36:09,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1897956.0, ans=0.05 2023-06-25 03:36:12,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1897956.0, ans=15.0 2023-06-25 03:36:33,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-25 03:36:51,925 INFO [train.py:996] (0/4) Epoch 11, batch 11400, loss[loss=0.2368, simple_loss=0.3232, pruned_loss=0.07518, over 21725.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3012, pruned_loss=0.07711, over 4264904.27 frames. ], batch size: 332, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:37:28,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. 
limit=15.0 2023-06-25 03:37:56,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1898256.0, ans=0.0 2023-06-25 03:38:30,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1898316.0, ans=0.0 2023-06-25 03:38:39,477 INFO [train.py:996] (0/4) Epoch 11, batch 11450, loss[loss=0.2507, simple_loss=0.3247, pruned_loss=0.08838, over 21740.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3029, pruned_loss=0.07572, over 4275665.75 frames. ], batch size: 332, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:38:54,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1898376.0, ans=0.2 2023-06-25 03:39:33,549 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.188e+02 7.976e+02 1.094e+03 1.671e+03 3.367e+03, threshold=2.188e+03, percent-clipped=9.0 2023-06-25 03:40:08,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1898616.0, ans=0.125 2023-06-25 03:40:16,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1898616.0, ans=0.0 2023-06-25 03:40:17,677 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.57 vs. limit=10.0 2023-06-25 03:40:29,734 INFO [train.py:996] (0/4) Epoch 11, batch 11500, loss[loss=0.2578, simple_loss=0.3508, pruned_loss=0.08239, over 21668.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3083, pruned_loss=0.07811, over 4280218.65 frames. ], batch size: 441, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:41:39,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1898856.0, ans=0.0 2023-06-25 03:41:55,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1898856.0, ans=0.125 2023-06-25 03:42:25,758 INFO [train.py:996] (0/4) Epoch 11, batch 11550, loss[loss=0.2236, simple_loss=0.3201, pruned_loss=0.0636, over 21697.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3151, pruned_loss=0.07857, over 4281838.84 frames. ], batch size: 247, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:43:21,054 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.334e+02 7.903e+02 1.066e+03 1.850e+03 4.952e+03, threshold=2.132e+03, percent-clipped=19.0 2023-06-25 03:43:35,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1899156.0, ans=0.07 2023-06-25 03:43:51,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1899156.0, ans=0.125 2023-06-25 03:43:55,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.09 vs. limit=12.0 2023-06-25 03:44:08,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1899216.0, ans=0.125 2023-06-25 03:44:16,546 INFO [train.py:996] (0/4) Epoch 11, batch 11600, loss[loss=0.2556, simple_loss=0.3543, pruned_loss=0.07848, over 21673.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3295, pruned_loss=0.0806, over 4282455.11 frames. 
], batch size: 298, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:44:29,161 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:44:36,547 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.70 vs. limit=10.0 2023-06-25 03:45:42,178 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. limit=10.0 2023-06-25 03:45:43,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1899456.0, ans=0.0 2023-06-25 03:46:03,271 INFO [train.py:996] (0/4) Epoch 11, batch 11650, loss[loss=0.248, simple_loss=0.3201, pruned_loss=0.08798, over 21623.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3366, pruned_loss=0.08228, over 4282643.18 frames. ], batch size: 298, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:46:15,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-25 03:46:28,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1899576.0, ans=0.125 2023-06-25 03:47:01,646 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.300e+02 9.276e+02 1.301e+03 2.293e+03 3.963e+03, threshold=2.603e+03, percent-clipped=26.0 2023-06-25 03:47:15,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1899756.0, ans=0.1 2023-06-25 03:47:37,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1899816.0, ans=0.025 2023-06-25 03:47:55,814 INFO [train.py:996] (0/4) Epoch 11, batch 11700, loss[loss=0.2103, simple_loss=0.2744, pruned_loss=0.0731, over 21860.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3278, pruned_loss=0.08134, over 4279510.87 frames. ], batch size: 107, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:48:17,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1899936.0, ans=0.125 2023-06-25 03:49:19,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1900116.0, ans=0.07 2023-06-25 03:49:26,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1900116.0, ans=0.125 2023-06-25 03:49:42,937 INFO [train.py:996] (0/4) Epoch 11, batch 11750, loss[loss=0.2246, simple_loss=0.2916, pruned_loss=0.07875, over 21673.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3177, pruned_loss=0.0807, over 4278668.52 frames. 
], batch size: 247, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:49:55,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1900176.0, ans=0.125 2023-06-25 03:50:11,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1900236.0, ans=0.125 2023-06-25 03:50:36,024 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.631e+02 8.041e+02 1.029e+03 1.302e+03 3.025e+03, threshold=2.058e+03, percent-clipped=2.0 2023-06-25 03:51:31,757 INFO [train.py:996] (0/4) Epoch 11, batch 11800, loss[loss=0.2458, simple_loss=0.3096, pruned_loss=0.09097, over 21287.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3175, pruned_loss=0.0818, over 4276078.28 frames. ], batch size: 176, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:52:35,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.50 vs. limit=15.0 2023-06-25 03:52:41,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1900656.0, ans=0.125 2023-06-25 03:53:19,768 INFO [train.py:996] (0/4) Epoch 11, batch 11850, loss[loss=0.2256, simple_loss=0.3057, pruned_loss=0.07275, over 21323.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3188, pruned_loss=0.0806, over 4281387.29 frames. ], batch size: 176, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:54:16,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1900896.0, ans=0.125 2023-06-25 03:54:17,591 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.516e+02 7.061e+02 9.969e+02 1.583e+03 3.889e+03, threshold=1.994e+03, percent-clipped=10.0 2023-06-25 03:54:58,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1901016.0, ans=0.125 2023-06-25 03:55:06,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1901016.0, ans=0.0 2023-06-25 03:55:15,599 INFO [train.py:996] (0/4) Epoch 11, batch 11900, loss[loss=0.2574, simple_loss=0.3438, pruned_loss=0.08553, over 21440.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3201, pruned_loss=0.07866, over 4279235.50 frames. ], batch size: 507, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:55:43,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-06-25 03:55:55,029 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-25 03:56:01,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1901196.0, ans=0.125 2023-06-25 03:57:11,038 INFO [train.py:996] (0/4) Epoch 11, batch 11950, loss[loss=0.1811, simple_loss=0.2631, pruned_loss=0.04958, over 21276.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3219, pruned_loss=0.07647, over 4271056.47 frames. 
], batch size: 176, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:57:36,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1901436.0, ans=0.125 2023-06-25 03:57:38,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1901436.0, ans=0.125 2023-06-25 03:57:56,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.868e+02 8.366e+02 1.305e+03 1.891e+03 4.761e+03, threshold=2.610e+03, percent-clipped=24.0 2023-06-25 03:57:56,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1901496.0, ans=0.1 2023-06-25 03:58:43,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1901616.0, ans=0.0 2023-06-25 03:58:52,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-25 03:58:53,140 INFO [train.py:996] (0/4) Epoch 11, batch 12000, loss[loss=0.2176, simple_loss=0.2821, pruned_loss=0.07654, over 15899.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3187, pruned_loss=0.07532, over 4263081.61 frames. ], batch size: 60, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 03:58:53,142 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 03:59:11,384 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2587, simple_loss=0.3514, pruned_loss=0.08303, over 1796401.00 frames. 2023-06-25 03:59:11,385 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 04:00:00,978 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=12.0 2023-06-25 04:00:03,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1901796.0, ans=0.125 2023-06-25 04:00:06,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1901796.0, ans=0.0 2023-06-25 04:00:11,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1901856.0, ans=0.125 2023-06-25 04:00:33,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1901916.0, ans=0.125 2023-06-25 04:00:42,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1901916.0, ans=0.025 2023-06-25 04:00:50,836 INFO [train.py:996] (0/4) Epoch 11, batch 12050, loss[loss=0.2326, simple_loss=0.2964, pruned_loss=0.08439, over 21564.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3134, pruned_loss=0.07656, over 4269285.50 frames. ], batch size: 212, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 04:01:17,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.75 vs. limit=22.5 2023-06-25 04:01:39,041 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.41 vs. 
limit=15.0 2023-06-25 04:01:44,088 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.395e+02 7.721e+02 1.099e+03 1.708e+03 2.830e+03, threshold=2.199e+03, percent-clipped=2.0 2023-06-25 04:01:48,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1902096.0, ans=0.125 2023-06-25 04:01:54,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1902156.0, ans=0.125 2023-06-25 04:02:23,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1902216.0, ans=0.0 2023-06-25 04:02:41,837 INFO [train.py:996] (0/4) Epoch 11, batch 12100, loss[loss=0.301, simple_loss=0.3637, pruned_loss=0.1191, over 21440.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3159, pruned_loss=0.08063, over 4276255.82 frames. ], batch size: 471, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 04:02:54,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1902276.0, ans=0.0 2023-06-25 04:02:57,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-25 04:03:27,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1902396.0, ans=0.09899494936611666 2023-06-25 04:03:47,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1902456.0, ans=0.125 2023-06-25 04:03:47,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1902456.0, ans=0.2 2023-06-25 04:04:25,683 INFO [train.py:996] (0/4) Epoch 11, batch 12150, loss[loss=0.2216, simple_loss=0.3215, pruned_loss=0.06082, over 21719.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.317, pruned_loss=0.07921, over 4267632.37 frames. ], batch size: 298, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 04:05:14,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1902636.0, ans=0.125 2023-06-25 04:05:28,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.672e+02 1.025e+03 1.712e+03 2.364e+03 4.484e+03, threshold=3.424e+03, percent-clipped=30.0 2023-06-25 04:06:03,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1902816.0, ans=0.0 2023-06-25 04:06:12,850 INFO [train.py:996] (0/4) Epoch 11, batch 12200, loss[loss=0.2031, simple_loss=0.2767, pruned_loss=0.06472, over 21590.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3159, pruned_loss=0.07803, over 4264374.08 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:06:49,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=12.0 2023-06-25 04:07:07,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1902996.0, ans=0.1 2023-06-25 04:07:58,641 INFO [train.py:996] (0/4) Epoch 11, batch 12250, loss[loss=0.1693, simple_loss=0.2378, pruned_loss=0.05043, over 16372.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3063, pruned_loss=0.07457, over 4255457.40 frames. 
], batch size: 61, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:08:59,371 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.457e+02 7.371e+02 1.190e+03 1.577e+03 4.141e+03, threshold=2.380e+03, percent-clipped=2.0 2023-06-25 04:09:13,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1903356.0, ans=0.125 2023-06-25 04:09:44,662 INFO [train.py:996] (0/4) Epoch 11, batch 12300, loss[loss=0.2015, simple_loss=0.2934, pruned_loss=0.05482, over 21765.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.299, pruned_loss=0.06943, over 4251040.44 frames. ], batch size: 298, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:09:46,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1903476.0, ans=0.125 2023-06-25 04:09:47,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1903476.0, ans=0.1 2023-06-25 04:10:27,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1903536.0, ans=0.0 2023-06-25 04:10:49,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1903596.0, ans=0.0 2023-06-25 04:11:03,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1903656.0, ans=0.125 2023-06-25 04:11:30,629 INFO [train.py:996] (0/4) Epoch 11, batch 12350, loss[loss=0.2793, simple_loss=0.3497, pruned_loss=0.1045, over 21776.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3037, pruned_loss=0.07049, over 4251097.35 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:11:51,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1903836.0, ans=0.0 2023-06-25 04:12:31,426 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.609e+02 8.670e+02 1.217e+03 1.964e+03 4.834e+03, threshold=2.433e+03, percent-clipped=16.0 2023-06-25 04:12:55,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1903956.0, ans=0.125 2023-06-25 04:13:16,324 INFO [train.py:996] (0/4) Epoch 11, batch 12400, loss[loss=0.2543, simple_loss=0.3235, pruned_loss=0.09257, over 21736.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3064, pruned_loss=0.07464, over 4260949.15 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:13:19,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1904076.0, ans=0.125 2023-06-25 04:13:45,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-25 04:14:20,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. 
limit=15.0 2023-06-25 04:14:41,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1904256.0, ans=0.2 2023-06-25 04:14:56,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1904316.0, ans=0.2 2023-06-25 04:15:07,779 INFO [train.py:996] (0/4) Epoch 11, batch 12450, loss[loss=0.275, simple_loss=0.3547, pruned_loss=0.09765, over 21831.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3103, pruned_loss=0.07762, over 4275286.31 frames. ], batch size: 124, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:15:49,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1904436.0, ans=0.1 2023-06-25 04:15:50,736 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.34 vs. limit=15.0 2023-06-25 04:15:56,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1904496.0, ans=0.0 2023-06-25 04:16:10,967 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.996e+02 6.913e+02 8.483e+02 1.165e+03 2.704e+03, threshold=1.697e+03, percent-clipped=3.0 2023-06-25 04:16:28,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1904556.0, ans=0.1 2023-06-25 04:17:03,402 INFO [train.py:996] (0/4) Epoch 11, batch 12500, loss[loss=0.2995, simple_loss=0.3872, pruned_loss=0.1059, over 21712.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3213, pruned_loss=0.08098, over 4280866.11 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:17:27,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1904736.0, ans=0.0 2023-06-25 04:17:59,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=22.5 2023-06-25 04:18:05,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1904796.0, ans=0.125 2023-06-25 04:18:48,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1904916.0, ans=0.0 2023-06-25 04:19:02,302 INFO [train.py:996] (0/4) Epoch 11, batch 12550, loss[loss=0.1848, simple_loss=0.2999, pruned_loss=0.03485, over 20736.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3241, pruned_loss=0.08283, over 4277291.75 frames. ], batch size: 608, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:19:43,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1905096.0, ans=0.0 2023-06-25 04:19:44,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1905096.0, ans=0.1 2023-06-25 04:20:07,424 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.343e+02 7.503e+02 1.080e+03 1.641e+03 3.839e+03, threshold=2.159e+03, percent-clipped=20.0 2023-06-25 04:20:14,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1905156.0, ans=0.0 2023-06-25 04:20:43,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. 
limit=15.0 2023-06-25 04:20:52,629 INFO [train.py:996] (0/4) Epoch 11, batch 12600, loss[loss=0.1988, simple_loss=0.2868, pruned_loss=0.05537, over 21427.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3228, pruned_loss=0.08067, over 4281160.57 frames. ], batch size: 194, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:20:53,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1905276.0, ans=0.125 2023-06-25 04:21:07,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1905276.0, ans=0.125 2023-06-25 04:21:45,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1905396.0, ans=0.0 2023-06-25 04:22:33,080 INFO [train.py:996] (0/4) Epoch 11, batch 12650, loss[loss=0.2266, simple_loss=0.3396, pruned_loss=0.05678, over 19825.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3153, pruned_loss=0.0767, over 4283042.55 frames. ], batch size: 702, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:22:33,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1905576.0, ans=0.1 2023-06-25 04:22:45,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1905576.0, ans=0.125 2023-06-25 04:23:14,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1905696.0, ans=0.025 2023-06-25 04:23:16,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1905696.0, ans=0.125 2023-06-25 04:23:18,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-25 04:23:35,014 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-25 04:23:37,030 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.529e+02 6.454e+02 1.042e+03 1.689e+03 3.142e+03, threshold=2.085e+03, percent-clipped=12.0 2023-06-25 04:24:23,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1905816.0, ans=0.2 2023-06-25 04:24:28,109 INFO [train.py:996] (0/4) Epoch 11, batch 12700, loss[loss=0.2516, simple_loss=0.321, pruned_loss=0.09108, over 21632.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3147, pruned_loss=0.07926, over 4288967.29 frames. ], batch size: 415, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:24:45,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-25 04:24:54,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1905936.0, ans=0.2 2023-06-25 04:25:43,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1906056.0, ans=0.125 2023-06-25 04:26:13,882 INFO [train.py:996] (0/4) Epoch 11, batch 12750, loss[loss=0.1957, simple_loss=0.2904, pruned_loss=0.05046, over 21812.00 frames. 
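The frequent [scaling.py:182] lines report ScheduledFloat values: named hyper-parameters inside the model (dropout probabilities, skip rates, bypass scale_min values) whose current value "ans" is looked up from the global batch_count. The flat values seen at batch_count around 1.9 million here (0.1 for the dropouts, 0.0 for the skip rates, 0.2 for scale_min) are what a schedule looks like after its last breakpoint. A small stand-in with made-up breakpoints, not the real schedules from the model code:

    class ScheduledFloatSketch:
        # Piecewise-linear value of a hyper-parameter as a function of the
        # global batch count, held constant before the first and after the
        # last breakpoint.
        def __init__(self, *points):                 # points: (batch_count, value)
            self.points = sorted(points)

        def value_at(self, batch_count):
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)

    dropout_p = ScheduledFloatSketch((0, 0.3), (20000, 0.1))     # invented schedule
    print(dropout_p.value_at(0), dropout_p.value_at(1905576.0))  # 0.3 ... 0.1
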
], tot_loss[loss=0.2353, simple_loss=0.3136, pruned_loss=0.07852, over 4287591.42 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:26:15,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=22.5 2023-06-25 04:27:05,844 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=22.5 2023-06-25 04:27:09,561 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.849e+02 1.051e+03 1.343e+03 1.949e+03 4.528e+03, threshold=2.685e+03, percent-clipped=20.0 2023-06-25 04:27:57,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1906416.0, ans=0.125 2023-06-25 04:28:00,579 INFO [train.py:996] (0/4) Epoch 11, batch 12800, loss[loss=0.2449, simple_loss=0.3197, pruned_loss=0.08505, over 21675.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3142, pruned_loss=0.07973, over 4287630.15 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:29:30,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1906716.0, ans=0.04949747468305833 2023-06-25 04:29:50,505 INFO [train.py:996] (0/4) Epoch 11, batch 12850, loss[loss=0.2615, simple_loss=0.3323, pruned_loss=0.0954, over 21302.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.317, pruned_loss=0.08126, over 4284047.22 frames. ], batch size: 143, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:30:53,504 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.548e+02 7.415e+02 1.066e+03 1.369e+03 3.330e+03, threshold=2.132e+03, percent-clipped=6.0 2023-06-25 04:31:19,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1907016.0, ans=0.125 2023-06-25 04:31:43,334 INFO [train.py:996] (0/4) Epoch 11, batch 12900, loss[loss=0.3358, simple_loss=0.3955, pruned_loss=0.1381, over 21436.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3156, pruned_loss=0.07828, over 4284785.66 frames. ], batch size: 507, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:32:49,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1907256.0, ans=0.0 2023-06-25 04:32:51,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1907256.0, ans=0.125 2023-06-25 04:32:55,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1907256.0, ans=0.2 2023-06-25 04:33:33,465 INFO [train.py:996] (0/4) Epoch 11, batch 12950, loss[loss=0.2525, simple_loss=0.3319, pruned_loss=0.08651, over 21419.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3148, pruned_loss=0.07734, over 4279631.83 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:34:05,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. 
limit=22.5 2023-06-25 04:34:13,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1907496.0, ans=0.0 2023-06-25 04:34:27,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1907496.0, ans=10.0 2023-06-25 04:34:31,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 6.927e+02 1.132e+03 1.522e+03 3.743e+03, threshold=2.263e+03, percent-clipped=8.0 2023-06-25 04:35:21,571 INFO [train.py:996] (0/4) Epoch 11, batch 13000, loss[loss=0.1475, simple_loss=0.21, pruned_loss=0.04252, over 21801.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3134, pruned_loss=0.07736, over 4281552.04 frames. ], batch size: 98, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:35:58,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-06-25 04:35:59,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1907796.0, ans=0.125 2023-06-25 04:37:07,509 INFO [train.py:996] (0/4) Epoch 11, batch 13050, loss[loss=0.2599, simple_loss=0.3237, pruned_loss=0.0981, over 21529.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3082, pruned_loss=0.07513, over 4287803.13 frames. ], batch size: 212, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:37:45,125 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-25 04:38:05,022 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.554e+02 7.302e+02 9.567e+02 1.329e+03 2.389e+03, threshold=1.913e+03, percent-clipped=1.0 2023-06-25 04:38:49,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1908216.0, ans=0.09899494936611666 2023-06-25 04:38:55,923 INFO [train.py:996] (0/4) Epoch 11, batch 13100, loss[loss=0.1876, simple_loss=0.2898, pruned_loss=0.04269, over 21789.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3094, pruned_loss=0.07476, over 4294853.61 frames. ], batch size: 332, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:39:32,227 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:39:47,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1908396.0, ans=0.1 2023-06-25 04:40:45,512 INFO [train.py:996] (0/4) Epoch 11, batch 13150, loss[loss=0.2103, simple_loss=0.2971, pruned_loss=0.06172, over 21493.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3133, pruned_loss=0.07754, over 4294350.02 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:41:10,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1908576.0, ans=0.2 2023-06-25 04:41:35,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.32 vs. 
limit=12.0 2023-06-25 04:41:55,013 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.487e+02 7.569e+02 1.234e+03 1.722e+03 3.917e+03, threshold=2.467e+03, percent-clipped=21.0 2023-06-25 04:42:13,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1908756.0, ans=0.125 2023-06-25 04:42:14,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1908756.0, ans=0.1 2023-06-25 04:42:19,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1908816.0, ans=0.0 2023-06-25 04:42:46,140 INFO [train.py:996] (0/4) Epoch 11, batch 13200, loss[loss=0.263, simple_loss=0.3358, pruned_loss=0.09504, over 21897.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3125, pruned_loss=0.07648, over 4291641.10 frames. ], batch size: 372, lr: 2.66e-03, grad_scale: 32.0 2023-06-25 04:42:51,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1908876.0, ans=0.1 2023-06-25 04:43:14,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1908936.0, ans=0.0 2023-06-25 04:44:34,084 INFO [train.py:996] (0/4) Epoch 11, batch 13250, loss[loss=0.2788, simple_loss=0.3676, pruned_loss=0.09505, over 21523.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3127, pruned_loss=0.07896, over 4294044.38 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:44:54,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1909236.0, ans=0.1 2023-06-25 04:45:32,779 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 1.027e+03 1.488e+03 2.200e+03 4.599e+03, threshold=2.975e+03, percent-clipped=16.0 2023-06-25 04:45:56,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1909356.0, ans=0.125 2023-06-25 04:46:21,099 INFO [train.py:996] (0/4) Epoch 11, batch 13300, loss[loss=0.2749, simple_loss=0.3581, pruned_loss=0.09589, over 21688.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3138, pruned_loss=0.07898, over 4293861.14 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:46:27,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1909476.0, ans=0.04949747468305833 2023-06-25 04:47:55,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1909716.0, ans=0.0 2023-06-25 04:48:09,273 INFO [train.py:996] (0/4) Epoch 11, batch 13350, loss[loss=0.2487, simple_loss=0.3371, pruned_loss=0.08018, over 21712.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3179, pruned_loss=0.08159, over 4291499.88 frames. 
], batch size: 298, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:48:39,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1909836.0, ans=0.0 2023-06-25 04:49:08,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.321e+02 8.273e+02 1.155e+03 1.760e+03 3.459e+03, threshold=2.310e+03, percent-clipped=3.0 2023-06-25 04:49:10,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1909956.0, ans=10.0 2023-06-25 04:49:10,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1909956.0, ans=0.0 2023-06-25 04:49:50,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1910076.0, ans=0.1 2023-06-25 04:49:52,155 INFO [train.py:996] (0/4) Epoch 11, batch 13400, loss[loss=0.2366, simple_loss=0.3046, pruned_loss=0.08429, over 21832.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3193, pruned_loss=0.0832, over 4295225.01 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:50:09,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=12.0 2023-06-25 04:50:28,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-25 04:50:32,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1910136.0, ans=0.0 2023-06-25 04:50:58,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1910256.0, ans=0.2 2023-06-25 04:51:39,505 INFO [train.py:996] (0/4) Epoch 11, batch 13450, loss[loss=0.2153, simple_loss=0.309, pruned_loss=0.06084, over 20743.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3214, pruned_loss=0.08521, over 4294500.38 frames. ], batch size: 607, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:51:44,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1910376.0, ans=0.2 2023-06-25 04:52:04,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.93 vs. 
limit=12.0 2023-06-25 04:52:10,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1910436.0, ans=0.0 2023-06-25 04:52:17,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1910496.0, ans=0.125 2023-06-25 04:52:29,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1910496.0, ans=0.2 2023-06-25 04:52:36,804 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.628e+02 8.174e+02 1.187e+03 1.780e+03 3.541e+03, threshold=2.373e+03, percent-clipped=13.0 2023-06-25 04:53:16,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1910616.0, ans=0.5 2023-06-25 04:53:16,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1910616.0, ans=0.0 2023-06-25 04:53:26,288 INFO [train.py:996] (0/4) Epoch 11, batch 13500, loss[loss=0.1907, simple_loss=0.2585, pruned_loss=0.06146, over 21437.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3131, pruned_loss=0.08227, over 4281702.31 frames. ], batch size: 211, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:53:31,123 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.27 vs. limit=10.0 2023-06-25 04:54:29,784 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-25 04:55:11,709 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-25 04:55:13,742 INFO [train.py:996] (0/4) Epoch 11, batch 13550, loss[loss=0.2454, simple_loss=0.338, pruned_loss=0.07643, over 21349.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3171, pruned_loss=0.08166, over 4281370.80 frames. ], batch size: 194, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:55:28,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1910976.0, ans=0.0 2023-06-25 04:55:51,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1911036.0, ans=0.1 2023-06-25 04:56:02,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-25 04:56:03,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1911096.0, ans=0.0 2023-06-25 04:56:11,389 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.050e+02 7.777e+02 1.227e+03 1.710e+03 3.921e+03, threshold=2.454e+03, percent-clipped=8.0 2023-06-25 04:56:36,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-25 04:56:38,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-25 04:57:01,041 INFO [train.py:996] (0/4) Epoch 11, batch 13600, loss[loss=0.2166, simple_loss=0.2885, pruned_loss=0.07235, over 21417.00 frames. 
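The [scaling.py:962] Whitening lines compare a per-module activation statistic ("metric") against a module-specific limit; values well above the limit indicate that the activations of that block are far from decorrelated, equal-variance channels. The exact formula is internal to scaling.py, but a simple proxy with the same flavour, equal to 1 for perfectly white activations and growing as the energy concentrates in a few directions, is the spread of the channel-covariance eigenvalues:

    import torch

    def whiteness_proxy(x):
        # x: (num_frames, num_channels) activations.  Ratio of the mean squared
        # eigenvalue to the squared mean eigenvalue of the channel covariance:
        # 1.0 when all eigenvalues are equal (white), larger otherwise.
        x = x - x.mean(dim=0, keepdim=True)
        cov = x.T @ x / x.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        return (eigs.pow(2).mean() / eigs.mean().pow(2)).item()

    x = torch.randn(2000, 256)                                    # roughly white
    print(whiteness_proxy(x))                                     # close to 1
    print(whiteness_proxy(x * torch.linspace(0.1, 3.0, 256)))     # clearly larger
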
], tot_loss[loss=0.2421, simple_loss=0.3183, pruned_loss=0.08294, over 4283172.01 frames. ], batch size: 159, lr: 2.66e-03, grad_scale: 32.0 2023-06-25 04:58:16,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.78 vs. limit=5.0 2023-06-25 04:58:42,947 INFO [train.py:996] (0/4) Epoch 11, batch 13650, loss[loss=0.2201, simple_loss=0.2865, pruned_loss=0.07683, over 21701.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3129, pruned_loss=0.07999, over 4281668.24 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:59:38,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1911696.0, ans=0.015 2023-06-25 04:59:48,574 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.419e+02 6.510e+02 1.024e+03 1.563e+03 2.533e+03, threshold=2.048e+03, percent-clipped=2.0 2023-06-25 05:00:09,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1911756.0, ans=0.125 2023-06-25 05:00:12,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1911816.0, ans=0.0 2023-06-25 05:00:35,840 INFO [train.py:996] (0/4) Epoch 11, batch 13700, loss[loss=0.3025, simple_loss=0.3701, pruned_loss=0.1175, over 21479.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3083, pruned_loss=0.07926, over 4271304.03 frames. ], batch size: 508, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:00:40,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.22 vs. limit=8.0 2023-06-25 05:00:50,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1911876.0, ans=0.0 2023-06-25 05:01:27,878 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-25 05:02:30,054 INFO [train.py:996] (0/4) Epoch 11, batch 13750, loss[loss=0.2195, simple_loss=0.2978, pruned_loss=0.07059, over 21668.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.306, pruned_loss=0.07806, over 4262903.39 frames. ], batch size: 298, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:02:39,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1912176.0, ans=0.0 2023-06-25 05:02:48,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1912236.0, ans=0.0 2023-06-25 05:02:49,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1912236.0, ans=0.1 2023-06-25 05:03:33,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.440e+02 9.228e+02 1.294e+03 2.214e+03 4.699e+03, threshold=2.588e+03, percent-clipped=28.0 2023-06-25 05:04:01,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1912356.0, ans=0.2 2023-06-25 05:04:06,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1912416.0, ans=0.2 2023-06-25 05:04:20,980 INFO [train.py:996] (0/4) Epoch 11, batch 13800, loss[loss=0.2278, simple_loss=0.3138, pruned_loss=0.07084, over 21460.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3112, pruned_loss=0.07712, over 4262575.47 frames. 
], batch size: 194, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:04:42,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1912536.0, ans=0.125 2023-06-25 05:05:04,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-06-25 05:05:26,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1912656.0, ans=0.2 2023-06-25 05:05:44,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1912656.0, ans=0.125 2023-06-25 05:05:57,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1912716.0, ans=0.0 2023-06-25 05:06:07,174 INFO [train.py:996] (0/4) Epoch 11, batch 13850, loss[loss=0.2638, simple_loss=0.3468, pruned_loss=0.09039, over 21763.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3168, pruned_loss=0.07673, over 4265498.37 frames. ], batch size: 332, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:06:09,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1912776.0, ans=0.1 2023-06-25 05:06:13,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1912776.0, ans=0.1 2023-06-25 05:06:46,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.94 vs. limit=6.0 2023-06-25 05:07:13,532 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.731e+02 7.739e+02 1.067e+03 1.553e+03 4.213e+03, threshold=2.133e+03, percent-clipped=6.0 2023-06-25 05:07:26,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1912956.0, ans=0.125 2023-06-25 05:07:44,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1913016.0, ans=0.0 2023-06-25 05:07:52,242 INFO [train.py:996] (0/4) Epoch 11, batch 13900, loss[loss=0.2471, simple_loss=0.3168, pruned_loss=0.08867, over 21853.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3213, pruned_loss=0.08087, over 4273260.95 frames. ], batch size: 371, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:08:03,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1913076.0, ans=0.0 2023-06-25 05:08:22,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1913136.0, ans=0.04949747468305833 2023-06-25 05:09:23,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1913316.0, ans=0.2 2023-06-25 05:09:44,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1913376.0, ans=0.0 2023-06-25 05:09:45,242 INFO [train.py:996] (0/4) Epoch 11, batch 13950, loss[loss=0.2243, simple_loss=0.2985, pruned_loss=0.0751, over 21522.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3221, pruned_loss=0.08217, over 4277746.43 frames. 
], batch size: 194, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:10:44,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1913496.0, ans=0.07 2023-06-25 05:10:50,573 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.156e+02 8.731e+02 1.156e+03 1.746e+03 2.860e+03, threshold=2.312e+03, percent-clipped=13.0 2023-06-25 05:10:52,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1913556.0, ans=0.125 2023-06-25 05:11:29,111 INFO [train.py:996] (0/4) Epoch 11, batch 14000, loss[loss=0.2002, simple_loss=0.2929, pruned_loss=0.05372, over 21709.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3193, pruned_loss=0.08128, over 4274579.02 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:11:34,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1913676.0, ans=0.125 2023-06-25 05:11:54,218 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-25 05:11:55,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1913736.0, ans=0.07 2023-06-25 05:13:16,604 INFO [train.py:996] (0/4) Epoch 11, batch 14050, loss[loss=0.2061, simple_loss=0.2731, pruned_loss=0.06957, over 21308.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3133, pruned_loss=0.07685, over 4284466.39 frames. ], batch size: 144, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:13:17,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1913976.0, ans=0.125 2023-06-25 05:13:28,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-25 05:13:40,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1914036.0, ans=0.125 2023-06-25 05:14:24,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.465e+02 7.745e+02 1.137e+03 1.921e+03 3.840e+03, threshold=2.273e+03, percent-clipped=15.0 2023-06-25 05:14:34,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-25 05:14:53,713 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-25 05:15:04,514 INFO [train.py:996] (0/4) Epoch 11, batch 14100, loss[loss=0.2742, simple_loss=0.339, pruned_loss=0.1047, over 21508.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.308, pruned_loss=0.07738, over 4281962.86 frames. 
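The grad_scale field in the batch lines (8.0, then 16.0, and 32.0 elsewhere in this region, occasionally dropping back) is the dynamic loss-scaling factor for the fp16 training pass. Its behaviour is consistent with PyTorch's GradScaler, which multiplies the scale up after a long run of overflow-free steps and halves it when a step produces inf/nan gradients. A toy loop showing where that number comes from; the model, optimizer and init_scale are placeholders, and it assumes a CUDA device:

    import torch

    model = torch.nn.Linear(10, 10).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler(init_scale=8.0, growth_interval=2000)

    for step in range(3):
        opt.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(torch.randn(4, 10, device="cuda")).pow(2).mean()
        scaler.scale(loss).backward()      # backward on the scaled loss
        scaler.step(opt)                   # unscales grads, skips the step on overflow
        scaler.update()                    # grow or back off the scale
        print(step, scaler.get_scale())    # the value logged as grad_scale
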
], batch size: 131, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:15:19,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1914276.0, ans=0.1 2023-06-25 05:15:20,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1914276.0, ans=0.0 2023-06-25 05:15:27,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1914336.0, ans=0.125 2023-06-25 05:15:32,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1914336.0, ans=0.05 2023-06-25 05:15:52,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1914396.0, ans=0.125 2023-06-25 05:15:57,247 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:16:26,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1914456.0, ans=0.1 2023-06-25 05:16:46,921 INFO [train.py:996] (0/4) Epoch 11, batch 14150, loss[loss=0.2278, simple_loss=0.3115, pruned_loss=0.07203, over 21299.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.311, pruned_loss=0.07854, over 4278059.85 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:16:57,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1914576.0, ans=0.2 2023-06-25 05:17:40,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1914696.0, ans=0.1 2023-06-25 05:17:51,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.783e+02 7.327e+02 9.497e+02 1.308e+03 3.394e+03, threshold=1.899e+03, percent-clipped=3.0 2023-06-25 05:18:29,910 INFO [train.py:996] (0/4) Epoch 11, batch 14200, loss[loss=0.2035, simple_loss=0.285, pruned_loss=0.06097, over 21328.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3094, pruned_loss=0.07685, over 4280293.24 frames. ], batch size: 159, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:18:47,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1914876.0, ans=0.1 2023-06-25 05:19:03,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1914936.0, ans=0.1 2023-06-25 05:19:05,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1914936.0, ans=22.5 2023-06-25 05:19:17,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1914996.0, ans=0.2 2023-06-25 05:19:39,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1915056.0, ans=0.0 2023-06-25 05:19:50,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.94 vs. 
limit=15.0 2023-06-25 05:20:08,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1915116.0, ans=0.0 2023-06-25 05:20:14,325 INFO [train.py:996] (0/4) Epoch 11, batch 14250, loss[loss=0.2095, simple_loss=0.2868, pruned_loss=0.06608, over 21706.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3039, pruned_loss=0.07655, over 4271619.20 frames. ], batch size: 333, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:20:56,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1915236.0, ans=0.07 2023-06-25 05:21:10,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1915296.0, ans=0.1 2023-06-25 05:21:24,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.492e+02 7.052e+02 9.633e+02 1.519e+03 2.693e+03, threshold=1.927e+03, percent-clipped=14.0 2023-06-25 05:21:52,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-25 05:22:03,011 INFO [train.py:996] (0/4) Epoch 11, batch 14300, loss[loss=0.2119, simple_loss=0.2873, pruned_loss=0.06828, over 21265.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3064, pruned_loss=0.07551, over 4256891.86 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:22:04,339 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-25 05:23:49,092 INFO [train.py:996] (0/4) Epoch 11, batch 14350, loss[loss=0.2017, simple_loss=0.2825, pruned_loss=0.06046, over 21412.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3104, pruned_loss=0.07638, over 4251448.37 frames. ], batch size: 194, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:24:28,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1915836.0, ans=0.95 2023-06-25 05:24:56,258 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.843e+02 8.086e+02 1.263e+03 2.324e+03 6.942e+03, threshold=2.526e+03, percent-clipped=29.0 2023-06-25 05:25:34,876 INFO [train.py:996] (0/4) Epoch 11, batch 14400, loss[loss=0.1958, simple_loss=0.2681, pruned_loss=0.06172, over 21417.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3107, pruned_loss=0.07772, over 4264432.90 frames. ], batch size: 194, lr: 2.66e-03, grad_scale: 32.0 2023-06-25 05:25:54,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-25 05:26:06,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1916136.0, ans=0.1 2023-06-25 05:26:24,899 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:26:28,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1916196.0, ans=0.2 2023-06-25 05:27:09,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. 
limit=15.0 2023-06-25 05:27:29,163 INFO [train.py:996] (0/4) Epoch 11, batch 14450, loss[loss=0.1932, simple_loss=0.2637, pruned_loss=0.06137, over 21238.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3045, pruned_loss=0.07767, over 4252474.17 frames. ], batch size: 159, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:28:20,669 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:28:30,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.134e+02 8.129e+02 1.231e+03 1.648e+03 3.274e+03, threshold=2.462e+03, percent-clipped=7.0 2023-06-25 05:28:55,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1916616.0, ans=0.2 2023-06-25 05:29:07,876 INFO [train.py:996] (0/4) Epoch 11, batch 14500, loss[loss=0.1974, simple_loss=0.2839, pruned_loss=0.05546, over 21351.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3, pruned_loss=0.07681, over 4244662.79 frames. ], batch size: 131, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:29:46,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1916736.0, ans=0.0 2023-06-25 05:30:16,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1916856.0, ans=0.125 2023-06-25 05:30:35,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1916916.0, ans=0.125 2023-06-25 05:30:37,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1916916.0, ans=0.0 2023-06-25 05:30:42,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1916916.0, ans=0.125 2023-06-25 05:30:44,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1916916.0, ans=0.0 2023-06-25 05:31:01,924 INFO [train.py:996] (0/4) Epoch 11, batch 14550, loss[loss=0.2142, simple_loss=0.2965, pruned_loss=0.06598, over 21273.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3048, pruned_loss=0.078, over 4250471.66 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:32:13,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.818e+02 8.368e+02 1.257e+03 1.782e+03 3.337e+03, threshold=2.514e+03, percent-clipped=4.0 2023-06-25 05:32:56,093 INFO [train.py:996] (0/4) Epoch 11, batch 14600, loss[loss=0.2303, simple_loss=0.3125, pruned_loss=0.07408, over 21691.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3127, pruned_loss=0.08245, over 4259787.23 frames. 
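Throughout these entries the three reported numbers are tied together by one fixed rule: the headline loss equals 0.5 times simple_loss plus pruned_loss, e.g. for the tot_loss just above, 0.5 x 0.3127 + 0.08245 = 0.2388. The two terms are the simple and pruned parts of the pruned-RNN-T objective; only the 0.5 weighting is read directly off the logged numbers, and the parameter names below are descriptive guesses:

    def combined_loss(simple_loss, pruned_loss, simple_scale=0.5, pruned_scale=1.0):
        # Reproduces the relationship visible in the logged totals:
        # loss = 0.5 * simple_loss + 1.0 * pruned_loss.
        return simple_scale * simple_loss + pruned_scale * pruned_loss

    print(round(combined_loss(0.3127, 0.08245), 4))   # 0.2388, matching the log
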
], batch size: 351, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:33:03,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1917276.0, ans=0.0 2023-06-25 05:33:30,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1917336.0, ans=0.0 2023-06-25 05:34:16,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1917516.0, ans=0.125 2023-06-25 05:34:29,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1917516.0, ans=15.0 2023-06-25 05:34:41,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1917516.0, ans=0.0 2023-06-25 05:34:44,123 INFO [train.py:996] (0/4) Epoch 11, batch 14650, loss[loss=0.2102, simple_loss=0.3052, pruned_loss=0.05763, over 21800.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3148, pruned_loss=0.08165, over 4256095.35 frames. ], batch size: 351, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:34:44,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1917576.0, ans=0.0 2023-06-25 05:35:15,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1917636.0, ans=0.0 2023-06-25 05:35:18,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1917636.0, ans=0.2 2023-06-25 05:35:46,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1917756.0, ans=0.125 2023-06-25 05:35:50,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 8.792e+02 1.262e+03 1.854e+03 3.152e+03, threshold=2.525e+03, percent-clipped=6.0 2023-06-25 05:35:54,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1917756.0, ans=0.0 2023-06-25 05:36:32,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1917876.0, ans=0.125 2023-06-25 05:36:33,388 INFO [train.py:996] (0/4) Epoch 11, batch 14700, loss[loss=0.2064, simple_loss=0.292, pruned_loss=0.06039, over 21370.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3093, pruned_loss=0.07686, over 4247839.31 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:36:41,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1917876.0, ans=0.0 2023-06-25 05:37:03,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1917936.0, ans=0.0 2023-06-25 05:37:14,063 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.97 vs. 
limit=15.0 2023-06-25 05:37:20,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1917996.0, ans=0.125 2023-06-25 05:38:07,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1918116.0, ans=0.0 2023-06-25 05:38:22,512 INFO [train.py:996] (0/4) Epoch 11, batch 14750, loss[loss=0.2614, simple_loss=0.3312, pruned_loss=0.09585, over 21595.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3131, pruned_loss=0.07838, over 4246856.27 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:38:55,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1918236.0, ans=0.0 2023-06-25 05:38:55,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1918236.0, ans=0.125 2023-06-25 05:39:21,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1918296.0, ans=0.125 2023-06-25 05:39:42,630 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.909e+02 1.142e+03 1.631e+03 3.263e+03, threshold=2.283e+03, percent-clipped=2.0 2023-06-25 05:39:43,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1918356.0, ans=0.125 2023-06-25 05:39:58,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1918416.0, ans=0.125 2023-06-25 05:40:01,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1918416.0, ans=0.0 2023-06-25 05:40:17,911 INFO [train.py:996] (0/4) Epoch 11, batch 14800, loss[loss=0.2114, simple_loss=0.2894, pruned_loss=0.0667, over 21694.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3236, pruned_loss=0.08274, over 4249959.46 frames. ], batch size: 124, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:40:46,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1918536.0, ans=15.0 2023-06-25 05:40:53,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1918536.0, ans=0.0 2023-06-25 05:42:13,913 INFO [train.py:996] (0/4) Epoch 11, batch 14850, loss[loss=0.2414, simple_loss=0.3131, pruned_loss=0.08481, over 21709.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3185, pruned_loss=0.08264, over 4257634.87 frames. ], batch size: 282, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:42:20,711 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. 
limit=15.0 2023-06-25 05:42:47,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1918836.0, ans=0.125 2023-06-25 05:43:25,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.029e+02 9.409e+02 1.250e+03 2.186e+03 4.588e+03, threshold=2.500e+03, percent-clipped=20.0 2023-06-25 05:43:58,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1919016.0, ans=0.0 2023-06-25 05:43:59,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1919016.0, ans=22.5 2023-06-25 05:44:03,309 INFO [train.py:996] (0/4) Epoch 11, batch 14900, loss[loss=0.3413, simple_loss=0.3931, pruned_loss=0.1448, over 21455.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3193, pruned_loss=0.08364, over 4254675.70 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:44:03,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1919076.0, ans=0.125 2023-06-25 05:44:15,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.82 vs. limit=10.0 2023-06-25 05:44:38,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1919136.0, ans=0.0 2023-06-25 05:44:52,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1919196.0, ans=0.0 2023-06-25 05:45:50,690 INFO [train.py:996] (0/4) Epoch 11, batch 14950, loss[loss=0.2674, simple_loss=0.3463, pruned_loss=0.09424, over 21570.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3206, pruned_loss=0.08319, over 4263095.05 frames. ], batch size: 509, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:46:57,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1919496.0, ans=0.125 2023-06-25 05:47:03,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.098e+02 8.346e+02 1.154e+03 1.605e+03 2.804e+03, threshold=2.309e+03, percent-clipped=2.0 2023-06-25 05:47:10,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1919556.0, ans=0.2 2023-06-25 05:47:37,384 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-06-25 05:47:39,734 INFO [train.py:996] (0/4) Epoch 11, batch 15000, loss[loss=0.2273, simple_loss=0.3004, pruned_loss=0.07714, over 21677.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.323, pruned_loss=0.08532, over 4265679.04 frames. ], batch size: 263, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:47:39,735 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 05:48:02,335 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2537, simple_loss=0.3474, pruned_loss=0.08002, over 1796401.00 frames. 
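The validation entries just above follow a fixed pattern from train.py: training pauses, the dev set is run once without gradients, a frame-weighted average loss is reported over the same 1796401 validation frames each time, and peak GPU memory is printed afterwards. In outline; the model and loader interfaces are placeholders, not the actual icefall signatures:

    import torch

    def compute_validation_loss(model, dev_loader, device):
        # Accumulate frame-weighted losses over the dev set with gradients off,
        # then report the average and the peak GPU memory seen so far.
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                loss, num_frames = model(batch)        # placeholder interface
                tot_loss += loss.item() * num_frames
                tot_frames += num_frames
        model.train()
        max_mem_mb = torch.cuda.max_memory_allocated(device) // 2**20
        return tot_loss / tot_frames, max_mem_mb
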
2023-06-25 05:48:02,336 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 05:48:28,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1919736.0, ans=0.2 2023-06-25 05:48:42,680 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2023-06-25 05:49:09,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1919856.0, ans=0.125 2023-06-25 05:49:09,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1919856.0, ans=0.2 2023-06-25 05:49:50,642 INFO [train.py:996] (0/4) Epoch 11, batch 15050, loss[loss=0.2483, simple_loss=0.3271, pruned_loss=0.08478, over 21364.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3239, pruned_loss=0.08633, over 4260428.64 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:49:55,513 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-320000.pt 2023-06-25 05:50:45,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.77 vs. limit=10.0 2023-06-25 05:50:57,308 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.512e+02 8.828e+02 1.154e+03 1.761e+03 2.876e+03, threshold=2.308e+03, percent-clipped=7.0 2023-06-25 05:51:39,401 INFO [train.py:996] (0/4) Epoch 11, batch 15100, loss[loss=0.3178, simple_loss=0.3813, pruned_loss=0.1272, over 21648.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3265, pruned_loss=0.08625, over 4267620.02 frames. ], batch size: 389, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:51:47,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0 2023-06-25 05:52:12,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1920336.0, ans=0.125 2023-06-25 05:52:31,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1920396.0, ans=0.1 2023-06-25 05:52:47,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1920456.0, ans=0.05 2023-06-25 05:53:13,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1920516.0, ans=10.0 2023-06-25 05:53:29,054 INFO [train.py:996] (0/4) Epoch 11, batch 15150, loss[loss=0.1913, simple_loss=0.2612, pruned_loss=0.06066, over 21337.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3219, pruned_loss=0.08597, over 4267928.82 frames. 
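The [checkpoint.py:75] line above writes zipformer/exp_L_small_causal/checkpoint-320000.pt, i.e. checkpoints are named by the global training batch index and saved into the experiment directory at a fixed batch interval. A minimal version of that pattern; the interval and the exact contents of the saved dict are assumptions, only the directory and file-name pattern come from the log:

    import torch
    from pathlib import Path

    def maybe_save_checkpoint(model, optimizer, batch_idx_train,
                              exp_dir=Path("zipformer/exp_L_small_causal"),
                              every_n=4000):
        # Every `every_n` training batches, write checkpoint-<global_batch_index>.pt.
        if batch_idx_train == 0 or batch_idx_train % every_n != 0:
            return None
        exp_dir.mkdir(parents=True, exist_ok=True)
        path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx_train,
            },
            path,
        )
        return path
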
], batch size: 131, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:53:46,849 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:54:20,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1920696.0, ans=0.125 2023-06-25 05:54:44,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.398e+02 9.134e+02 1.396e+03 2.248e+03 4.445e+03, threshold=2.791e+03, percent-clipped=24.0 2023-06-25 05:54:47,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1920756.0, ans=0.0 2023-06-25 05:55:18,876 INFO [train.py:996] (0/4) Epoch 11, batch 15200, loss[loss=0.1981, simple_loss=0.2835, pruned_loss=0.05641, over 21817.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3133, pruned_loss=0.08168, over 4259711.23 frames. ], batch size: 372, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:55:19,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1920876.0, ans=0.1 2023-06-25 05:55:21,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1920876.0, ans=0.125 2023-06-25 05:55:39,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1920876.0, ans=0.125 2023-06-25 05:56:59,560 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-25 05:57:06,663 INFO [train.py:996] (0/4) Epoch 11, batch 15250, loss[loss=0.2732, simple_loss=0.415, pruned_loss=0.06572, over 19743.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3092, pruned_loss=0.08015, over 4258174.13 frames. ], batch size: 702, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:57:22,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1921176.0, ans=0.0 2023-06-25 05:57:59,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1921296.0, ans=0.1 2023-06-25 05:58:19,442 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.800e+02 7.712e+02 1.026e+03 1.486e+03 3.458e+03, threshold=2.053e+03, percent-clipped=2.0 2023-06-25 05:58:19,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1921356.0, ans=0.0 2023-06-25 05:58:44,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=15.0 2023-06-25 05:58:53,109 INFO [train.py:996] (0/4) Epoch 11, batch 15300, loss[loss=0.25, simple_loss=0.3228, pruned_loss=0.08864, over 21321.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3103, pruned_loss=0.08217, over 4267878.26 frames. ], batch size: 159, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:59:39,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1921596.0, ans=0.0 2023-06-25 06:00:10,194 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.79 vs. 
limit=10.0 2023-06-25 06:00:17,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-25 06:00:48,360 INFO [train.py:996] (0/4) Epoch 11, batch 15350, loss[loss=0.2351, simple_loss=0.3307, pruned_loss=0.06973, over 21837.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3162, pruned_loss=0.08522, over 4273475.45 frames. ], batch size: 316, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:00:50,977 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2023-06-25 06:01:02,339 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.93 vs. limit=6.0 2023-06-25 06:01:29,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=22.5 2023-06-25 06:01:53,594 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 7.974e+02 1.016e+03 1.491e+03 3.012e+03, threshold=2.032e+03, percent-clipped=10.0 2023-06-25 06:02:27,017 INFO [train.py:996] (0/4) Epoch 11, batch 15400, loss[loss=0.1988, simple_loss=0.2831, pruned_loss=0.05721, over 21864.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3165, pruned_loss=0.08351, over 4272575.31 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:03:15,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1922196.0, ans=0.125 2023-06-25 06:03:18,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1922196.0, ans=0.0 2023-06-25 06:03:40,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1922256.0, ans=0.125 2023-06-25 06:03:45,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1922256.0, ans=0.125 2023-06-25 06:03:52,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1922316.0, ans=0.125 2023-06-25 06:04:11,529 INFO [train.py:996] (0/4) Epoch 11, batch 15450, loss[loss=0.2322, simple_loss=0.3353, pruned_loss=0.06457, over 21688.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3154, pruned_loss=0.0828, over 4265294.48 frames. ], batch size: 389, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:04:58,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1922496.0, ans=0.125 2023-06-25 06:05:25,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.669e+02 7.354e+02 9.513e+02 1.338e+03 2.588e+03, threshold=1.903e+03, percent-clipped=5.0 2023-06-25 06:06:04,772 INFO [train.py:996] (0/4) Epoch 11, batch 15500, loss[loss=0.2567, simple_loss=0.3383, pruned_loss=0.0876, over 21818.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3191, pruned_loss=0.08318, over 4272816.20 frames. 
], batch size: 124, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:06:05,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1922676.0, ans=0.125 2023-06-25 06:06:07,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1922676.0, ans=0.0 2023-06-25 06:06:14,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1922676.0, ans=0.125 2023-06-25 06:06:22,702 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-25 06:06:30,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-25 06:06:45,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1922796.0, ans=10.0 2023-06-25 06:07:41,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1922916.0, ans=0.0 2023-06-25 06:07:54,189 INFO [train.py:996] (0/4) Epoch 11, batch 15550, loss[loss=0.2066, simple_loss=0.2851, pruned_loss=0.06402, over 21670.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3177, pruned_loss=0.08078, over 4269809.29 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:07:59,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1922976.0, ans=0.125 2023-06-25 06:08:58,690 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.83 vs. limit=15.0 2023-06-25 06:09:07,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.113e+02 7.965e+02 1.145e+03 1.833e+03 5.244e+03, threshold=2.290e+03, percent-clipped=21.0 2023-06-25 06:09:37,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1923216.0, ans=0.125 2023-06-25 06:09:41,630 INFO [train.py:996] (0/4) Epoch 11, batch 15600, loss[loss=0.262, simple_loss=0.3327, pruned_loss=0.09572, over 21494.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3131, pruned_loss=0.07968, over 4271914.06 frames. ], batch size: 441, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:09:57,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1923276.0, ans=0.0 2023-06-25 06:10:07,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1923336.0, ans=0.125 2023-06-25 06:10:09,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1923336.0, ans=0.0 2023-06-25 06:10:46,149 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:11:33,815 INFO [train.py:996] (0/4) Epoch 11, batch 15650, loss[loss=0.2284, simple_loss=0.3039, pruned_loss=0.07648, over 20100.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3112, pruned_loss=0.07874, over 4274093.71 frames. 
], batch size: 703, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:11:59,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1923636.0, ans=0.0 2023-06-25 06:12:18,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1923696.0, ans=0.2 2023-06-25 06:12:41,214 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-25 06:12:43,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.960e+02 7.231e+02 1.048e+03 1.538e+03 3.677e+03, threshold=2.096e+03, percent-clipped=8.0 2023-06-25 06:12:59,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-25 06:13:08,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1923816.0, ans=0.125 2023-06-25 06:13:09,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1923816.0, ans=0.1 2023-06-25 06:13:23,111 INFO [train.py:996] (0/4) Epoch 11, batch 15700, loss[loss=0.218, simple_loss=0.2892, pruned_loss=0.07336, over 21375.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3062, pruned_loss=0.07721, over 4275826.03 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:13:28,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1923876.0, ans=0.0 2023-06-25 06:14:17,335 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:15:08,064 INFO [train.py:996] (0/4) Epoch 11, batch 15750, loss[loss=0.2023, simple_loss=0.3003, pruned_loss=0.05219, over 20791.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3014, pruned_loss=0.07673, over 4260004.65 frames. ], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:15:15,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-25 06:16:15,701 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-25 06:16:19,547 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.466e+02 7.452e+02 1.136e+03 1.633e+03 2.643e+03, threshold=2.272e+03, percent-clipped=11.0 2023-06-25 06:16:37,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1924416.0, ans=0.125 2023-06-25 06:16:55,705 INFO [train.py:996] (0/4) Epoch 11, batch 15800, loss[loss=0.2201, simple_loss=0.287, pruned_loss=0.07663, over 21654.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2967, pruned_loss=0.07635, over 4255880.25 frames. 
], batch size: 247, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:17:32,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1924536.0, ans=0.125 2023-06-25 06:17:35,995 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:17:38,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1924596.0, ans=0.125 2023-06-25 06:17:39,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1924596.0, ans=0.0 2023-06-25 06:18:10,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1924656.0, ans=0.125 2023-06-25 06:18:24,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1924716.0, ans=0.2 2023-06-25 06:18:40,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1924716.0, ans=0.125 2023-06-25 06:18:45,293 INFO [train.py:996] (0/4) Epoch 11, batch 15850, loss[loss=0.2845, simple_loss=0.3383, pruned_loss=0.1153, over 21331.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2985, pruned_loss=0.07837, over 4258128.14 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:19:57,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.898e+02 6.760e+02 9.766e+02 1.376e+03 2.542e+03, threshold=1.953e+03, percent-clipped=1.0 2023-06-25 06:20:34,249 INFO [train.py:996] (0/4) Epoch 11, batch 15900, loss[loss=0.2481, simple_loss=0.3218, pruned_loss=0.08722, over 21312.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2959, pruned_loss=0.0786, over 4259258.46 frames. ], batch size: 160, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:21:11,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1925136.0, ans=0.125 2023-06-25 06:21:26,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1925196.0, ans=0.125 2023-06-25 06:21:31,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1925196.0, ans=0.1 2023-06-25 06:21:50,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1925256.0, ans=0.0 2023-06-25 06:21:55,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1925316.0, ans=0.125 2023-06-25 06:22:22,492 INFO [train.py:996] (0/4) Epoch 11, batch 15950, loss[loss=0.2693, simple_loss=0.3502, pruned_loss=0.09418, over 21672.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2977, pruned_loss=0.07647, over 4266936.68 frames. ], batch size: 441, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:22:32,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-25 06:23:29,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.89 vs. 
limit=15.0 2023-06-25 06:23:35,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.240e+02 8.208e+02 1.106e+03 1.560e+03 3.108e+03, threshold=2.211e+03, percent-clipped=12.0 2023-06-25 06:23:38,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1925556.0, ans=0.0 2023-06-25 06:23:49,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1925616.0, ans=0.2 2023-06-25 06:24:12,231 INFO [train.py:996] (0/4) Epoch 11, batch 16000, loss[loss=0.181, simple_loss=0.2627, pruned_loss=0.04972, over 21371.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3005, pruned_loss=0.07501, over 4267290.61 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:25:54,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1925916.0, ans=0.0 2023-06-25 06:25:55,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1925916.0, ans=0.0 2023-06-25 06:25:58,391 INFO [train.py:996] (0/4) Epoch 11, batch 16050, loss[loss=0.2285, simple_loss=0.3372, pruned_loss=0.05992, over 21627.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.303, pruned_loss=0.07284, over 4265171.34 frames. ], batch size: 263, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:26:12,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=15.0 2023-06-25 06:26:37,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1926096.0, ans=0.125 2023-06-25 06:26:59,847 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-25 06:27:06,129 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.209e+02 1.010e+03 1.605e+03 2.461e+03 5.413e+03, threshold=3.210e+03, percent-clipped=30.0 2023-06-25 06:27:20,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-25 06:27:36,757 INFO [train.py:996] (0/4) Epoch 11, batch 16100, loss[loss=0.2478, simple_loss=0.3171, pruned_loss=0.0892, over 21581.00 frames. ], tot_loss[loss=0.229, simple_loss=0.309, pruned_loss=0.07453, over 4265091.66 frames. ], batch size: 131, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:27:54,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1926276.0, ans=0.125 2023-06-25 06:28:07,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1926336.0, ans=0.125 2023-06-25 06:28:45,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1926456.0, ans=0.125 2023-06-25 06:29:04,600 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-25 06:29:16,780 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.54 vs. 
limit=15.0 2023-06-25 06:29:17,242 INFO [train.py:996] (0/4) Epoch 11, batch 16150, loss[loss=0.2263, simple_loss=0.303, pruned_loss=0.07482, over 21841.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3093, pruned_loss=0.07726, over 4275906.16 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:30:22,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1926696.0, ans=0.2 2023-06-25 06:30:40,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.105e+02 8.824e+02 1.229e+03 1.712e+03 3.510e+03, threshold=2.459e+03, percent-clipped=5.0 2023-06-25 06:30:45,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1926756.0, ans=0.04949747468305833 2023-06-25 06:30:48,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1926816.0, ans=0.1 2023-06-25 06:31:16,338 INFO [train.py:996] (0/4) Epoch 11, batch 16200, loss[loss=0.267, simple_loss=0.3374, pruned_loss=0.09832, over 21242.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3133, pruned_loss=0.07884, over 4283185.70 frames. ], batch size: 143, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:31:17,289 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:32:03,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1926996.0, ans=0.125 2023-06-25 06:32:10,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1926996.0, ans=0.0 2023-06-25 06:32:29,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1927056.0, ans=0.1 2023-06-25 06:33:02,374 INFO [train.py:996] (0/4) Epoch 11, batch 16250, loss[loss=0.2232, simple_loss=0.3052, pruned_loss=0.07059, over 21603.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3133, pruned_loss=0.07941, over 4284863.20 frames. ], batch size: 389, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:33:41,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1927236.0, ans=0.125 2023-06-25 06:34:11,397 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.609e+02 8.190e+02 1.044e+03 1.433e+03 2.783e+03, threshold=2.088e+03, percent-clipped=4.0 2023-06-25 06:34:16,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=15.0 2023-06-25 06:34:49,055 INFO [train.py:996] (0/4) Epoch 11, batch 16300, loss[loss=0.2347, simple_loss=0.3216, pruned_loss=0.0739, over 21387.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3062, pruned_loss=0.07503, over 4276593.52 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:35:12,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1927536.0, ans=0.125 2023-06-25 06:35:14,550 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.87 vs. 
limit=10.0 2023-06-25 06:36:10,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.47 vs. limit=10.0 2023-06-25 06:36:36,573 INFO [train.py:996] (0/4) Epoch 11, batch 16350, loss[loss=0.2242, simple_loss=0.3079, pruned_loss=0.07023, over 21784.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3058, pruned_loss=0.07588, over 4278042.82 frames. ], batch size: 118, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:36:37,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1927776.0, ans=0.125 2023-06-25 06:36:45,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1927776.0, ans=0.125 2023-06-25 06:37:03,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1927836.0, ans=0.125 2023-06-25 06:37:52,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.557e+02 7.101e+02 1.051e+03 1.461e+03 2.820e+03, threshold=2.102e+03, percent-clipped=5.0 2023-06-25 06:38:24,492 INFO [train.py:996] (0/4) Epoch 11, batch 16400, loss[loss=0.2304, simple_loss=0.2999, pruned_loss=0.08048, over 21891.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.31, pruned_loss=0.07848, over 4283915.59 frames. ], batch size: 371, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:38:52,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1928136.0, ans=0.125 2023-06-25 06:39:12,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1928196.0, ans=0.1 2023-06-25 06:39:46,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1928316.0, ans=0.035 2023-06-25 06:40:01,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1928316.0, ans=0.07 2023-06-25 06:40:07,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1928316.0, ans=6.0 2023-06-25 06:40:09,733 INFO [train.py:996] (0/4) Epoch 11, batch 16450, loss[loss=0.2427, simple_loss=0.301, pruned_loss=0.09216, over 21660.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3086, pruned_loss=0.0786, over 4290422.32 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:40:50,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1928496.0, ans=0.0 2023-06-25 06:41:18,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1928556.0, ans=0.0 2023-06-25 06:41:22,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.821e+02 6.919e+02 9.825e+02 1.554e+03 3.786e+03, threshold=1.965e+03, percent-clipped=13.0 2023-06-25 06:41:25,598 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-25 06:41:40,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1928616.0, ans=0.0 2023-06-25 06:41:53,333 INFO [train.py:996] (0/4) Epoch 11, batch 16500, loss[loss=0.1783, simple_loss=0.256, pruned_loss=0.05029, over 21656.00 frames. 
], tot_loss[loss=0.2316, simple_loss=0.3065, pruned_loss=0.07836, over 4280585.53 frames. ], batch size: 263, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:41:53,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1928676.0, ans=0.2 2023-06-25 06:41:55,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=15.0 2023-06-25 06:42:24,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1928736.0, ans=0.2 2023-06-25 06:43:43,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1928976.0, ans=15.0 2023-06-25 06:43:44,085 INFO [train.py:996] (0/4) Epoch 11, batch 16550, loss[loss=0.2306, simple_loss=0.2958, pruned_loss=0.08268, over 21313.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3043, pruned_loss=0.07625, over 4275054.47 frames. ], batch size: 159, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:43:46,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1928976.0, ans=0.1 2023-06-25 06:44:39,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1929096.0, ans=0.0 2023-06-25 06:44:39,821 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.97 vs. limit=10.0 2023-06-25 06:45:07,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.454e+02 9.180e+02 1.462e+03 2.154e+03 5.250e+03, threshold=2.924e+03, percent-clipped=28.0 2023-06-25 06:45:31,244 INFO [train.py:996] (0/4) Epoch 11, batch 16600, loss[loss=0.2664, simple_loss=0.3473, pruned_loss=0.09281, over 21806.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3129, pruned_loss=0.07918, over 4273069.89 frames. ], batch size: 124, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:45:33,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1929276.0, ans=0.2 2023-06-25 06:46:04,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1929336.0, ans=0.125 2023-06-25 06:46:49,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1929456.0, ans=0.2 2023-06-25 06:47:06,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1929516.0, ans=0.125 2023-06-25 06:47:21,352 INFO [train.py:996] (0/4) Epoch 11, batch 16650, loss[loss=0.284, simple_loss=0.3545, pruned_loss=0.1068, over 21571.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3233, pruned_loss=0.08157, over 4273865.55 frames. 
], batch size: 389, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:47:36,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1929576.0, ans=0.0 2023-06-25 06:47:45,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1929636.0, ans=0.0 2023-06-25 06:48:32,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1929696.0, ans=0.1 2023-06-25 06:48:40,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-25 06:48:43,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1929756.0, ans=0.125 2023-06-25 06:48:48,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.799e+02 8.385e+02 1.061e+03 1.516e+03 3.591e+03, threshold=2.122e+03, percent-clipped=0.0 2023-06-25 06:49:11,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1929816.0, ans=0.0 2023-06-25 06:49:18,273 INFO [train.py:996] (0/4) Epoch 11, batch 16700, loss[loss=0.2645, simple_loss=0.3624, pruned_loss=0.08337, over 21266.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3243, pruned_loss=0.08228, over 4270977.32 frames. ], batch size: 549, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:50:28,684 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:50:44,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1930056.0, ans=0.0 2023-06-25 06:50:50,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1930116.0, ans=0.125 2023-06-25 06:51:19,150 INFO [train.py:996] (0/4) Epoch 11, batch 16750, loss[loss=0.2714, simple_loss=0.3497, pruned_loss=0.0965, over 21758.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3275, pruned_loss=0.08523, over 4270903.34 frames. ], batch size: 332, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:51:46,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1930236.0, ans=0.2 2023-06-25 06:52:26,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1930356.0, ans=0.125 2023-06-25 06:52:28,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1930356.0, ans=0.125 2023-06-25 06:52:39,845 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.774e+02 8.238e+02 1.096e+03 1.590e+03 4.377e+03, threshold=2.192e+03, percent-clipped=15.0 2023-06-25 06:52:52,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1930416.0, ans=0.1 2023-06-25 06:53:01,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.40 vs. limit=15.0 2023-06-25 06:53:14,816 INFO [train.py:996] (0/4) Epoch 11, batch 16800, loss[loss=0.1657, simple_loss=0.2121, pruned_loss=0.05967, over 17012.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.331, pruned_loss=0.08482, over 4260638.05 frames. 
], batch size: 61, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:53:50,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1930536.0, ans=0.125 2023-06-25 06:54:11,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1930656.0, ans=0.2 2023-06-25 06:54:59,809 INFO [train.py:996] (0/4) Epoch 11, batch 16850, loss[loss=0.2499, simple_loss=0.3281, pruned_loss=0.0858, over 21956.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3275, pruned_loss=0.08452, over 4263429.32 frames. ], batch size: 113, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:55:00,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1930776.0, ans=0.0 2023-06-25 06:55:10,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1930776.0, ans=0.0 2023-06-25 06:55:18,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1930836.0, ans=0.0 2023-06-25 06:55:25,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1930836.0, ans=0.2 2023-06-25 06:55:28,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1930836.0, ans=0.125 2023-06-25 06:56:12,568 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.019e+02 7.924e+02 1.109e+03 1.823e+03 3.367e+03, threshold=2.218e+03, percent-clipped=14.0 2023-06-25 06:56:40,188 INFO [train.py:996] (0/4) Epoch 11, batch 16900, loss[loss=0.1907, simple_loss=0.2816, pruned_loss=0.04993, over 21843.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3205, pruned_loss=0.08188, over 4267893.82 frames. ], batch size: 371, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:56:53,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1931076.0, ans=0.125 2023-06-25 06:56:56,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1931136.0, ans=0.0 2023-06-25 06:57:58,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1931256.0, ans=0.125 2023-06-25 06:58:00,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1931316.0, ans=0.125 2023-06-25 06:58:23,632 INFO [train.py:996] (0/4) Epoch 11, batch 16950, loss[loss=0.2008, simple_loss=0.2771, pruned_loss=0.06228, over 21812.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3138, pruned_loss=0.08077, over 4269500.79 frames. 
], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:58:34,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1931376.0, ans=0.125 2023-06-25 06:58:47,344 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:59:41,841 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.539e+02 6.475e+02 7.581e+02 1.089e+03 2.288e+03, threshold=1.516e+03, percent-clipped=2.0 2023-06-25 06:59:44,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1931556.0, ans=0.0 2023-06-25 07:00:09,571 INFO [train.py:996] (0/4) Epoch 11, batch 17000, loss[loss=0.2214, simple_loss=0.2827, pruned_loss=0.08012, over 20116.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3102, pruned_loss=0.08103, over 4271858.43 frames. ], batch size: 702, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:00:19,443 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-25 07:00:42,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1931736.0, ans=0.0 2023-06-25 07:01:26,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1931856.0, ans=0.1 2023-06-25 07:01:55,597 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2023-06-25 07:01:56,046 INFO [train.py:996] (0/4) Epoch 11, batch 17050, loss[loss=0.2446, simple_loss=0.3238, pruned_loss=0.08265, over 21649.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3158, pruned_loss=0.08323, over 4280702.08 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:02:52,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1932096.0, ans=0.0 2023-06-25 07:02:57,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1932156.0, ans=0.125 2023-06-25 07:03:03,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1932156.0, ans=0.125 2023-06-25 07:03:06,197 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=13.41 vs. limit=15.0 2023-06-25 07:03:06,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1932156.0, ans=0.2 2023-06-25 07:03:10,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1932156.0, ans=0.0 2023-06-25 07:03:22,800 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.436e+02 8.949e+02 1.142e+03 1.744e+03 3.951e+03, threshold=2.284e+03, percent-clipped=33.0 2023-06-25 07:03:42,213 INFO [train.py:996] (0/4) Epoch 11, batch 17100, loss[loss=0.2777, simple_loss=0.3801, pruned_loss=0.08763, over 20993.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3167, pruned_loss=0.08465, over 4288925.62 frames. 
], batch size: 607, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:03:50,699 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-25 07:04:33,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1932396.0, ans=0.125 2023-06-25 07:05:09,543 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-06-25 07:05:29,195 INFO [train.py:996] (0/4) Epoch 11, batch 17150, loss[loss=0.2404, simple_loss=0.3055, pruned_loss=0.08762, over 21877.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3134, pruned_loss=0.08375, over 4291290.15 frames. ], batch size: 391, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:06:09,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1932696.0, ans=0.2 2023-06-25 07:06:31,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1932696.0, ans=0.2 2023-06-25 07:06:55,806 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.253e+02 7.399e+02 1.011e+03 1.479e+03 2.669e+03, threshold=2.021e+03, percent-clipped=4.0 2023-06-25 07:07:01,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1932816.0, ans=0.125 2023-06-25 07:07:05,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1932816.0, ans=0.125 2023-06-25 07:07:10,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932816.0, ans=0.1 2023-06-25 07:07:16,477 INFO [train.py:996] (0/4) Epoch 11, batch 17200, loss[loss=0.3147, simple_loss=0.3701, pruned_loss=0.1297, over 21487.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3114, pruned_loss=0.08238, over 4287453.88 frames. ], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:07:55,556 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-25 07:08:31,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0 2023-06-25 07:08:58,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1933116.0, ans=0.035 2023-06-25 07:09:10,170 INFO [train.py:996] (0/4) Epoch 11, batch 17250, loss[loss=0.2752, simple_loss=0.3434, pruned_loss=0.1035, over 21389.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3154, pruned_loss=0.08483, over 4288980.95 frames. 
], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:09:12,369 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:09:23,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1933176.0, ans=0.125 2023-06-25 07:09:54,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1933236.0, ans=0.0 2023-06-25 07:10:24,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1933356.0, ans=0.2 2023-06-25 07:10:29,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1933356.0, ans=0.0 2023-06-25 07:10:31,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.220e+02 7.674e+02 1.037e+03 1.511e+03 3.569e+03, threshold=2.074e+03, percent-clipped=11.0 2023-06-25 07:10:56,496 INFO [train.py:996] (0/4) Epoch 11, batch 17300, loss[loss=0.2829, simple_loss=0.3491, pruned_loss=0.1083, over 21758.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3241, pruned_loss=0.08824, over 4285557.76 frames. ], batch size: 332, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:11:37,463 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.70 vs. limit=15.0 2023-06-25 07:11:50,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1933596.0, ans=0.04949747468305833 2023-06-25 07:12:09,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1933656.0, ans=0.0 2023-06-25 07:12:50,951 INFO [train.py:996] (0/4) Epoch 11, batch 17350, loss[loss=0.2694, simple_loss=0.3582, pruned_loss=0.09034, over 21475.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3258, pruned_loss=0.08796, over 4285225.79 frames. ], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:13:01,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.98 vs. limit=15.0 2023-06-25 07:13:27,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1933836.0, ans=0.1 2023-06-25 07:13:38,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1933896.0, ans=0.0 2023-06-25 07:13:39,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.04 vs. limit=5.0 2023-06-25 07:14:08,342 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.102e+02 8.837e+02 1.250e+03 1.745e+03 4.253e+03, threshold=2.500e+03, percent-clipped=18.0 2023-06-25 07:14:46,098 INFO [train.py:996] (0/4) Epoch 11, batch 17400, loss[loss=0.2218, simple_loss=0.2866, pruned_loss=0.07849, over 20134.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3214, pruned_loss=0.08409, over 4278346.94 frames. 
], batch size: 707, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:15:12,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1934136.0, ans=0.1 2023-06-25 07:16:33,125 INFO [train.py:996] (0/4) Epoch 11, batch 17450, loss[loss=0.193, simple_loss=0.2803, pruned_loss=0.05289, over 21573.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3195, pruned_loss=0.08196, over 4276913.20 frames. ], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:16:52,697 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-25 07:17:08,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-25 07:17:29,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=12.0 2023-06-25 07:17:32,460 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=22.5 2023-06-25 07:17:33,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1934556.0, ans=0.125 2023-06-25 07:17:51,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1934556.0, ans=0.2 2023-06-25 07:18:02,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1934556.0, ans=0.0 2023-06-25 07:18:04,013 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.432e+02 7.727e+02 1.188e+03 2.165e+03 4.981e+03, threshold=2.376e+03, percent-clipped=19.0 2023-06-25 07:18:10,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-25 07:18:15,567 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-06-25 07:18:22,114 INFO [train.py:996] (0/4) Epoch 11, batch 17500, loss[loss=0.2502, simple_loss=0.3146, pruned_loss=0.09289, over 21703.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3133, pruned_loss=0.07891, over 4279093.85 frames. ], batch size: 508, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:19:05,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1934796.0, ans=0.2 2023-06-25 07:19:59,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1934916.0, ans=0.0 2023-06-25 07:20:05,474 INFO [train.py:996] (0/4) Epoch 11, batch 17550, loss[loss=0.2269, simple_loss=0.3146, pruned_loss=0.06963, over 21355.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3131, pruned_loss=0.07794, over 4289062.95 frames. ], batch size: 176, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:20:30,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.14 vs. 
limit=22.5 2023-06-25 07:20:37,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1935036.0, ans=0.0 2023-06-25 07:20:57,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1935096.0, ans=0.125 2023-06-25 07:21:29,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.993e+02 7.228e+02 9.267e+02 1.344e+03 3.002e+03, threshold=1.853e+03, percent-clipped=5.0 2023-06-25 07:21:49,280 INFO [train.py:996] (0/4) Epoch 11, batch 17600, loss[loss=0.2698, simple_loss=0.3405, pruned_loss=0.09958, over 21373.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3159, pruned_loss=0.07842, over 4280822.11 frames. ], batch size: 176, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:21:56,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1935276.0, ans=0.125 2023-06-25 07:22:18,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1935336.0, ans=0.125 2023-06-25 07:22:46,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1935456.0, ans=0.125 2023-06-25 07:23:43,426 INFO [train.py:996] (0/4) Epoch 11, batch 17650, loss[loss=0.2916, simple_loss=0.3554, pruned_loss=0.1139, over 21146.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3138, pruned_loss=0.07838, over 4266981.17 frames. ], batch size: 143, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:23:50,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1935576.0, ans=0.125 2023-06-25 07:24:58,183 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.796e-03 2023-06-25 07:25:11,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1935756.0, ans=0.1 2023-06-25 07:25:12,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.151e+02 8.324e+02 1.406e+03 1.795e+03 4.059e+03, threshold=2.812e+03, percent-clipped=23.0 2023-06-25 07:25:30,947 INFO [train.py:996] (0/4) Epoch 11, batch 17700, loss[loss=0.2165, simple_loss=0.3132, pruned_loss=0.05991, over 20696.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.308, pruned_loss=0.07506, over 4272307.91 frames. ], batch size: 607, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:25:42,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1935876.0, ans=0.2 2023-06-25 07:27:21,743 INFO [train.py:996] (0/4) Epoch 11, batch 17750, loss[loss=0.2639, simple_loss=0.3403, pruned_loss=0.09377, over 21388.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3158, pruned_loss=0.07856, over 4279844.39 frames. 
], batch size: 549, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:27:42,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1936236.0, ans=0.125 2023-06-25 07:28:17,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1936296.0, ans=0.125 2023-06-25 07:28:30,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1936356.0, ans=0.1 2023-06-25 07:28:50,294 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 6.855e+02 8.351e+02 1.068e+03 2.757e+03, threshold=1.670e+03, percent-clipped=0.0 2023-06-25 07:29:05,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1936416.0, ans=0.1 2023-06-25 07:29:09,864 INFO [train.py:996] (0/4) Epoch 11, batch 17800, loss[loss=0.1915, simple_loss=0.2639, pruned_loss=0.05956, over 21657.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.314, pruned_loss=0.07767, over 4281702.17 frames. ], batch size: 112, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:29:20,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1936476.0, ans=0.0 2023-06-25 07:30:09,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1936596.0, ans=0.1 2023-06-25 07:30:11,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-25 07:30:16,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1936596.0, ans=0.125 2023-06-25 07:30:55,473 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-25 07:30:57,471 INFO [train.py:996] (0/4) Epoch 11, batch 17850, loss[loss=0.3129, simple_loss=0.3807, pruned_loss=0.1225, over 21416.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3148, pruned_loss=0.07797, over 4271045.51 frames. 
], batch size: 471, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:31:20,131 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.305e-02 2023-06-25 07:31:59,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1936896.0, ans=0.125 2023-06-25 07:32:03,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1936896.0, ans=0.125 2023-06-25 07:32:20,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1936956.0, ans=0.125 2023-06-25 07:32:22,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.524e+02 9.470e+02 1.328e+03 1.940e+03 3.459e+03, threshold=2.655e+03, percent-clipped=37.0 2023-06-25 07:32:33,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1937016.0, ans=0.125 2023-06-25 07:32:35,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1937016.0, ans=0.2 2023-06-25 07:32:38,444 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:32:39,631 INFO [train.py:996] (0/4) Epoch 11, batch 17900, loss[loss=0.2279, simple_loss=0.3154, pruned_loss=0.07017, over 21266.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3208, pruned_loss=0.08069, over 4271119.88 frames. ], batch size: 159, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:33:12,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.47 vs. limit=15.0 2023-06-25 07:33:48,696 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-25 07:34:01,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1937256.0, ans=0.125 2023-06-25 07:34:41,346 INFO [train.py:996] (0/4) Epoch 11, batch 17950, loss[loss=0.2081, simple_loss=0.2733, pruned_loss=0.07146, over 20098.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3202, pruned_loss=0.07778, over 4269630.80 frames. ], batch size: 703, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:34:47,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1937376.0, ans=0.0 2023-06-25 07:34:49,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-25 07:35:14,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1937436.0, ans=0.1 2023-06-25 07:35:26,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1937496.0, ans=0.0 2023-06-25 07:35:59,049 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.579e+02 7.777e+02 1.188e+03 1.807e+03 3.395e+03, threshold=2.377e+03, percent-clipped=4.0 2023-06-25 07:36:27,830 INFO [train.py:996] (0/4) Epoch 11, batch 18000, loss[loss=0.2235, simple_loss=0.2878, pruned_loss=0.0796, over 21598.00 frames. 
], tot_loss[loss=0.2317, simple_loss=0.3121, pruned_loss=0.07568, over 4271219.07 frames. ], batch size: 298, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:36:27,832 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 07:36:44,998 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2562, simple_loss=0.3557, pruned_loss=0.07833, over 1796401.00 frames. 2023-06-25 07:36:44,999 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 07:36:56,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1937676.0, ans=0.0 2023-06-25 07:38:33,166 INFO [train.py:996] (0/4) Epoch 11, batch 18050, loss[loss=0.2623, simple_loss=0.3276, pruned_loss=0.09855, over 21333.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.307, pruned_loss=0.07465, over 4265914.39 frames. ], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:39:59,295 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.782e+02 7.777e+02 1.078e+03 1.586e+03 2.998e+03, threshold=2.156e+03, percent-clipped=7.0 2023-06-25 07:40:21,752 INFO [train.py:996] (0/4) Epoch 11, batch 18100, loss[loss=0.2233, simple_loss=0.2994, pruned_loss=0.0736, over 21844.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.311, pruned_loss=0.07724, over 4265274.21 frames. ], batch size: 107, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:40:23,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.35 vs. limit=22.5 2023-06-25 07:40:31,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1938276.0, ans=0.1 2023-06-25 07:40:46,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-25 07:41:30,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-25 07:41:36,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1938456.0, ans=0.04949747468305833 2023-06-25 07:42:08,820 INFO [train.py:996] (0/4) Epoch 11, batch 18150, loss[loss=0.2246, simple_loss=0.2944, pruned_loss=0.07745, over 21625.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3123, pruned_loss=0.07786, over 4253146.44 frames. ], batch size: 247, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:42:11,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1938576.0, ans=0.0 2023-06-25 07:42:27,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.46 vs. limit=22.5 2023-06-25 07:42:28,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1938636.0, ans=0.125 2023-06-25 07:42:43,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1938696.0, ans=0.1 2023-06-25 07:43:17,352 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.56 vs. 
limit=15.0 2023-06-25 07:43:27,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1938756.0, ans=0.125 2023-06-25 07:43:31,676 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.895e+02 7.455e+02 1.236e+03 1.816e+03 3.616e+03, threshold=2.471e+03, percent-clipped=14.0 2023-06-25 07:43:54,193 INFO [train.py:996] (0/4) Epoch 11, batch 18200, loss[loss=0.1905, simple_loss=0.2678, pruned_loss=0.05664, over 21700.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3072, pruned_loss=0.07727, over 4257624.89 frames. ], batch size: 282, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:44:19,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1938936.0, ans=0.2 2023-06-25 07:44:30,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1938996.0, ans=0.1 2023-06-25 07:45:33,100 INFO [train.py:996] (0/4) Epoch 11, batch 18250, loss[loss=0.2309, simple_loss=0.2964, pruned_loss=0.08276, over 21730.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3005, pruned_loss=0.07516, over 4260646.23 frames. ], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:45:44,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1939176.0, ans=0.1 2023-06-25 07:45:50,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1939176.0, ans=0.125 2023-06-25 07:46:34,806 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:46:56,932 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.451e+02 6.639e+02 9.483e+02 1.514e+03 2.544e+03, threshold=1.897e+03, percent-clipped=1.0 2023-06-25 07:47:10,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1939416.0, ans=0.0 2023-06-25 07:47:21,095 INFO [train.py:996] (0/4) Epoch 11, batch 18300, loss[loss=0.2379, simple_loss=0.3413, pruned_loss=0.06722, over 21716.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3001, pruned_loss=0.07563, over 4265739.62 frames. ], batch size: 247, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:47:23,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1939476.0, ans=0.125 2023-06-25 07:48:13,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2023-06-25 07:49:01,556 INFO [train.py:996] (0/4) Epoch 11, batch 18350, loss[loss=0.2185, simple_loss=0.2778, pruned_loss=0.0796, over 21363.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3056, pruned_loss=0.07558, over 4263389.33 frames. ], batch size: 194, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:49:28,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1939836.0, ans=0.125 2023-06-25 07:49:44,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. 
limit=10.0 2023-06-25 07:50:32,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.196e+02 8.074e+02 1.390e+03 1.835e+03 4.417e+03, threshold=2.780e+03, percent-clipped=23.0 2023-06-25 07:50:53,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.04 vs. limit=15.0 2023-06-25 07:50:54,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1940076.0, ans=0.125 2023-06-25 07:50:55,382 INFO [train.py:996] (0/4) Epoch 11, batch 18400, loss[loss=0.2166, simple_loss=0.2875, pruned_loss=0.0728, over 21723.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3027, pruned_loss=0.07412, over 4263074.77 frames. ], batch size: 112, lr: 2.64e-03, grad_scale: 32.0 2023-06-25 07:51:24,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1940136.0, ans=0.125 2023-06-25 07:51:42,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1940196.0, ans=0.125 2023-06-25 07:51:52,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1940256.0, ans=0.125 2023-06-25 07:51:55,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1940256.0, ans=0.035 2023-06-25 07:52:43,135 INFO [train.py:996] (0/4) Epoch 11, batch 18450, loss[loss=0.1659, simple_loss=0.2609, pruned_loss=0.03547, over 21791.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2984, pruned_loss=0.07078, over 4258402.33 frames. ], batch size: 352, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:53:05,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1940436.0, ans=0.1 2023-06-25 07:54:05,002 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.624e+02 6.992e+02 1.032e+03 1.619e+03 3.807e+03, threshold=2.064e+03, percent-clipped=5.0 2023-06-25 07:54:11,203 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.64 vs. limit=15.0 2023-06-25 07:54:25,091 INFO [train.py:996] (0/4) Epoch 11, batch 18500, loss[loss=0.219, simple_loss=0.3111, pruned_loss=0.06348, over 21649.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2949, pruned_loss=0.06919, over 4249951.11 frames. ], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:54:33,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1940676.0, ans=0.125 2023-06-25 07:55:22,105 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-25 07:55:40,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1940856.0, ans=0.125 2023-06-25 07:56:00,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1940916.0, ans=0.5 2023-06-25 07:56:15,433 INFO [train.py:996] (0/4) Epoch 11, batch 18550, loss[loss=0.1967, simple_loss=0.2629, pruned_loss=0.06529, over 21707.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2921, pruned_loss=0.06804, over 4251248.25 frames. 
], batch size: 124, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:57:49,425 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.217e+02 7.256e+02 1.032e+03 1.520e+03 3.767e+03, threshold=2.064e+03, percent-clipped=11.0 2023-06-25 07:58:04,467 INFO [train.py:996] (0/4) Epoch 11, batch 18600, loss[loss=0.2771, simple_loss=0.3587, pruned_loss=0.09772, over 21539.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2906, pruned_loss=0.06894, over 4252268.80 frames. ], batch size: 473, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:58:06,469 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:58:23,677 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.88 vs. limit=6.0 2023-06-25 07:58:44,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1941396.0, ans=0.1 2023-06-25 07:59:12,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1941456.0, ans=0.125 2023-06-25 07:59:17,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-25 07:59:29,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1941456.0, ans=0.95 2023-06-25 07:59:41,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1941516.0, ans=0.125 2023-06-25 07:59:51,206 INFO [train.py:996] (0/4) Epoch 11, batch 18650, loss[loss=0.1947, simple_loss=0.2616, pruned_loss=0.06385, over 21240.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2911, pruned_loss=0.06989, over 4245329.30 frames. ], batch size: 177, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:59:53,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1941576.0, ans=0.1 2023-06-25 08:00:05,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-25 08:01:11,979 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-25 08:01:21,720 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.720e+02 7.139e+02 9.409e+02 1.577e+03 2.753e+03, threshold=1.882e+03, percent-clipped=11.0 2023-06-25 08:01:30,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1941816.0, ans=0.125 2023-06-25 08:01:35,909 INFO [train.py:996] (0/4) Epoch 11, batch 18700, loss[loss=0.2121, simple_loss=0.2737, pruned_loss=0.07525, over 21579.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2882, pruned_loss=0.07134, over 4242393.64 frames. 
], batch size: 247, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:02:22,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1941996.0, ans=0.125 2023-06-25 08:02:52,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.85 vs. limit=10.0 2023-06-25 08:02:55,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1942056.0, ans=0.125 2023-06-25 08:02:58,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1942056.0, ans=0.0 2023-06-25 08:03:23,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1942176.0, ans=0.125 2023-06-25 08:03:24,900 INFO [train.py:996] (0/4) Epoch 11, batch 18750, loss[loss=0.2679, simple_loss=0.3577, pruned_loss=0.08906, over 21290.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2924, pruned_loss=0.07478, over 4259013.06 frames. ], batch size: 548, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:03:36,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1942176.0, ans=0.0 2023-06-25 08:03:41,834 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:03:53,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1942236.0, ans=0.125 2023-06-25 08:04:50,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.888e+02 8.399e+02 1.249e+03 1.994e+03 4.167e+03, threshold=2.497e+03, percent-clipped=25.0 2023-06-25 08:05:02,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1942416.0, ans=0.0 2023-06-25 08:05:11,246 INFO [train.py:996] (0/4) Epoch 11, batch 18800, loss[loss=0.2288, simple_loss=0.2929, pruned_loss=0.08233, over 21217.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2965, pruned_loss=0.07542, over 4246245.35 frames. ], batch size: 176, lr: 2.64e-03, grad_scale: 32.0 2023-06-25 08:05:21,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1942476.0, ans=0.125 2023-06-25 08:06:56,514 INFO [train.py:996] (0/4) Epoch 11, batch 18850, loss[loss=0.2013, simple_loss=0.2924, pruned_loss=0.05512, over 21750.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2939, pruned_loss=0.07136, over 4244814.43 frames. ], batch size: 298, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:07:01,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.47 vs. limit=15.0 2023-06-25 08:07:03,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1942776.0, ans=0.125 2023-06-25 08:07:48,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1942896.0, ans=0.125 2023-06-25 08:08:19,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. 
limit=15.0 2023-06-25 08:08:21,210 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.163e+02 6.140e+02 8.289e+02 1.259e+03 4.459e+03, threshold=1.658e+03, percent-clipped=10.0 2023-06-25 08:08:22,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1943016.0, ans=0.125 2023-06-25 08:08:32,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1943016.0, ans=0.125 2023-06-25 08:08:40,585 INFO [train.py:996] (0/4) Epoch 11, batch 18900, loss[loss=0.243, simple_loss=0.2957, pruned_loss=0.09514, over 21763.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2897, pruned_loss=0.07105, over 4249525.77 frames. ], batch size: 102, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:08:56,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1943136.0, ans=0.125 2023-06-25 08:09:58,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1943256.0, ans=0.2 2023-06-25 08:10:26,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1943376.0, ans=0.0 2023-06-25 08:10:27,774 INFO [train.py:996] (0/4) Epoch 11, batch 18950, loss[loss=0.2236, simple_loss=0.2961, pruned_loss=0.07558, over 21878.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2899, pruned_loss=0.07249, over 4260257.71 frames. ], batch size: 124, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:11:31,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1943496.0, ans=0.125 2023-06-25 08:11:57,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1943616.0, ans=0.125 2023-06-25 08:12:02,079 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.904e+02 8.258e+02 1.054e+03 1.529e+03 3.478e+03, threshold=2.107e+03, percent-clipped=19.0 2023-06-25 08:12:15,310 INFO [train.py:996] (0/4) Epoch 11, batch 19000, loss[loss=0.2586, simple_loss=0.3273, pruned_loss=0.09501, over 21499.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2997, pruned_loss=0.07424, over 4265036.40 frames. ], batch size: 194, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:12:27,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=1943676.0, ans=0.02 2023-06-25 08:13:29,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1943856.0, ans=0.125 2023-06-25 08:13:53,168 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=22.5 2023-06-25 08:13:55,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1943916.0, ans=0.09899494936611666 2023-06-25 08:13:59,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1943916.0, ans=0.0 2023-06-25 08:14:01,787 INFO [train.py:996] (0/4) Epoch 11, batch 19050, loss[loss=0.2345, simple_loss=0.299, pruned_loss=0.085, over 21806.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3042, pruned_loss=0.07766, over 4271878.46 frames. 
], batch size: 112, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:14:07,246 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-324000.pt 2023-06-25 08:14:17,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1943976.0, ans=0.125 2023-06-25 08:14:32,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1944036.0, ans=0.125 2023-06-25 08:15:13,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1944156.0, ans=0.2 2023-06-25 08:15:15,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1944156.0, ans=0.125 2023-06-25 08:15:19,931 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.15 vs. limit=12.0 2023-06-25 08:15:33,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 7.658e+02 1.037e+03 1.522e+03 3.485e+03, threshold=2.073e+03, percent-clipped=12.0 2023-06-25 08:15:48,089 INFO [train.py:996] (0/4) Epoch 11, batch 19100, loss[loss=0.2401, simple_loss=0.2976, pruned_loss=0.09128, over 15711.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3039, pruned_loss=0.07891, over 4268142.63 frames. ], batch size: 61, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:16:57,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-25 08:17:35,912 INFO [train.py:996] (0/4) Epoch 11, batch 19150, loss[loss=0.2625, simple_loss=0.3661, pruned_loss=0.07943, over 21860.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3071, pruned_loss=0.07989, over 4271839.43 frames. ], batch size: 372, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:17:57,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1944636.0, ans=0.125 2023-06-25 08:19:13,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1944816.0, ans=0.0 2023-06-25 08:19:14,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.349e+02 9.714e+02 1.394e+03 2.160e+03 4.455e+03, threshold=2.788e+03, percent-clipped=28.0 2023-06-25 08:19:26,273 INFO [train.py:996] (0/4) Epoch 11, batch 19200, loss[loss=0.2879, simple_loss=0.3839, pruned_loss=0.0959, over 21731.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3156, pruned_loss=0.08031, over 4273554.09 frames. ], batch size: 351, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:20:09,135 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:21:11,489 INFO [train.py:996] (0/4) Epoch 11, batch 19250, loss[loss=0.173, simple_loss=0.2678, pruned_loss=0.03911, over 21665.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3157, pruned_loss=0.07613, over 4275313.78 frames. 
], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:21:17,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1945176.0, ans=0.1 2023-06-25 08:21:49,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1945236.0, ans=0.95 2023-06-25 08:21:59,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1945236.0, ans=0.125 2023-06-25 08:22:04,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1945296.0, ans=0.125 2023-06-25 08:22:46,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.037e+02 6.757e+02 9.006e+02 1.219e+03 2.409e+03, threshold=1.801e+03, percent-clipped=0.0 2023-06-25 08:22:57,455 INFO [train.py:996] (0/4) Epoch 11, batch 19300, loss[loss=0.2636, simple_loss=0.3372, pruned_loss=0.09505, over 21602.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3123, pruned_loss=0.07596, over 4286714.42 frames. ], batch size: 508, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:22:59,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1945476.0, ans=0.125 2023-06-25 08:23:13,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1945476.0, ans=0.2 2023-06-25 08:24:32,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1945716.0, ans=0.1 2023-06-25 08:24:52,191 INFO [train.py:996] (0/4) Epoch 11, batch 19350, loss[loss=0.1855, simple_loss=0.272, pruned_loss=0.04943, over 21609.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3074, pruned_loss=0.07247, over 4281163.18 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:26:01,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.38 vs. limit=10.0 2023-06-25 08:26:18,157 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.662e+02 8.721e+02 1.407e+03 2.132e+03 4.703e+03, threshold=2.815e+03, percent-clipped=33.0 2023-06-25 08:26:36,765 INFO [train.py:996] (0/4) Epoch 11, batch 19400, loss[loss=0.2014, simple_loss=0.2753, pruned_loss=0.06378, over 21724.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3028, pruned_loss=0.07113, over 4277003.06 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:27:09,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1946136.0, ans=0.0 2023-06-25 08:27:38,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1946196.0, ans=0.125 2023-06-25 08:28:22,719 INFO [train.py:996] (0/4) Epoch 11, batch 19450, loss[loss=0.2302, simple_loss=0.2933, pruned_loss=0.08357, over 21482.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3013, pruned_loss=0.07315, over 4282319.72 frames. 
], batch size: 131, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 08:28:24,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1946376.0, ans=0.0 2023-06-25 08:28:33,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1946376.0, ans=0.125 2023-06-25 08:29:28,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1946556.0, ans=0.07 2023-06-25 08:29:33,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1946556.0, ans=0.2 2023-06-25 08:29:40,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1946556.0, ans=0.025 2023-06-25 08:29:53,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.635e+02 8.363e+02 1.164e+03 1.702e+03 3.020e+03, threshold=2.327e+03, percent-clipped=5.0 2023-06-25 08:30:08,976 INFO [train.py:996] (0/4) Epoch 11, batch 19500, loss[loss=0.2344, simple_loss=0.3082, pruned_loss=0.08034, over 21766.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2973, pruned_loss=0.07418, over 4284466.61 frames. ], batch size: 333, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 08:31:00,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1946796.0, ans=0.125 2023-06-25 08:31:12,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-25 08:31:57,069 INFO [train.py:996] (0/4) Epoch 11, batch 19550, loss[loss=0.1739, simple_loss=0.2619, pruned_loss=0.04294, over 21768.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2958, pruned_loss=0.07398, over 4276366.17 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 08:32:49,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1947096.0, ans=0.0 2023-06-25 08:33:15,826 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=15.0 2023-06-25 08:33:31,270 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.735e+02 7.971e+02 1.072e+03 1.636e+03 3.226e+03, threshold=2.144e+03, percent-clipped=9.0 2023-06-25 08:33:41,360 INFO [train.py:996] (0/4) Epoch 11, batch 19600, loss[loss=0.2554, simple_loss=0.3217, pruned_loss=0.09458, over 21441.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2977, pruned_loss=0.07456, over 4285444.99 frames. ], batch size: 144, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:34:33,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1947396.0, ans=0.0 2023-06-25 08:34:45,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1947396.0, ans=0.125 2023-06-25 08:35:36,930 INFO [train.py:996] (0/4) Epoch 11, batch 19650, loss[loss=0.228, simple_loss=0.3008, pruned_loss=0.07756, over 21829.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3034, pruned_loss=0.07895, over 4286490.80 frames. 
], batch size: 247, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:35:41,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-25 08:35:41,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.34 vs. limit=22.5 2023-06-25 08:36:35,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1947756.0, ans=0.0 2023-06-25 08:37:04,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1947816.0, ans=0.2 2023-06-25 08:37:15,099 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.249e+02 7.644e+02 9.843e+02 1.375e+03 3.676e+03, threshold=1.969e+03, percent-clipped=9.0 2023-06-25 08:37:30,382 INFO [train.py:996] (0/4) Epoch 11, batch 19700, loss[loss=0.1948, simple_loss=0.3115, pruned_loss=0.03906, over 20740.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3072, pruned_loss=0.07881, over 4280883.51 frames. ], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:37:36,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1947876.0, ans=0.05 2023-06-25 08:38:28,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1948056.0, ans=0.1 2023-06-25 08:38:32,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1948056.0, ans=0.125 2023-06-25 08:38:52,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1948116.0, ans=0.125 2023-06-25 08:39:12,081 INFO [train.py:996] (0/4) Epoch 11, batch 19750, loss[loss=0.2215, simple_loss=0.3178, pruned_loss=0.06257, over 21431.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3162, pruned_loss=0.0801, over 4273396.79 frames. ], batch size: 211, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:39:14,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1948176.0, ans=0.125 2023-06-25 08:39:14,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1948176.0, ans=0.0 2023-06-25 08:39:24,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1948176.0, ans=0.1 2023-06-25 08:40:08,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-25 08:40:36,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1948416.0, ans=0.2 2023-06-25 08:40:49,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.288e+02 1.013e+03 1.397e+03 2.237e+03 5.539e+03, threshold=2.794e+03, percent-clipped=30.0 2023-06-25 08:41:00,484 INFO [train.py:996] (0/4) Epoch 11, batch 19800, loss[loss=0.193, simple_loss=0.2641, pruned_loss=0.06099, over 21527.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3153, pruned_loss=0.08058, over 4278880.69 frames. 
], batch size: 212, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:42:31,912 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-25 08:42:42,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1948716.0, ans=0.5 2023-06-25 08:42:46,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-25 08:42:47,231 INFO [train.py:996] (0/4) Epoch 11, batch 19850, loss[loss=0.1977, simple_loss=0.2899, pruned_loss=0.05277, over 21746.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.307, pruned_loss=0.07555, over 4271758.97 frames. ], batch size: 351, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:42:49,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1948776.0, ans=0.125 2023-06-25 08:43:24,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1948836.0, ans=0.1 2023-06-25 08:43:51,793 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2023-06-25 08:44:08,772 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=13.09 vs. limit=15.0 2023-06-25 08:44:23,845 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.088e+02 7.609e+02 1.066e+03 1.634e+03 3.345e+03, threshold=2.132e+03, percent-clipped=4.0 2023-06-25 08:44:32,248 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:44:33,372 INFO [train.py:996] (0/4) Epoch 11, batch 19900, loss[loss=0.2049, simple_loss=0.2823, pruned_loss=0.0637, over 21568.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3072, pruned_loss=0.07278, over 4274551.82 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:45:04,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1949136.0, ans=0.125 2023-06-25 08:45:05,457 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-25 08:46:00,189 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-25 08:46:19,624 INFO [train.py:996] (0/4) Epoch 11, batch 19950, loss[loss=0.1883, simple_loss=0.2583, pruned_loss=0.05911, over 21552.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3005, pruned_loss=0.07294, over 4273301.23 frames. 
], batch size: 230, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:46:26,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1949376.0, ans=0.125 2023-06-25 08:47:53,977 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.809e+02 7.247e+02 1.068e+03 1.569e+03 2.873e+03, threshold=2.135e+03, percent-clipped=11.0 2023-06-25 08:48:03,745 INFO [train.py:996] (0/4) Epoch 11, batch 20000, loss[loss=0.257, simple_loss=0.324, pruned_loss=0.09502, over 21860.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3036, pruned_loss=0.07397, over 4276927.79 frames. ], batch size: 351, lr: 2.63e-03, grad_scale: 32.0 2023-06-25 08:48:37,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-25 08:48:46,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1949736.0, ans=0.0 2023-06-25 08:48:53,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1949796.0, ans=0.07 2023-06-25 08:48:57,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1949796.0, ans=0.1 2023-06-25 08:48:59,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1949796.0, ans=0.125 2023-06-25 08:49:20,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1949856.0, ans=0.125 2023-06-25 08:49:39,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1949916.0, ans=0.125 2023-06-25 08:49:45,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-25 08:49:45,806 INFO [train.py:996] (0/4) Epoch 11, batch 20050, loss[loss=0.2679, simple_loss=0.3329, pruned_loss=0.1015, over 21743.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3055, pruned_loss=0.07659, over 4275089.32 frames. ], batch size: 389, lr: 2.63e-03, grad_scale: 32.0 2023-06-25 08:50:32,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1950036.0, ans=0.125 2023-06-25 08:51:23,078 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.669e+02 8.021e+02 1.064e+03 1.748e+03 3.117e+03, threshold=2.127e+03, percent-clipped=13.0 2023-06-25 08:51:33,721 INFO [train.py:996] (0/4) Epoch 11, batch 20100, loss[loss=0.2586, simple_loss=0.3571, pruned_loss=0.07999, over 21858.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3081, pruned_loss=0.07862, over 4279294.81 frames. ], batch size: 371, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:51:48,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-25 08:52:22,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1950336.0, ans=0.125 2023-06-25 08:53:12,524 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.87 vs. 
limit=22.5 2023-06-25 08:53:28,313 INFO [train.py:996] (0/4) Epoch 11, batch 20150, loss[loss=0.2975, simple_loss=0.362, pruned_loss=0.1165, over 21338.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3164, pruned_loss=0.08059, over 4276682.76 frames. ], batch size: 507, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:53:50,259 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.80 vs. limit=22.5 2023-06-25 08:53:52,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1950636.0, ans=0.025 2023-06-25 08:54:25,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1950696.0, ans=0.125 2023-06-25 08:55:17,093 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.413e+02 8.388e+02 1.067e+03 1.531e+03 4.094e+03, threshold=2.133e+03, percent-clipped=12.0 2023-06-25 08:55:17,579 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.577e-02 2023-06-25 08:55:25,386 INFO [train.py:996] (0/4) Epoch 11, batch 20200, loss[loss=0.2584, simple_loss=0.3863, pruned_loss=0.06519, over 19936.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3226, pruned_loss=0.08406, over 4276002.27 frames. ], batch size: 702, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:55:58,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1950936.0, ans=0.07 2023-06-25 08:56:00,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1950936.0, ans=0.125 2023-06-25 08:56:05,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1950996.0, ans=0.125 2023-06-25 08:56:31,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-25 08:56:32,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-25 08:57:10,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1951116.0, ans=0.125 2023-06-25 08:57:12,759 INFO [train.py:996] (0/4) Epoch 11, batch 20250, loss[loss=0.2261, simple_loss=0.3117, pruned_loss=0.07021, over 21735.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3235, pruned_loss=0.0827, over 4277072.31 frames. ], batch size: 247, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:57:26,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1951176.0, ans=0.025 2023-06-25 08:57:42,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. 
limit=15.0 2023-06-25 08:58:19,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1951356.0, ans=0.125 2023-06-25 08:58:43,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1951416.0, ans=0.0 2023-06-25 08:58:52,151 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.051e+02 7.038e+02 1.016e+03 1.334e+03 4.106e+03, threshold=2.032e+03, percent-clipped=11.0 2023-06-25 08:58:55,876 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:59:05,818 INFO [train.py:996] (0/4) Epoch 11, batch 20300, loss[loss=0.2214, simple_loss=0.3006, pruned_loss=0.07109, over 21314.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3217, pruned_loss=0.08016, over 4271595.46 frames. ], batch size: 194, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:59:25,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1951536.0, ans=0.2 2023-06-25 08:59:43,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1951596.0, ans=0.0 2023-06-25 08:59:45,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1951596.0, ans=0.125 2023-06-25 09:00:40,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1951716.0, ans=0.1 2023-06-25 09:00:46,183 INFO [train.py:996] (0/4) Epoch 11, batch 20350, loss[loss=0.2498, simple_loss=0.3174, pruned_loss=0.09107, over 21267.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3198, pruned_loss=0.07918, over 4270059.09 frames. ], batch size: 143, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:01:06,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1951836.0, ans=0.035 2023-06-25 09:01:21,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1951836.0, ans=0.025 2023-06-25 09:02:15,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1952016.0, ans=0.125 2023-06-25 09:02:24,592 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.181e+02 7.498e+02 1.071e+03 1.543e+03 3.638e+03, threshold=2.141e+03, percent-clipped=16.0 2023-06-25 09:02:25,878 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-25 09:02:31,887 INFO [train.py:996] (0/4) Epoch 11, batch 20400, loss[loss=0.2686, simple_loss=0.3555, pruned_loss=0.09086, over 21420.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3229, pruned_loss=0.08225, over 4271006.36 frames. 
], batch size: 548, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:02:36,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1952076.0, ans=0.125 2023-06-25 09:03:07,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1952136.0, ans=0.0 2023-06-25 09:03:09,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1952196.0, ans=0.015 2023-06-25 09:03:29,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1952196.0, ans=0.2 2023-06-25 09:03:45,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1952256.0, ans=0.125 2023-06-25 09:04:03,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-25 09:04:05,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1952316.0, ans=0.0 2023-06-25 09:04:16,705 INFO [train.py:996] (0/4) Epoch 11, batch 20450, loss[loss=0.2139, simple_loss=0.2916, pruned_loss=0.06806, over 16419.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3239, pruned_loss=0.08482, over 4259193.89 frames. ], batch size: 62, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:04:17,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1952376.0, ans=0.0 2023-06-25 09:04:34,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1952376.0, ans=0.0 2023-06-25 09:05:05,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1952496.0, ans=0.035 2023-06-25 09:05:20,484 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=15.0 2023-06-25 09:05:55,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.393e+02 8.343e+02 1.181e+03 1.747e+03 3.039e+03, threshold=2.362e+03, percent-clipped=12.0 2023-06-25 09:06:02,408 INFO [train.py:996] (0/4) Epoch 11, batch 20500, loss[loss=0.2307, simple_loss=0.2953, pruned_loss=0.08305, over 21792.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3187, pruned_loss=0.08497, over 4255613.09 frames. ], batch size: 124, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:07:02,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1952856.0, ans=0.1 2023-06-25 09:07:09,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1952856.0, ans=0.125 2023-06-25 09:07:44,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1952916.0, ans=0.2 2023-06-25 09:07:48,699 INFO [train.py:996] (0/4) Epoch 11, batch 20550, loss[loss=0.178, simple_loss=0.2461, pruned_loss=0.05496, over 21159.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3114, pruned_loss=0.08357, over 4252962.10 frames. 
], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:07:56,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1952976.0, ans=0.125 2023-06-25 09:08:20,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1953036.0, ans=0.025 2023-06-25 09:09:28,348 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.073e+02 8.204e+02 1.449e+03 2.191e+03 5.725e+03, threshold=2.898e+03, percent-clipped=18.0 2023-06-25 09:09:40,429 INFO [train.py:996] (0/4) Epoch 11, batch 20600, loss[loss=0.2418, simple_loss=0.316, pruned_loss=0.08382, over 22085.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.314, pruned_loss=0.08149, over 4240348.45 frames. ], batch size: 119, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:09:41,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1953276.0, ans=0.0 2023-06-25 09:09:53,294 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=15.0 2023-06-25 09:11:13,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1953516.0, ans=0.2 2023-06-25 09:11:16,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-25 09:11:26,286 INFO [train.py:996] (0/4) Epoch 11, batch 20650, loss[loss=0.1888, simple_loss=0.2547, pruned_loss=0.06147, over 21454.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3101, pruned_loss=0.08175, over 4252831.01 frames. ], batch size: 195, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:12:52,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1953816.0, ans=0.1 2023-06-25 09:13:02,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-25 09:13:04,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.039e+02 6.601e+02 8.640e+02 1.224e+03 2.485e+03, threshold=1.728e+03, percent-clipped=0.0 2023-06-25 09:13:16,543 INFO [train.py:996] (0/4) Epoch 11, batch 20700, loss[loss=0.2678, simple_loss=0.3628, pruned_loss=0.08644, over 21229.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3025, pruned_loss=0.07789, over 4243404.19 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:13:24,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1953876.0, ans=0.2 2023-06-25 09:13:43,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1953936.0, ans=0.1 2023-06-25 09:13:46,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1953936.0, ans=0.125 2023-06-25 09:14:06,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1953996.0, ans=0.0 2023-06-25 09:15:07,991 INFO [train.py:996] (0/4) Epoch 11, batch 20750, loss[loss=0.2507, simple_loss=0.3748, pruned_loss=0.06325, over 20800.00 frames. 
], tot_loss[loss=0.2294, simple_loss=0.305, pruned_loss=0.07693, over 4244411.24 frames. ], batch size: 607, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:16:04,170 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-25 09:16:39,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1954416.0, ans=0.0 2023-06-25 09:16:48,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.098e+02 8.092e+02 1.287e+03 1.980e+03 4.706e+03, threshold=2.574e+03, percent-clipped=34.0 2023-06-25 09:16:55,084 INFO [train.py:996] (0/4) Epoch 11, batch 20800, loss[loss=0.2091, simple_loss=0.273, pruned_loss=0.07253, over 21830.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3099, pruned_loss=0.07798, over 4243602.21 frames. ], batch size: 118, lr: 2.63e-03, grad_scale: 32.0 2023-06-25 09:17:28,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1954536.0, ans=0.125 2023-06-25 09:17:29,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1954536.0, ans=0.125 2023-06-25 09:18:01,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1954656.0, ans=0.2 2023-06-25 09:18:06,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1954656.0, ans=0.125 2023-06-25 09:18:35,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-25 09:18:40,356 INFO [train.py:996] (0/4) Epoch 11, batch 20850, loss[loss=0.2311, simple_loss=0.3044, pruned_loss=0.07884, over 22004.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3017, pruned_loss=0.07577, over 4245696.44 frames. ], batch size: 113, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:20:02,504 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=22.5 2023-06-25 09:20:06,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1954956.0, ans=0.025 2023-06-25 09:20:13,454 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:20:18,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1955016.0, ans=0.1 2023-06-25 09:20:20,819 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.513e+02 8.708e+02 1.139e+03 1.646e+03 3.626e+03, threshold=2.277e+03, percent-clipped=8.0 2023-06-25 09:20:24,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1955076.0, ans=0.125 2023-06-25 09:20:25,826 INFO [train.py:996] (0/4) Epoch 11, batch 20900, loss[loss=0.246, simple_loss=0.3207, pruned_loss=0.08561, over 21781.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3021, pruned_loss=0.07691, over 4259484.13 frames. 
], batch size: 391, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:20:58,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1955136.0, ans=0.125 2023-06-25 09:21:36,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1955256.0, ans=0.125 2023-06-25 09:21:40,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1955256.0, ans=0.2 2023-06-25 09:22:08,611 INFO [train.py:996] (0/4) Epoch 11, batch 20950, loss[loss=0.1946, simple_loss=0.272, pruned_loss=0.05857, over 21503.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2987, pruned_loss=0.0738, over 4266402.44 frames. ], batch size: 212, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:22:14,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=22.5 2023-06-25 09:22:22,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1955436.0, ans=0.2 2023-06-25 09:22:53,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1955496.0, ans=0.0 2023-06-25 09:23:28,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.14 vs. limit=15.0 2023-06-25 09:23:40,537 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.098e+02 8.503e+02 1.270e+03 1.885e+03 4.065e+03, threshold=2.540e+03, percent-clipped=13.0 2023-06-25 09:23:45,558 INFO [train.py:996] (0/4) Epoch 11, batch 21000, loss[loss=0.2526, simple_loss=0.3201, pruned_loss=0.09257, over 21933.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2973, pruned_loss=0.07417, over 4274528.87 frames. ], batch size: 316, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:23:45,560 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 09:23:56,512 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.4469, 4.9423, 2.9959, 2.3961], device='cuda:0') 2023-06-25 09:24:03,613 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2627, simple_loss=0.3591, pruned_loss=0.08313, over 1796401.00 frames. 2023-06-25 09:24:03,614 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 09:24:04,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1955676.0, ans=0.125 2023-06-25 09:24:09,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1955676.0, ans=0.2 2023-06-25 09:25:46,567 INFO [train.py:996] (0/4) Epoch 11, batch 21050, loss[loss=0.2478, simple_loss=0.3222, pruned_loss=0.08665, over 16118.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2963, pruned_loss=0.07452, over 4272732.86 frames. ], batch size: 64, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:26:42,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-25 09:26:49,778 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.09 vs. 
limit=12.0 2023-06-25 09:27:20,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-25 09:27:27,191 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.534e+02 6.581e+02 8.702e+02 1.278e+03 3.016e+03, threshold=1.740e+03, percent-clipped=3.0 2023-06-25 09:27:30,707 INFO [train.py:996] (0/4) Epoch 11, batch 21100, loss[loss=0.2107, simple_loss=0.2725, pruned_loss=0.07444, over 21674.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2931, pruned_loss=0.07461, over 4269114.28 frames. ], batch size: 248, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:27:31,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-25 09:28:32,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1956396.0, ans=0.125 2023-06-25 09:28:40,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1956456.0, ans=0.125 2023-06-25 09:29:11,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1956516.0, ans=0.07 2023-06-25 09:29:11,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1956516.0, ans=0.125 2023-06-25 09:29:15,556 INFO [train.py:996] (0/4) Epoch 11, batch 21150, loss[loss=0.1879, simple_loss=0.2499, pruned_loss=0.06293, over 21595.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2893, pruned_loss=0.07467, over 4264101.34 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:29:19,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1956576.0, ans=0.125 2023-06-25 09:29:24,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1956576.0, ans=0.125 2023-06-25 09:29:24,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1956576.0, ans=10.0 2023-06-25 09:29:27,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1956576.0, ans=0.0 2023-06-25 09:29:30,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1956636.0, ans=0.125 2023-06-25 09:30:50,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.81 vs. limit=10.0 2023-06-25 09:30:51,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1956816.0, ans=0.125 2023-06-25 09:30:55,569 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.582e+02 7.371e+02 1.068e+03 1.667e+03 5.764e+03, threshold=2.137e+03, percent-clipped=24.0 2023-06-25 09:30:59,140 INFO [train.py:996] (0/4) Epoch 11, batch 21200, loss[loss=0.2133, simple_loss=0.2851, pruned_loss=0.07075, over 21336.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2855, pruned_loss=0.07371, over 4269326.19 frames. 
], batch size: 471, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:31:53,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1956996.0, ans=0.0 2023-06-25 09:31:57,205 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-25 09:32:19,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.74 vs. limit=22.5 2023-06-25 09:32:38,426 INFO [train.py:996] (0/4) Epoch 11, batch 21250, loss[loss=0.2044, simple_loss=0.2746, pruned_loss=0.06705, over 16061.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2829, pruned_loss=0.07387, over 4258555.19 frames. ], batch size: 65, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:32:44,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1957176.0, ans=0.1 2023-06-25 09:32:47,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1957176.0, ans=0.125 2023-06-25 09:33:10,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1957236.0, ans=0.04949747468305833 2023-06-25 09:33:40,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1957356.0, ans=0.2 2023-06-25 09:33:46,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1957356.0, ans=0.0 2023-06-25 09:33:46,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1957356.0, ans=0.0 2023-06-25 09:34:16,848 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 8.531e+02 1.344e+03 2.187e+03 4.666e+03, threshold=2.689e+03, percent-clipped=25.0 2023-06-25 09:34:18,319 INFO [train.py:996] (0/4) Epoch 11, batch 21300, loss[loss=0.2575, simple_loss=0.3218, pruned_loss=0.09654, over 21609.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2894, pruned_loss=0.07581, over 4267382.23 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:34:28,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1957476.0, ans=0.125 2023-06-25 09:34:50,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1957536.0, ans=0.125 2023-06-25 09:34:52,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1957536.0, ans=0.0 2023-06-25 09:35:16,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1957596.0, ans=0.125 2023-06-25 09:35:42,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1957656.0, ans=0.0 2023-06-25 09:35:57,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-25 09:36:04,250 INFO [train.py:996] (0/4) Epoch 11, batch 21350, loss[loss=0.2074, simple_loss=0.2795, pruned_loss=0.06768, over 21163.00 frames. 
], tot_loss[loss=0.2237, simple_loss=0.2945, pruned_loss=0.0764, over 4265284.98 frames. ], batch size: 143, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:36:15,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1957776.0, ans=0.0 2023-06-25 09:36:33,270 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.42 vs. limit=10.0 2023-06-25 09:37:05,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1957896.0, ans=0.0 2023-06-25 09:37:07,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1957896.0, ans=0.0 2023-06-25 09:37:47,612 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.34 vs. limit=10.0 2023-06-25 09:37:55,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.389e+02 7.142e+02 1.027e+03 1.660e+03 3.891e+03, threshold=2.053e+03, percent-clipped=5.0 2023-06-25 09:37:57,556 INFO [train.py:996] (0/4) Epoch 11, batch 21400, loss[loss=0.2081, simple_loss=0.2664, pruned_loss=0.07489, over 20228.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2982, pruned_loss=0.07632, over 4264087.54 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:38:59,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1958196.0, ans=0.125 2023-06-25 09:39:13,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1958256.0, ans=0.0 2023-06-25 09:39:16,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1958256.0, ans=0.2 2023-06-25 09:39:41,667 INFO [train.py:996] (0/4) Epoch 11, batch 21450, loss[loss=0.2192, simple_loss=0.2826, pruned_loss=0.07788, over 21423.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2995, pruned_loss=0.07692, over 4265183.84 frames. ], batch size: 144, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:40:03,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1958436.0, ans=0.2 2023-06-25 09:40:17,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1958436.0, ans=0.125 2023-06-25 09:41:00,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1958556.0, ans=0.2 2023-06-25 09:41:06,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-25 09:41:25,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.203e+02 7.277e+02 9.911e+02 1.372e+03 2.622e+03, threshold=1.982e+03, percent-clipped=4.0 2023-06-25 09:41:27,010 INFO [train.py:996] (0/4) Epoch 11, batch 21500, loss[loss=0.2685, simple_loss=0.3367, pruned_loss=0.1002, over 21878.00 frames. ], tot_loss[loss=0.228, simple_loss=0.299, pruned_loss=0.07849, over 4268104.66 frames. 
], batch size: 107, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:41:36,346 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-25 09:42:26,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1958796.0, ans=0.1 2023-06-25 09:43:06,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1958916.0, ans=0.1 2023-06-25 09:43:11,052 INFO [train.py:996] (0/4) Epoch 11, batch 21550, loss[loss=0.1865, simple_loss=0.2548, pruned_loss=0.05908, over 21282.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2936, pruned_loss=0.07658, over 4255281.32 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:44:05,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1959096.0, ans=0.0 2023-06-25 09:44:51,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.933e+02 7.990e+02 1.429e+03 2.000e+03 5.379e+03, threshold=2.857e+03, percent-clipped=25.0 2023-06-25 09:44:53,174 INFO [train.py:996] (0/4) Epoch 11, batch 21600, loss[loss=0.194, simple_loss=0.2617, pruned_loss=0.06314, over 21602.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2892, pruned_loss=0.0746, over 4259389.53 frames. ], batch size: 415, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:45:26,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1959336.0, ans=0.0 2023-06-25 09:46:03,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1959456.0, ans=0.125 2023-06-25 09:46:40,929 INFO [train.py:996] (0/4) Epoch 11, batch 21650, loss[loss=0.218, simple_loss=0.3146, pruned_loss=0.06072, over 21813.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2927, pruned_loss=0.0724, over 4263995.23 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:47:05,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1959576.0, ans=0.125 2023-06-25 09:47:26,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.92 vs. limit=22.5 2023-06-25 09:47:30,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1959696.0, ans=0.0 2023-06-25 09:47:41,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1959696.0, ans=0.0 2023-06-25 09:47:52,258 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=12.0 2023-06-25 09:48:01,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1959816.0, ans=0.95 2023-06-25 09:48:25,906 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.099e+02 8.539e+02 1.351e+03 1.899e+03 3.491e+03, threshold=2.702e+03, percent-clipped=7.0 2023-06-25 09:48:27,792 INFO [train.py:996] (0/4) Epoch 11, batch 21700, loss[loss=0.2017, simple_loss=0.2762, pruned_loss=0.06355, over 21797.00 frames. 
], tot_loss[loss=0.2168, simple_loss=0.2932, pruned_loss=0.07015, over 4255064.17 frames. ], batch size: 124, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:48:35,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1959876.0, ans=0.0 2023-06-25 09:49:22,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1959996.0, ans=0.125 2023-06-25 09:49:56,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1960116.0, ans=0.0 2023-06-25 09:50:12,978 INFO [train.py:996] (0/4) Epoch 11, batch 21750, loss[loss=0.2253, simple_loss=0.283, pruned_loss=0.0838, over 21526.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2889, pruned_loss=0.07055, over 4263196.42 frames. ], batch size: 442, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:50:23,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1960176.0, ans=0.125 2023-06-25 09:50:41,006 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-25 09:51:09,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1960296.0, ans=0.125 2023-06-25 09:51:45,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1960416.0, ans=0.125 2023-06-25 09:51:52,666 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-06-25 09:51:58,431 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.456e+02 8.216e+02 1.100e+03 1.452e+03 3.027e+03, threshold=2.200e+03, percent-clipped=1.0 2023-06-25 09:51:59,883 INFO [train.py:996] (0/4) Epoch 11, batch 21800, loss[loss=0.2001, simple_loss=0.2619, pruned_loss=0.06914, over 21594.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2861, pruned_loss=0.07158, over 4255217.25 frames. ], batch size: 264, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:53:45,100 INFO [train.py:996] (0/4) Epoch 11, batch 21850, loss[loss=0.2147, simple_loss=0.3148, pruned_loss=0.05731, over 21852.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2935, pruned_loss=0.07273, over 4245523.85 frames. ], batch size: 351, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:53:52,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1960776.0, ans=0.0 2023-06-25 09:54:01,716 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-25 09:54:04,908 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=15.0 2023-06-25 09:54:20,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.71 vs. 
limit=8.0 2023-06-25 09:54:27,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1960896.0, ans=0.125 2023-06-25 09:54:36,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1960896.0, ans=0.2 2023-06-25 09:54:38,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1960896.0, ans=0.125 2023-06-25 09:55:01,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1960956.0, ans=0.125 2023-06-25 09:55:13,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1961016.0, ans=0.125 2023-06-25 09:55:27,837 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.221e+02 7.360e+02 1.053e+03 1.458e+03 3.571e+03, threshold=2.107e+03, percent-clipped=7.0 2023-06-25 09:55:35,149 INFO [train.py:996] (0/4) Epoch 11, batch 21900, loss[loss=0.2019, simple_loss=0.26, pruned_loss=0.07196, over 21657.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2946, pruned_loss=0.07455, over 4250304.64 frames. ], batch size: 231, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:55:42,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1961076.0, ans=0.125 2023-06-25 09:55:59,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1961136.0, ans=0.1 2023-06-25 09:56:28,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-25 09:56:54,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1961316.0, ans=0.0 2023-06-25 09:57:11,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1961316.0, ans=0.125 2023-06-25 09:57:20,428 INFO [train.py:996] (0/4) Epoch 11, batch 21950, loss[loss=0.1613, simple_loss=0.2253, pruned_loss=0.04863, over 16235.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2888, pruned_loss=0.07314, over 4244410.20 frames. ], batch size: 64, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 09:57:50,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1961436.0, ans=0.125 2023-06-25 09:58:57,729 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.305e+02 6.629e+02 8.784e+02 1.230e+03 3.737e+03, threshold=1.757e+03, percent-clipped=5.0 2023-06-25 09:58:59,383 INFO [train.py:996] (0/4) Epoch 11, batch 22000, loss[loss=0.1602, simple_loss=0.2297, pruned_loss=0.0454, over 21237.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2819, pruned_loss=0.06912, over 4256848.48 frames. 
], batch size: 159, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 09:59:34,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1961736.0, ans=0.0 2023-06-25 10:00:35,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1961916.0, ans=0.07 2023-06-25 10:00:38,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1961916.0, ans=0.1 2023-06-25 10:00:50,117 INFO [train.py:996] (0/4) Epoch 11, batch 22050, loss[loss=0.1926, simple_loss=0.2595, pruned_loss=0.06278, over 20758.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2851, pruned_loss=0.07021, over 4247566.85 frames. ], batch size: 608, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:01:06,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1961976.0, ans=0.1 2023-06-25 10:01:14,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1962036.0, ans=0.1 2023-06-25 10:02:37,407 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.423e+02 9.011e+02 1.269e+03 1.922e+03 5.194e+03, threshold=2.539e+03, percent-clipped=30.0 2023-06-25 10:02:37,431 INFO [train.py:996] (0/4) Epoch 11, batch 22100, loss[loss=0.3145, simple_loss=0.4131, pruned_loss=0.108, over 19876.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2988, pruned_loss=0.0761, over 4249395.91 frames. ], batch size: 702, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:03:05,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1962336.0, ans=0.1 2023-06-25 10:03:27,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1962396.0, ans=0.0 2023-06-25 10:03:45,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1962456.0, ans=0.07 2023-06-25 10:04:23,291 INFO [train.py:996] (0/4) Epoch 11, batch 22150, loss[loss=0.2552, simple_loss=0.3234, pruned_loss=0.09354, over 21771.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3044, pruned_loss=0.07897, over 4261013.18 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:04:25,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1962576.0, ans=0.0 2023-06-25 10:04:46,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1962636.0, ans=0.1 2023-06-25 10:05:47,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1962816.0, ans=0.0 2023-06-25 10:06:10,665 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.244e+02 8.641e+02 1.312e+03 2.175e+03 4.145e+03, threshold=2.624e+03, percent-clipped=16.0 2023-06-25 10:06:10,690 INFO [train.py:996] (0/4) Epoch 11, batch 22200, loss[loss=0.338, simple_loss=0.4073, pruned_loss=0.1344, over 21639.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3073, pruned_loss=0.08084, over 4270421.03 frames. 
], batch size: 508, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:06:44,383 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-25 10:06:49,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-25 10:07:46,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=15.0 2023-06-25 10:07:56,914 INFO [train.py:996] (0/4) Epoch 11, batch 22250, loss[loss=0.2413, simple_loss=0.3197, pruned_loss=0.08148, over 21897.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3123, pruned_loss=0.08175, over 4271583.33 frames. ], batch size: 316, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:08:48,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1963296.0, ans=10.0 2023-06-25 10:08:50,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.58 vs. limit=15.0 2023-06-25 10:09:44,211 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=15.0 2023-06-25 10:09:44,568 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.502e+02 7.191e+02 1.032e+03 1.470e+03 3.757e+03, threshold=2.063e+03, percent-clipped=7.0 2023-06-25 10:09:44,596 INFO [train.py:996] (0/4) Epoch 11, batch 22300, loss[loss=0.2207, simple_loss=0.2879, pruned_loss=0.07673, over 21365.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3143, pruned_loss=0.08307, over 4276729.07 frames. ], batch size: 159, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:09:45,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1963476.0, ans=0.0 2023-06-25 10:10:18,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1963536.0, ans=0.125 2023-06-25 10:10:44,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1963656.0, ans=0.125 2023-06-25 10:11:16,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1963716.0, ans=0.0 2023-06-25 10:11:34,598 INFO [train.py:996] (0/4) Epoch 11, batch 22350, loss[loss=0.2357, simple_loss=0.2978, pruned_loss=0.08673, over 21801.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3116, pruned_loss=0.08376, over 4286872.33 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:13:16,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1964016.0, ans=0.0 2023-06-25 10:13:21,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.769e+02 7.078e+02 9.389e+02 1.336e+03 2.790e+03, threshold=1.878e+03, percent-clipped=4.0 2023-06-25 10:13:21,930 INFO [train.py:996] (0/4) Epoch 11, batch 22400, loss[loss=0.2056, simple_loss=0.2924, pruned_loss=0.05938, over 21635.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3085, pruned_loss=0.08116, over 4287774.17 frames. 
], batch size: 263, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 10:14:06,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1964196.0, ans=0.125 2023-06-25 10:15:02,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1964316.0, ans=0.0 2023-06-25 10:15:05,403 INFO [train.py:996] (0/4) Epoch 11, batch 22450, loss[loss=0.1915, simple_loss=0.2527, pruned_loss=0.06512, over 21615.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3022, pruned_loss=0.07959, over 4286881.81 frames. ], batch size: 231, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:15:42,301 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-25 10:15:52,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-25 10:15:57,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1964496.0, ans=0.125 2023-06-25 10:16:12,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1964496.0, ans=0.2 2023-06-25 10:16:16,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1964556.0, ans=0.0 2023-06-25 10:16:53,983 INFO [train.py:996] (0/4) Epoch 11, batch 22500, loss[loss=0.3266, simple_loss=0.3886, pruned_loss=0.1323, over 21398.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.298, pruned_loss=0.07966, over 4283700.65 frames. ], batch size: 507, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:16:55,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.803e+02 7.459e+02 1.048e+03 1.318e+03 4.554e+03, threshold=2.097e+03, percent-clipped=12.0 2023-06-25 10:17:41,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1964796.0, ans=0.125 2023-06-25 10:18:01,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1964856.0, ans=0.2 2023-06-25 10:18:17,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1964916.0, ans=0.0 2023-06-25 10:18:41,457 INFO [train.py:996] (0/4) Epoch 11, batch 22550, loss[loss=0.2359, simple_loss=0.3147, pruned_loss=0.07857, over 21841.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3008, pruned_loss=0.07995, over 4288940.09 frames. ], batch size: 298, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:19:25,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1965096.0, ans=0.125 2023-06-25 10:19:54,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1965156.0, ans=0.125 2023-06-25 10:20:36,198 INFO [train.py:996] (0/4) Epoch 11, batch 22600, loss[loss=0.2233, simple_loss=0.3229, pruned_loss=0.06184, over 20094.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3051, pruned_loss=0.08068, over 4289372.00 frames. 
], batch size: 703, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:20:39,514 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.096e+02 1.052e+03 1.426e+03 2.192e+03 4.902e+03, threshold=2.852e+03, percent-clipped=27.0 2023-06-25 10:20:52,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1965276.0, ans=0.0 2023-06-25 10:20:54,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1965276.0, ans=0.125 2023-06-25 10:21:25,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-25 10:21:28,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1965396.0, ans=0.1 2023-06-25 10:21:34,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1965456.0, ans=0.2 2023-06-25 10:21:57,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1965516.0, ans=0.0 2023-06-25 10:22:04,080 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.25 vs. limit=10.0 2023-06-25 10:22:20,059 INFO [train.py:996] (0/4) Epoch 11, batch 22650, loss[loss=0.2017, simple_loss=0.2618, pruned_loss=0.07082, over 21103.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3021, pruned_loss=0.08014, over 4271745.22 frames. ], batch size: 159, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:22:38,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1965576.0, ans=0.125 2023-06-25 10:22:49,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1965636.0, ans=0.125 2023-06-25 10:23:13,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.93 vs. limit=15.0 2023-06-25 10:23:38,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1965756.0, ans=0.125 2023-06-25 10:23:38,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1965756.0, ans=0.0 2023-06-25 10:24:02,917 INFO [train.py:996] (0/4) Epoch 11, batch 22700, loss[loss=0.1885, simple_loss=0.2543, pruned_loss=0.06138, over 21597.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2951, pruned_loss=0.07856, over 4271592.73 frames. 
], batch size: 247, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:24:06,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.859e+02 7.606e+02 1.053e+03 1.643e+03 3.332e+03, threshold=2.107e+03, percent-clipped=4.0 2023-06-25 10:24:50,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1965996.0, ans=0.025 2023-06-25 10:25:26,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1966116.0, ans=0.125 2023-06-25 10:25:31,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1966116.0, ans=0.0 2023-06-25 10:25:50,174 INFO [train.py:996] (0/4) Epoch 11, batch 22750, loss[loss=0.2844, simple_loss=0.3344, pruned_loss=0.1172, over 21479.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2964, pruned_loss=0.08037, over 4275042.88 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:26:03,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-25 10:26:11,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1966236.0, ans=0.125 2023-06-25 10:27:40,904 INFO [train.py:996] (0/4) Epoch 11, batch 22800, loss[loss=0.2499, simple_loss=0.3126, pruned_loss=0.0936, over 21703.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3006, pruned_loss=0.08242, over 4284868.27 frames. ], batch size: 391, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:27:44,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.117e+02 8.724e+02 1.380e+03 2.378e+03 6.132e+03, threshold=2.761e+03, percent-clipped=34.0 2023-06-25 10:27:58,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1966536.0, ans=0.125 2023-06-25 10:28:02,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1966536.0, ans=0.0 2023-06-25 10:28:34,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.75 vs. limit=10.0 2023-06-25 10:29:25,755 INFO [train.py:996] (0/4) Epoch 11, batch 22850, loss[loss=0.2161, simple_loss=0.2808, pruned_loss=0.0757, over 21656.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.2968, pruned_loss=0.08121, over 4273888.05 frames. ], batch size: 332, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:29:32,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1966776.0, ans=0.125 2023-06-25 10:29:35,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1966776.0, ans=0.125 2023-06-25 10:30:03,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1966896.0, ans=0.1 2023-06-25 10:30:04,397 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=22.5 2023-06-25 10:31:12,059 INFO [train.py:996] (0/4) Epoch 11, batch 22900, loss[loss=0.1869, simple_loss=0.263, pruned_loss=0.05545, over 21098.00 frames. 
], tot_loss[loss=0.2306, simple_loss=0.2995, pruned_loss=0.08082, over 4278421.90 frames. ], batch size: 143, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:31:15,875 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.509e+02 6.842e+02 1.024e+03 1.500e+03 4.089e+03, threshold=2.047e+03, percent-clipped=2.0 2023-06-25 10:32:46,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1967316.0, ans=0.0 2023-06-25 10:32:58,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1967376.0, ans=0.125 2023-06-25 10:32:59,808 INFO [train.py:996] (0/4) Epoch 11, batch 22950, loss[loss=0.2163, simple_loss=0.3306, pruned_loss=0.05101, over 21756.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3101, pruned_loss=0.07856, over 4277286.84 frames. ], batch size: 282, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:33:00,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1967376.0, ans=0.2 2023-06-25 10:33:04,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1967376.0, ans=0.2 2023-06-25 10:33:34,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1967436.0, ans=0.125 2023-06-25 10:33:41,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2023-06-25 10:34:36,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1967616.0, ans=0.05 2023-06-25 10:34:44,069 INFO [train.py:996] (0/4) Epoch 11, batch 23000, loss[loss=0.249, simple_loss=0.3138, pruned_loss=0.09213, over 21917.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3091, pruned_loss=0.077, over 4281906.18 frames. ], batch size: 333, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:34:46,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1967676.0, ans=0.0 2023-06-25 10:34:47,333 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.236e+02 8.204e+02 1.340e+03 2.035e+03 4.542e+03, threshold=2.680e+03, percent-clipped=23.0 2023-06-25 10:34:51,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1967676.0, ans=0.0 2023-06-25 10:35:08,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1967736.0, ans=0.05 2023-06-25 10:35:46,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1967796.0, ans=10.0 2023-06-25 10:36:31,929 INFO [train.py:996] (0/4) Epoch 11, batch 23050, loss[loss=0.2503, simple_loss=0.33, pruned_loss=0.08524, over 21583.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3112, pruned_loss=0.07835, over 4278561.60 frames. ], batch size: 415, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:36:42,915 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-328000.pt 2023-06-25 10:36:45,627 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.36 vs. 
limit=15.0 2023-06-25 10:36:53,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1967976.0, ans=0.125 2023-06-25 10:37:02,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1968036.0, ans=0.125 2023-06-25 10:37:46,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1968156.0, ans=0.125 2023-06-25 10:38:01,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1968216.0, ans=0.125 2023-06-25 10:38:10,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1968216.0, ans=0.1 2023-06-25 10:38:18,107 INFO [train.py:996] (0/4) Epoch 11, batch 23100, loss[loss=0.2175, simple_loss=0.2786, pruned_loss=0.0782, over 15685.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3066, pruned_loss=0.07816, over 4273288.88 frames. ], batch size: 60, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:38:29,168 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.702e+02 7.516e+02 1.022e+03 1.433e+03 4.307e+03, threshold=2.044e+03, percent-clipped=3.0 2023-06-25 10:38:54,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1968336.0, ans=0.1 2023-06-25 10:39:31,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1968456.0, ans=0.0 2023-06-25 10:39:41,735 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-25 10:40:00,068 INFO [train.py:996] (0/4) Epoch 11, batch 23150, loss[loss=0.2345, simple_loss=0.3005, pruned_loss=0.08423, over 21803.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.301, pruned_loss=0.07737, over 4276746.75 frames. ], batch size: 414, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:40:51,485 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.66 vs. limit=15.0 2023-06-25 10:41:11,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1968756.0, ans=0.95 2023-06-25 10:41:12,374 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2023-06-25 10:41:35,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1968816.0, ans=0.125 2023-06-25 10:41:38,208 INFO [train.py:996] (0/4) Epoch 11, batch 23200, loss[loss=0.24, simple_loss=0.3243, pruned_loss=0.07787, over 17581.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3014, pruned_loss=0.07856, over 4281232.78 frames. 
], batch size: 60, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:41:43,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.976e+02 7.986e+02 1.089e+03 1.684e+03 3.717e+03, threshold=2.178e+03, percent-clipped=18.0 2023-06-25 10:42:47,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1969056.0, ans=0.025 2023-06-25 10:42:55,739 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:43:29,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1969176.0, ans=0.1 2023-06-25 10:43:29,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1969176.0, ans=0.2 2023-06-25 10:43:30,230 INFO [train.py:996] (0/4) Epoch 11, batch 23250, loss[loss=0.2529, simple_loss=0.3313, pruned_loss=0.08727, over 19911.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3007, pruned_loss=0.07848, over 4285221.01 frames. ], batch size: 702, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:43:52,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1969236.0, ans=0.0 2023-06-25 10:45:17,604 INFO [train.py:996] (0/4) Epoch 11, batch 23300, loss[loss=0.2518, simple_loss=0.368, pruned_loss=0.06781, over 21747.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3062, pruned_loss=0.07993, over 4283643.41 frames. ], batch size: 332, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:45:22,829 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.331e+02 7.944e+02 1.056e+03 1.535e+03 4.546e+03, threshold=2.112e+03, percent-clipped=10.0 2023-06-25 10:45:47,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1969536.0, ans=0.0 2023-06-25 10:46:26,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1969656.0, ans=0.1 2023-06-25 10:46:51,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1969716.0, ans=0.0 2023-06-25 10:47:03,383 INFO [train.py:996] (0/4) Epoch 11, batch 23350, loss[loss=0.1643, simple_loss=0.2554, pruned_loss=0.03656, over 21828.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3108, pruned_loss=0.07982, over 4281385.23 frames. 
], batch size: 317, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:47:26,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1969776.0, ans=0.025 2023-06-25 10:47:52,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1969896.0, ans=0.125 2023-06-25 10:47:57,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1969896.0, ans=0.2 2023-06-25 10:48:09,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1969956.0, ans=0.2 2023-06-25 10:48:35,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1970016.0, ans=0.1 2023-06-25 10:48:35,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1970016.0, ans=0.125 2023-06-25 10:48:39,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1970016.0, ans=0.125 2023-06-25 10:48:54,036 INFO [train.py:996] (0/4) Epoch 11, batch 23400, loss[loss=0.168, simple_loss=0.23, pruned_loss=0.053, over 20028.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3034, pruned_loss=0.07645, over 4267742.54 frames. ], batch size: 704, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:48:54,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1970076.0, ans=0.0 2023-06-25 10:49:07,030 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.051e+02 8.471e+02 1.302e+03 1.874e+03 3.604e+03, threshold=2.604e+03, percent-clipped=20.0 2023-06-25 10:49:36,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1970136.0, ans=0.125 2023-06-25 10:49:42,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1970196.0, ans=0.125 2023-06-25 10:49:44,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1970196.0, ans=0.0 2023-06-25 10:50:07,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.76 vs. limit=15.0 2023-06-25 10:50:07,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1970256.0, ans=0.1 2023-06-25 10:50:34,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1970316.0, ans=0.0 2023-06-25 10:50:38,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1970316.0, ans=0.0 2023-06-25 10:50:48,373 INFO [train.py:996] (0/4) Epoch 11, batch 23450, loss[loss=0.3265, simple_loss=0.3914, pruned_loss=0.1308, over 21817.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3043, pruned_loss=0.07705, over 4269352.14 frames. 
], batch size: 124, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:51:14,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1970436.0, ans=0.125 2023-06-25 10:51:16,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1970436.0, ans=0.125 2023-06-25 10:51:25,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1970496.0, ans=0.125 2023-06-25 10:52:18,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1970616.0, ans=0.125 2023-06-25 10:52:34,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=12.0 2023-06-25 10:52:34,823 INFO [train.py:996] (0/4) Epoch 11, batch 23500, loss[loss=0.2238, simple_loss=0.2978, pruned_loss=0.07493, over 21894.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.307, pruned_loss=0.07939, over 4273409.64 frames. ], batch size: 371, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:52:41,444 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.140e+02 8.381e+02 1.197e+03 1.768e+03 4.081e+03, threshold=2.394e+03, percent-clipped=6.0 2023-06-25 10:53:10,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1970736.0, ans=0.0 2023-06-25 10:53:30,186 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-25 10:53:40,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1970856.0, ans=0.125 2023-06-25 10:54:19,591 INFO [train.py:996] (0/4) Epoch 11, batch 23550, loss[loss=0.2209, simple_loss=0.2766, pruned_loss=0.08257, over 21974.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3009, pruned_loss=0.07884, over 4276320.30 frames. ], batch size: 375, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:54:45,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1971036.0, ans=0.125 2023-06-25 10:55:00,311 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-25 10:55:11,344 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.81 vs. limit=12.0 2023-06-25 10:56:04,926 INFO [train.py:996] (0/4) Epoch 11, batch 23600, loss[loss=0.2549, simple_loss=0.3233, pruned_loss=0.09324, over 21509.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3017, pruned_loss=0.07914, over 4276506.91 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:56:17,393 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.710e+02 7.780e+02 1.013e+03 1.475e+03 2.570e+03, threshold=2.026e+03, percent-clipped=2.0 2023-06-25 10:56:20,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.19 vs. limit=15.0 2023-06-25 10:57:06,767 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. 
limit=15.0 2023-06-25 10:57:19,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1971456.0, ans=0.0 2023-06-25 10:57:55,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1971576.0, ans=0.0 2023-06-25 10:57:56,489 INFO [train.py:996] (0/4) Epoch 11, batch 23650, loss[loss=0.3453, simple_loss=0.3972, pruned_loss=0.1468, over 21380.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3019, pruned_loss=0.07767, over 4276617.64 frames. ], batch size: 507, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:57:58,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1971576.0, ans=0.125 2023-06-25 10:58:31,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-06-25 10:58:38,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-25 10:58:46,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1971696.0, ans=0.125 2023-06-25 10:59:38,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1971816.0, ans=0.125 2023-06-25 10:59:43,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1971876.0, ans=0.2 2023-06-25 10:59:44,393 INFO [train.py:996] (0/4) Epoch 11, batch 23700, loss[loss=0.1765, simple_loss=0.2657, pruned_loss=0.04369, over 21608.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3024, pruned_loss=0.07669, over 4272112.77 frames. ], batch size: 263, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:59:56,591 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 7.722e+02 1.155e+03 1.933e+03 4.444e+03, threshold=2.311e+03, percent-clipped=20.0 2023-06-25 11:00:07,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1971936.0, ans=0.2 2023-06-25 11:00:17,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1971936.0, ans=0.125 2023-06-25 11:00:25,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1971996.0, ans=0.0 2023-06-25 11:00:57,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1972056.0, ans=0.05 2023-06-25 11:01:40,522 INFO [train.py:996] (0/4) Epoch 11, batch 23750, loss[loss=0.2443, simple_loss=0.3359, pruned_loss=0.07631, over 21463.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3057, pruned_loss=0.07764, over 4280229.21 frames. 
], batch size: 471, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:02:10,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1972236.0, ans=0.125 2023-06-25 11:02:23,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1972236.0, ans=0.125 2023-06-25 11:02:24,020 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-25 11:02:59,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1972356.0, ans=0.125 2023-06-25 11:03:28,674 INFO [train.py:996] (0/4) Epoch 11, batch 23800, loss[loss=0.2336, simple_loss=0.3166, pruned_loss=0.07528, over 21229.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3056, pruned_loss=0.07671, over 4275959.78 frames. ], batch size: 159, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:03:34,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1972476.0, ans=0.125 2023-06-25 11:03:35,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.508e+02 9.725e+02 1.368e+03 2.347e+03 4.369e+03, threshold=2.737e+03, percent-clipped=25.0 2023-06-25 11:04:35,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1972656.0, ans=0.125 2023-06-25 11:04:36,104 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-06-25 11:04:37,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1972656.0, ans=0.0 2023-06-25 11:05:16,769 INFO [train.py:996] (0/4) Epoch 11, batch 23850, loss[loss=0.2625, simple_loss=0.3402, pruned_loss=0.09245, over 21956.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3135, pruned_loss=0.07834, over 4277788.52 frames. ], batch size: 317, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:05:38,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1972776.0, ans=0.125 2023-06-25 11:06:47,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1973016.0, ans=0.125 2023-06-25 11:06:52,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1973016.0, ans=0.125 2023-06-25 11:07:13,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1973076.0, ans=0.125 2023-06-25 11:07:14,247 INFO [train.py:996] (0/4) Epoch 11, batch 23900, loss[loss=0.1949, simple_loss=0.2841, pruned_loss=0.05285, over 20752.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3192, pruned_loss=0.07966, over 4282573.79 frames. ], batch size: 607, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:07:20,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.632e+02 1.020e+03 1.662e+03 2.575e+03 5.101e+03, threshold=3.324e+03, percent-clipped=22.0 2023-06-25 11:07:40,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.60 vs. 
limit=15.0 2023-06-25 11:07:41,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1973136.0, ans=0.2 2023-06-25 11:07:57,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1973196.0, ans=0.125 2023-06-25 11:08:12,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1973196.0, ans=0.0 2023-06-25 11:08:28,195 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2023-06-25 11:08:33,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-25 11:08:48,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1973316.0, ans=0.1 2023-06-25 11:08:54,249 INFO [train.py:996] (0/4) Epoch 11, batch 23950, loss[loss=0.2123, simple_loss=0.2879, pruned_loss=0.06831, over 21358.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3174, pruned_loss=0.07978, over 4272598.82 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:09:30,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1973436.0, ans=0.95 2023-06-25 11:09:45,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1973496.0, ans=0.125 2023-06-25 11:10:47,913 INFO [train.py:996] (0/4) Epoch 11, batch 24000, loss[loss=0.2506, simple_loss=0.3164, pruned_loss=0.09234, over 21998.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3177, pruned_loss=0.08217, over 4276939.87 frames. ], batch size: 317, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:10:47,914 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 11:11:07,125 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.263, simple_loss=0.3578, pruned_loss=0.08405, over 1796401.00 frames. 2023-06-25 11:11:07,126 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 11:11:14,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.140e+02 7.509e+02 1.143e+03 1.580e+03 3.381e+03, threshold=2.286e+03, percent-clipped=1.0 2023-06-25 11:11:20,546 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-25 11:12:18,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1973856.0, ans=0.1 2023-06-25 11:12:20,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1973856.0, ans=0.0 2023-06-25 11:12:21,847 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:12:55,332 INFO [train.py:996] (0/4) Epoch 11, batch 24050, loss[loss=0.2832, simple_loss=0.3567, pruned_loss=0.1048, over 21446.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3206, pruned_loss=0.08379, over 4279745.11 frames. 
], batch size: 508, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:13:29,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1974036.0, ans=0.125 2023-06-25 11:13:42,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1974096.0, ans=0.2 2023-06-25 11:14:44,593 INFO [train.py:996] (0/4) Epoch 11, batch 24100, loss[loss=0.2375, simple_loss=0.324, pruned_loss=0.07543, over 21859.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3203, pruned_loss=0.08249, over 4275290.02 frames. ], batch size: 282, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:14:50,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.104e+02 8.872e+02 1.198e+03 1.771e+03 4.014e+03, threshold=2.396e+03, percent-clipped=16.0 2023-06-25 11:15:47,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1974456.0, ans=0.1 2023-06-25 11:16:04,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1974456.0, ans=0.2 2023-06-25 11:16:19,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=12.0 2023-06-25 11:16:29,683 INFO [train.py:996] (0/4) Epoch 11, batch 24150, loss[loss=0.2705, simple_loss=0.3406, pruned_loss=0.1002, over 21871.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3209, pruned_loss=0.0839, over 4284904.56 frames. ], batch size: 107, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:16:49,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1974576.0, ans=0.125 2023-06-25 11:17:13,780 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-25 11:17:25,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1974696.0, ans=0.0 2023-06-25 11:18:00,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1974756.0, ans=0.125 2023-06-25 11:18:17,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1974816.0, ans=0.0 2023-06-25 11:18:20,290 INFO [train.py:996] (0/4) Epoch 11, batch 24200, loss[loss=0.2729, simple_loss=0.3602, pruned_loss=0.09284, over 21610.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3229, pruned_loss=0.08545, over 4286148.28 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:18:34,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.944e+02 9.608e+02 1.226e+03 1.956e+03 3.417e+03, threshold=2.452e+03, percent-clipped=15.0 2023-06-25 11:19:27,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1975056.0, ans=0.0 2023-06-25 11:19:56,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1975116.0, ans=0.2 2023-06-25 11:20:16,275 INFO [train.py:996] (0/4) Epoch 11, batch 24250, loss[loss=0.1871, simple_loss=0.2786, pruned_loss=0.04775, over 21371.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3196, pruned_loss=0.07962, over 4289453.97 frames. 
], batch size: 211, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:20:22,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.64 vs. limit=10.0 2023-06-25 11:21:05,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1975296.0, ans=0.0 2023-06-25 11:21:05,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1975296.0, ans=0.1 2023-06-25 11:21:07,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1975296.0, ans=0.1 2023-06-25 11:21:54,568 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-25 11:22:04,829 INFO [train.py:996] (0/4) Epoch 11, batch 24300, loss[loss=0.1781, simple_loss=0.2546, pruned_loss=0.05074, over 21521.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3147, pruned_loss=0.07449, over 4287323.02 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:22:12,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.467e+02 7.478e+02 1.137e+03 1.748e+03 3.902e+03, threshold=2.274e+03, percent-clipped=10.0 2023-06-25 11:23:22,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1975656.0, ans=0.0 2023-06-25 11:23:43,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1975716.0, ans=0.95 2023-06-25 11:23:49,716 INFO [train.py:996] (0/4) Epoch 11, batch 24350, loss[loss=0.1698, simple_loss=0.2497, pruned_loss=0.04491, over 21634.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3119, pruned_loss=0.0746, over 4293097.13 frames. ], batch size: 230, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:24:03,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1975776.0, ans=0.0 2023-06-25 11:24:11,125 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-25 11:24:44,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1975896.0, ans=0.125 2023-06-25 11:25:11,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1975956.0, ans=0.0 2023-06-25 11:25:18,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-25 11:25:26,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1976016.0, ans=0.125 2023-06-25 11:25:43,177 INFO [train.py:996] (0/4) Epoch 11, batch 24400, loss[loss=0.2201, simple_loss=0.3054, pruned_loss=0.06743, over 21816.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3144, pruned_loss=0.07807, over 4293564.01 frames. 
], batch size: 282, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:26:00,908 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.788e+02 8.688e+02 1.209e+03 1.955e+03 3.228e+03, threshold=2.419e+03, percent-clipped=16.0 2023-06-25 11:26:21,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1976136.0, ans=0.125 2023-06-25 11:26:29,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1976196.0, ans=0.125 2023-06-25 11:26:37,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1976196.0, ans=0.125 2023-06-25 11:26:44,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1976256.0, ans=0.0 2023-06-25 11:26:52,001 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.52 vs. limit=12.0 2023-06-25 11:27:36,908 INFO [train.py:996] (0/4) Epoch 11, batch 24450, loss[loss=0.1992, simple_loss=0.2879, pruned_loss=0.05524, over 21244.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3138, pruned_loss=0.07877, over 4282856.41 frames. ], batch size: 159, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:27:39,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1976376.0, ans=0.125 2023-06-25 11:28:15,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1976496.0, ans=0.125 2023-06-25 11:29:07,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-25 11:29:25,326 INFO [train.py:996] (0/4) Epoch 11, batch 24500, loss[loss=0.2609, simple_loss=0.3242, pruned_loss=0.09877, over 21618.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3136, pruned_loss=0.07867, over 4289196.67 frames. ], batch size: 471, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:29:29,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=12.0 2023-06-25 11:29:32,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1976676.0, ans=0.125 2023-06-25 11:29:34,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1976676.0, ans=0.125 2023-06-25 11:29:34,930 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.743e+02 7.294e+02 9.026e+02 1.332e+03 4.707e+03, threshold=1.805e+03, percent-clipped=7.0 2023-06-25 11:31:11,592 INFO [train.py:996] (0/4) Epoch 11, batch 24550, loss[loss=0.2972, simple_loss=0.3663, pruned_loss=0.1141, over 21570.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3169, pruned_loss=0.08084, over 4284491.02 frames. 
], batch size: 414, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:31:13,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1976976.0, ans=0.125 2023-06-25 11:31:55,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1977096.0, ans=0.125 2023-06-25 11:31:59,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1977096.0, ans=0.0 2023-06-25 11:32:06,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1977096.0, ans=0.2 2023-06-25 11:32:15,440 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-25 11:32:43,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1977216.0, ans=0.125 2023-06-25 11:32:43,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1977216.0, ans=0.2 2023-06-25 11:33:02,833 INFO [train.py:996] (0/4) Epoch 11, batch 24600, loss[loss=0.2102, simple_loss=0.2658, pruned_loss=0.07736, over 21258.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3154, pruned_loss=0.08227, over 4277204.03 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:33:04,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1977276.0, ans=0.125 2023-06-25 11:33:04,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.16 vs. limit=10.0 2023-06-25 11:33:13,022 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.578e+02 9.421e+02 1.303e+03 2.147e+03 3.735e+03, threshold=2.606e+03, percent-clipped=31.0 2023-06-25 11:33:23,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1977336.0, ans=0.125 2023-06-25 11:33:41,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1977336.0, ans=0.0 2023-06-25 11:34:51,957 INFO [train.py:996] (0/4) Epoch 11, batch 24650, loss[loss=0.2055, simple_loss=0.2577, pruned_loss=0.07663, over 21321.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3093, pruned_loss=0.08014, over 4265105.13 frames. ], batch size: 144, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:34:57,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1977576.0, ans=0.0 2023-06-25 11:36:36,916 INFO [train.py:996] (0/4) Epoch 11, batch 24700, loss[loss=0.1948, simple_loss=0.2683, pruned_loss=0.06062, over 21381.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3055, pruned_loss=0.07836, over 4263236.38 frames. 
], batch size: 211, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:36:46,544 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.759e+02 8.060e+02 1.267e+03 1.761e+03 3.816e+03, threshold=2.533e+03, percent-clipped=4.0 2023-06-25 11:37:01,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1977936.0, ans=0.125 2023-06-25 11:38:17,820 INFO [train.py:996] (0/4) Epoch 11, batch 24750, loss[loss=0.2039, simple_loss=0.2609, pruned_loss=0.07349, over 20725.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.298, pruned_loss=0.07606, over 4265710.72 frames. ], batch size: 607, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:38:41,468 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=12.0 2023-06-25 11:39:45,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1978416.0, ans=0.125 2023-06-25 11:39:57,336 INFO [train.py:996] (0/4) Epoch 11, batch 24800, loss[loss=0.195, simple_loss=0.2551, pruned_loss=0.06746, over 20070.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2935, pruned_loss=0.07575, over 4270854.59 frames. ], batch size: 703, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 11:40:11,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1978476.0, ans=0.125 2023-06-25 11:40:14,477 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.678e+02 6.530e+02 8.946e+02 1.365e+03 3.586e+03, threshold=1.789e+03, percent-clipped=4.0 2023-06-25 11:40:56,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1978596.0, ans=0.125 2023-06-25 11:40:59,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1978656.0, ans=0.125 2023-06-25 11:41:32,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1978716.0, ans=0.0 2023-06-25 11:41:36,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1978716.0, ans=0.2 2023-06-25 11:41:48,322 INFO [train.py:996] (0/4) Epoch 11, batch 24850, loss[loss=0.2149, simple_loss=0.281, pruned_loss=0.07444, over 21548.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.293, pruned_loss=0.07709, over 4279118.39 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:41:54,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1978776.0, ans=0.0 2023-06-25 11:42:42,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1978896.0, ans=0.0 2023-06-25 11:43:06,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1978956.0, ans=0.2 2023-06-25 11:43:34,988 INFO [train.py:996] (0/4) Epoch 11, batch 24900, loss[loss=0.253, simple_loss=0.3139, pruned_loss=0.09607, over 21274.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2953, pruned_loss=0.07764, over 4284003.65 frames. 
], batch size: 143, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:43:37,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1979076.0, ans=0.5 2023-06-25 11:43:47,689 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.897e+02 9.704e+02 1.419e+03 1.998e+03 4.449e+03, threshold=2.839e+03, percent-clipped=31.0 2023-06-25 11:43:49,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1979136.0, ans=0.05 2023-06-25 11:44:10,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.82 vs. limit=10.0 2023-06-25 11:44:21,193 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.23 vs. limit=10.0 2023-06-25 11:44:54,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.93 vs. limit=22.5 2023-06-25 11:45:15,041 INFO [train.py:996] (0/4) Epoch 11, batch 24950, loss[loss=0.3196, simple_loss=0.3746, pruned_loss=0.1323, over 21391.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3036, pruned_loss=0.08183, over 4285204.16 frames. ], batch size: 471, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:45:19,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1979376.0, ans=0.2 2023-06-25 11:45:46,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1979436.0, ans=0.125 2023-06-25 11:45:52,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1979436.0, ans=0.125 2023-06-25 11:46:19,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1979496.0, ans=0.0 2023-06-25 11:46:22,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1979496.0, ans=0.125 2023-06-25 11:46:54,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1979616.0, ans=0.125 2023-06-25 11:47:00,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1979616.0, ans=0.125 2023-06-25 11:47:03,421 INFO [train.py:996] (0/4) Epoch 11, batch 25000, loss[loss=0.195, simple_loss=0.2572, pruned_loss=0.06635, over 21278.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3093, pruned_loss=0.08346, over 4278080.71 frames. 
], batch size: 176, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:47:23,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.368e+02 7.420e+02 9.724e+02 1.691e+03 3.300e+03, threshold=1.945e+03, percent-clipped=1.0 2023-06-25 11:48:05,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1979796.0, ans=0.1 2023-06-25 11:48:23,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1979856.0, ans=0.125 2023-06-25 11:48:24,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1979856.0, ans=0.0 2023-06-25 11:48:43,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1979916.0, ans=0.1 2023-06-25 11:48:43,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1979916.0, ans=0.125 2023-06-25 11:48:44,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1979916.0, ans=0.2 2023-06-25 11:48:48,903 INFO [train.py:996] (0/4) Epoch 11, batch 25050, loss[loss=0.2198, simple_loss=0.283, pruned_loss=0.07828, over 21268.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3047, pruned_loss=0.08179, over 4273491.86 frames. ], batch size: 144, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:49:49,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1980096.0, ans=0.125 2023-06-25 11:49:58,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1980156.0, ans=0.0 2023-06-25 11:50:10,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980156.0, ans=0.1 2023-06-25 11:50:37,635 INFO [train.py:996] (0/4) Epoch 11, batch 25100, loss[loss=0.2146, simple_loss=0.28, pruned_loss=0.07467, over 21839.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.2999, pruned_loss=0.08081, over 4270126.78 frames. ], batch size: 107, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:50:58,390 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.064e+02 8.337e+02 1.105e+03 1.657e+03 3.761e+03, threshold=2.211e+03, percent-clipped=17.0 2023-06-25 11:51:03,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1980336.0, ans=0.2 2023-06-25 11:51:04,339 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-25 11:51:44,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1980456.0, ans=0.125 2023-06-25 11:52:14,278 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-25 11:52:22,291 INFO [train.py:996] (0/4) Epoch 11, batch 25150, loss[loss=0.2101, simple_loss=0.2958, pruned_loss=0.06226, over 21821.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3005, pruned_loss=0.07857, over 4258093.84 frames. 
], batch size: 332, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:52:48,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980636.0, ans=0.1 2023-06-25 11:53:58,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.22 vs. limit=15.0 2023-06-25 11:54:08,609 INFO [train.py:996] (0/4) Epoch 11, batch 25200, loss[loss=0.189, simple_loss=0.2744, pruned_loss=0.05176, over 21448.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3004, pruned_loss=0.07678, over 4266036.68 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:54:21,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.183e+02 7.466e+02 1.183e+03 1.682e+03 4.504e+03, threshold=2.365e+03, percent-clipped=14.0 2023-06-25 11:54:38,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-06-25 11:54:44,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1980936.0, ans=0.0 2023-06-25 11:54:58,523 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.59 vs. limit=10.0 2023-06-25 11:55:39,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1981116.0, ans=0.0 2023-06-25 11:55:48,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1981116.0, ans=0.0 2023-06-25 11:55:51,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1981116.0, ans=0.04949747468305833 2023-06-25 11:55:56,156 INFO [train.py:996] (0/4) Epoch 11, batch 25250, loss[loss=0.2287, simple_loss=0.301, pruned_loss=0.0782, over 21772.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2985, pruned_loss=0.07521, over 4267845.28 frames. ], batch size: 371, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:56:05,758 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:56:43,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1981296.0, ans=0.0 2023-06-25 11:57:05,637 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-25 11:57:16,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1981356.0, ans=0.1 2023-06-25 11:57:20,514 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-25 11:57:29,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1981416.0, ans=0.125 2023-06-25 11:57:38,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1981416.0, ans=0.0 2023-06-25 11:57:42,461 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.05 vs. 
limit=12.0 2023-06-25 11:57:44,428 INFO [train.py:996] (0/4) Epoch 11, batch 25300, loss[loss=0.2672, simple_loss=0.3387, pruned_loss=0.09781, over 21663.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2976, pruned_loss=0.07474, over 4256911.11 frames. ], batch size: 351, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:57:57,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.052e+02 7.897e+02 1.317e+03 1.738e+03 3.362e+03, threshold=2.634e+03, percent-clipped=11.0 2023-06-25 11:58:14,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1981536.0, ans=0.0 2023-06-25 11:59:32,062 INFO [train.py:996] (0/4) Epoch 11, batch 25350, loss[loss=0.1886, simple_loss=0.2693, pruned_loss=0.05399, over 21567.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2997, pruned_loss=0.07406, over 4261729.95 frames. ], batch size: 263, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:00:16,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1981896.0, ans=0.125 2023-06-25 12:01:12,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1982016.0, ans=0.125 2023-06-25 12:01:17,484 INFO [train.py:996] (0/4) Epoch 11, batch 25400, loss[loss=0.2207, simple_loss=0.2786, pruned_loss=0.0814, over 21498.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2965, pruned_loss=0.07324, over 4261296.32 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:01:37,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.367e+02 9.282e+02 1.307e+03 1.888e+03 3.568e+03, threshold=2.613e+03, percent-clipped=8.0 2023-06-25 12:01:41,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1982136.0, ans=0.125 2023-06-25 12:02:47,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1982316.0, ans=0.125 2023-06-25 12:03:02,669 INFO [train.py:996] (0/4) Epoch 11, batch 25450, loss[loss=0.1874, simple_loss=0.2804, pruned_loss=0.04716, over 21708.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2977, pruned_loss=0.07451, over 4272386.35 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:04:16,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1982556.0, ans=0.04949747468305833 2023-06-25 12:04:16,744 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:04:49,937 INFO [train.py:996] (0/4) Epoch 11, batch 25500, loss[loss=0.2507, simple_loss=0.3305, pruned_loss=0.08543, over 21777.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.297, pruned_loss=0.07191, over 4261756.64 frames. ], batch size: 351, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:05:03,668 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.45 vs. 
limit=15.0 2023-06-25 12:05:06,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1982676.0, ans=0.0 2023-06-25 12:05:10,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.710e+02 7.694e+02 1.169e+03 1.712e+03 3.614e+03, threshold=2.338e+03, percent-clipped=5.0 2023-06-25 12:06:02,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1982856.0, ans=0.0 2023-06-25 12:06:34,940 INFO [train.py:996] (0/4) Epoch 11, batch 25550, loss[loss=0.2365, simple_loss=0.3446, pruned_loss=0.06422, over 21334.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3031, pruned_loss=0.07196, over 4267327.92 frames. ], batch size: 548, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:07:40,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1983096.0, ans=0.1 2023-06-25 12:08:05,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1983156.0, ans=0.125 2023-06-25 12:08:38,095 INFO [train.py:996] (0/4) Epoch 11, batch 25600, loss[loss=0.2014, simple_loss=0.2851, pruned_loss=0.05881, over 21817.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3055, pruned_loss=0.07163, over 4255061.39 frames. ], batch size: 102, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:08:52,929 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.144e+02 7.738e+02 1.030e+03 1.718e+03 3.511e+03, threshold=2.059e+03, percent-clipped=11.0 2023-06-25 12:09:50,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1983456.0, ans=0.125 2023-06-25 12:09:57,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1983516.0, ans=0.0 2023-06-25 12:10:24,094 INFO [train.py:996] (0/4) Epoch 11, batch 25650, loss[loss=0.2317, simple_loss=0.2965, pruned_loss=0.08342, over 21435.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3067, pruned_loss=0.07476, over 4247225.66 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:11:26,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.53 vs. limit=10.0 2023-06-25 12:11:46,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=12.0 2023-06-25 12:12:07,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1983816.0, ans=0.0 2023-06-25 12:12:11,578 INFO [train.py:996] (0/4) Epoch 11, batch 25700, loss[loss=0.2236, simple_loss=0.2838, pruned_loss=0.08166, over 21752.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.303, pruned_loss=0.07617, over 4243717.17 frames. 
], batch size: 282, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:12:38,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.851e+02 8.245e+02 1.134e+03 1.562e+03 3.915e+03, threshold=2.269e+03, percent-clipped=11.0 2023-06-25 12:12:47,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1983936.0, ans=0.1 2023-06-25 12:13:09,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1983996.0, ans=0.0 2023-06-25 12:13:17,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1984056.0, ans=0.125 2023-06-25 12:14:01,466 INFO [train.py:996] (0/4) Epoch 11, batch 25750, loss[loss=0.2888, simple_loss=0.3498, pruned_loss=0.1139, over 21333.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3092, pruned_loss=0.08005, over 4258642.38 frames. ], batch size: 548, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:14:29,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1984236.0, ans=0.025 2023-06-25 12:15:17,848 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:15:50,362 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:15:56,820 INFO [train.py:996] (0/4) Epoch 11, batch 25800, loss[loss=0.3173, simple_loss=0.3825, pruned_loss=0.1261, over 21755.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3177, pruned_loss=0.08322, over 4257656.94 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:16:12,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.210e+02 8.965e+02 1.490e+03 2.590e+03 4.866e+03, threshold=2.981e+03, percent-clipped=29.0 2023-06-25 12:17:13,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1984656.0, ans=0.125 2023-06-25 12:17:21,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1984656.0, ans=0.125 2023-06-25 12:17:21,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1984656.0, ans=0.125 2023-06-25 12:17:41,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1984716.0, ans=0.125 2023-06-25 12:17:41,512 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-25 12:17:45,342 INFO [train.py:996] (0/4) Epoch 11, batch 25850, loss[loss=0.2575, simple_loss=0.3203, pruned_loss=0.0974, over 21822.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3209, pruned_loss=0.08395, over 4262697.59 frames. 
], batch size: 118, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:19:06,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1984956.0, ans=0.125 2023-06-25 12:19:07,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1984956.0, ans=0.2 2023-06-25 12:19:33,851 INFO [train.py:996] (0/4) Epoch 11, batch 25900, loss[loss=0.2694, simple_loss=0.36, pruned_loss=0.0894, over 21821.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3219, pruned_loss=0.08426, over 4267323.45 frames. ], batch size: 316, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:19:54,642 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.078e+02 8.473e+02 1.223e+03 1.634e+03 2.981e+03, threshold=2.447e+03, percent-clipped=0.0 2023-06-25 12:20:49,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1985256.0, ans=0.125 2023-06-25 12:20:51,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1985256.0, ans=0.125 2023-06-25 12:21:21,903 INFO [train.py:996] (0/4) Epoch 11, batch 25950, loss[loss=0.2501, simple_loss=0.3333, pruned_loss=0.08342, over 21788.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3316, pruned_loss=0.08868, over 4276554.03 frames. ], batch size: 124, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:21:23,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1985376.0, ans=0.125 2023-06-25 12:21:25,187 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-25 12:21:51,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1985436.0, ans=0.125 2023-06-25 12:21:54,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1985436.0, ans=0.2 2023-06-25 12:22:31,709 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.34 vs. limit=10.0 2023-06-25 12:22:37,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1985556.0, ans=0.0 2023-06-25 12:22:56,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1985616.0, ans=0.125 2023-06-25 12:22:57,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1985616.0, ans=0.0 2023-06-25 12:23:02,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1985616.0, ans=0.0 2023-06-25 12:23:18,152 INFO [train.py:996] (0/4) Epoch 11, batch 26000, loss[loss=0.2376, simple_loss=0.3238, pruned_loss=0.0757, over 21714.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3311, pruned_loss=0.0867, over 4277554.20 frames. 
], batch size: 298, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 12:23:29,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1985676.0, ans=0.0 2023-06-25 12:23:39,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1985736.0, ans=0.5 2023-06-25 12:23:40,958 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.169e+02 7.867e+02 1.001e+03 1.506e+03 3.925e+03, threshold=2.003e+03, percent-clipped=6.0 2023-06-25 12:23:52,353 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:23:57,918 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=15.0 2023-06-25 12:24:15,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1985796.0, ans=0.125 2023-06-25 12:24:22,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1985796.0, ans=0.125 2023-06-25 12:24:23,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1985856.0, ans=0.0 2023-06-25 12:24:41,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1985916.0, ans=0.5 2023-06-25 12:24:43,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1985916.0, ans=0.125 2023-06-25 12:24:49,455 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:25:03,157 INFO [train.py:996] (0/4) Epoch 11, batch 26050, loss[loss=0.2389, simple_loss=0.2987, pruned_loss=0.08961, over 21860.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3314, pruned_loss=0.08782, over 4271715.43 frames. ], batch size: 282, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:25:37,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1986036.0, ans=0.125 2023-06-25 12:26:20,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-25 12:26:47,673 INFO [train.py:996] (0/4) Epoch 11, batch 26100, loss[loss=0.2421, simple_loss=0.2999, pruned_loss=0.09214, over 21619.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3265, pruned_loss=0.08776, over 4281346.48 frames. ], batch size: 195, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:27:09,766 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.534e+02 7.535e+02 1.106e+03 1.701e+03 2.759e+03, threshold=2.213e+03, percent-clipped=15.0 2023-06-25 12:27:21,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1986336.0, ans=0.125 2023-06-25 12:27:24,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1986336.0, ans=0.0 2023-06-25 12:27:37,946 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.58 vs. 
limit=15.0 2023-06-25 12:27:38,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1986396.0, ans=0.125 2023-06-25 12:28:39,869 INFO [train.py:996] (0/4) Epoch 11, batch 26150, loss[loss=0.2737, simple_loss=0.3411, pruned_loss=0.1031, over 21366.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3225, pruned_loss=0.08691, over 4287322.98 frames. ], batch size: 548, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:29:00,689 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-25 12:29:11,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1986636.0, ans=0.2 2023-06-25 12:29:16,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1986636.0, ans=0.0 2023-06-25 12:29:20,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1986636.0, ans=0.2 2023-06-25 12:29:24,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-25 12:29:25,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1986696.0, ans=0.0 2023-06-25 12:30:05,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1986816.0, ans=0.5 2023-06-25 12:30:21,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1986816.0, ans=0.125 2023-06-25 12:30:25,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1986876.0, ans=0.125 2023-06-25 12:30:26,206 INFO [train.py:996] (0/4) Epoch 11, batch 26200, loss[loss=0.273, simple_loss=0.3743, pruned_loss=0.08587, over 21778.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.323, pruned_loss=0.08438, over 4287590.95 frames. 
], batch size: 351, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:30:26,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1986876.0, ans=0.2 2023-06-25 12:30:48,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1986876.0, ans=0.0 2023-06-25 12:30:52,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1986936.0, ans=0.125 2023-06-25 12:30:52,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1986936.0, ans=0.125 2023-06-25 12:30:53,725 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.107e+02 7.785e+02 1.042e+03 1.454e+03 3.867e+03, threshold=2.084e+03, percent-clipped=10.0 2023-06-25 12:31:47,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1987116.0, ans=0.125 2023-06-25 12:31:54,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1987116.0, ans=0.0 2023-06-25 12:32:08,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1987116.0, ans=0.0 2023-06-25 12:32:10,635 INFO [train.py:996] (0/4) Epoch 11, batch 26250, loss[loss=0.2673, simple_loss=0.3387, pruned_loss=0.09798, over 21898.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3273, pruned_loss=0.08381, over 4286456.58 frames. ], batch size: 414, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:32:43,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1987236.0, ans=0.2 2023-06-25 12:32:43,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1987236.0, ans=0.09899494936611666 2023-06-25 12:32:44,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1987236.0, ans=0.2 2023-06-25 12:32:49,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1987236.0, ans=0.0 2023-06-25 12:33:06,979 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=12.0 2023-06-25 12:33:18,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1987356.0, ans=0.125 2023-06-25 12:33:47,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1987416.0, ans=0.2 2023-06-25 12:34:04,646 INFO [train.py:996] (0/4) Epoch 11, batch 26300, loss[loss=0.2183, simple_loss=0.3002, pruned_loss=0.0682, over 21912.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3244, pruned_loss=0.08383, over 4287801.76 frames. 
], batch size: 118, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:34:12,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1987476.0, ans=0.0 2023-06-25 12:34:17,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1987476.0, ans=0.1 2023-06-25 12:34:26,039 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.433e+02 7.825e+02 1.057e+03 1.626e+03 4.026e+03, threshold=2.114e+03, percent-clipped=11.0 2023-06-25 12:34:41,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1987596.0, ans=0.0 2023-06-25 12:34:48,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1987596.0, ans=0.125 2023-06-25 12:35:49,282 INFO [train.py:996] (0/4) Epoch 11, batch 26350, loss[loss=0.2634, simple_loss=0.3345, pruned_loss=0.0962, over 21303.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3223, pruned_loss=0.08466, over 4288221.57 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:36:06,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1987836.0, ans=0.2 2023-06-25 12:36:14,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1987836.0, ans=0.0 2023-06-25 12:36:16,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1987836.0, ans=0.125 2023-06-25 12:37:15,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1988016.0, ans=0.125 2023-06-25 12:37:31,608 INFO [train.py:996] (0/4) Epoch 11, batch 26400, loss[loss=0.1874, simple_loss=0.255, pruned_loss=0.05987, over 21249.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3159, pruned_loss=0.08428, over 4280886.97 frames. ], batch size: 160, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 12:37:50,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.280e+02 8.065e+02 9.903e+02 1.362e+03 2.931e+03, threshold=1.981e+03, percent-clipped=5.0 2023-06-25 12:37:52,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1988136.0, ans=0.0 2023-06-25 12:38:24,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1988196.0, ans=0.125 2023-06-25 12:39:11,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1988316.0, ans=0.125 2023-06-25 12:39:22,519 INFO [train.py:996] (0/4) Epoch 11, batch 26450, loss[loss=0.2195, simple_loss=0.2943, pruned_loss=0.07232, over 21364.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3145, pruned_loss=0.0833, over 4280588.97 frames. ], batch size: 211, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:39:45,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.08 vs. limit=12.0 2023-06-25 12:40:56,944 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.09 vs. 
limit=12.0 2023-06-25 12:41:04,381 INFO [train.py:996] (0/4) Epoch 11, batch 26500, loss[loss=0.1805, simple_loss=0.241, pruned_loss=0.05994, over 21326.00 frames. ], tot_loss[loss=0.241, simple_loss=0.318, pruned_loss=0.08198, over 4269450.13 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:41:34,320 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.135e+02 9.452e+02 1.417e+03 2.268e+03 5.584e+03, threshold=2.834e+03, percent-clipped=34.0 2023-06-25 12:43:02,816 INFO [train.py:996] (0/4) Epoch 11, batch 26550, loss[loss=0.225, simple_loss=0.3322, pruned_loss=0.05891, over 21178.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3127, pruned_loss=0.07891, over 4252464.66 frames. ], batch size: 548, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:43:43,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1989036.0, ans=0.125 2023-06-25 12:43:47,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=12.0 2023-06-25 12:44:55,007 INFO [train.py:996] (0/4) Epoch 11, batch 26600, loss[loss=0.1977, simple_loss=0.2758, pruned_loss=0.05987, over 21439.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3122, pruned_loss=0.07629, over 4247203.39 frames. ], batch size: 212, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:45:18,703 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.580e+02 8.378e+02 1.280e+03 1.887e+03 4.610e+03, threshold=2.560e+03, percent-clipped=7.0 2023-06-25 12:45:24,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1989336.0, ans=0.1 2023-06-25 12:46:41,113 INFO [train.py:996] (0/4) Epoch 11, batch 26650, loss[loss=0.2038, simple_loss=0.289, pruned_loss=0.05929, over 21492.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3054, pruned_loss=0.07507, over 4252633.86 frames. ], batch size: 473, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:46:49,333 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-25 12:47:04,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1989636.0, ans=0.1 2023-06-25 12:47:26,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1989696.0, ans=0.0 2023-06-25 12:47:34,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1989756.0, ans=0.1 2023-06-25 12:47:39,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1989756.0, ans=0.2 2023-06-25 12:48:21,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1989816.0, ans=10.0 2023-06-25 12:48:26,223 INFO [train.py:996] (0/4) Epoch 11, batch 26700, loss[loss=0.2548, simple_loss=0.32, pruned_loss=0.09485, over 21879.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2979, pruned_loss=0.0719, over 4264451.72 frames. 
], batch size: 371, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:48:49,820 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 6.649e+02 8.698e+02 1.295e+03 2.536e+03, threshold=1.740e+03, percent-clipped=0.0 2023-06-25 12:49:02,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1989936.0, ans=0.1 2023-06-25 12:50:13,225 INFO [train.py:996] (0/4) Epoch 11, batch 26750, loss[loss=0.2821, simple_loss=0.3657, pruned_loss=0.09926, over 21823.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2981, pruned_loss=0.07127, over 4262454.84 frames. ], batch size: 124, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:50:13,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1990176.0, ans=0.0 2023-06-25 12:51:25,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1990356.0, ans=0.125 2023-06-25 12:51:32,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1990416.0, ans=0.125 2023-06-25 12:51:42,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1990416.0, ans=0.0 2023-06-25 12:51:54,758 INFO [train.py:996] (0/4) Epoch 11, batch 26800, loss[loss=0.2712, simple_loss=0.3396, pruned_loss=0.1014, over 21765.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.306, pruned_loss=0.07552, over 4265145.01 frames. ], batch size: 332, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 12:52:04,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1990476.0, ans=0.125 2023-06-25 12:52:15,291 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.285e+02 8.736e+02 1.158e+03 1.774e+03 3.470e+03, threshold=2.315e+03, percent-clipped=25.0 2023-06-25 12:52:59,202 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:53:15,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1990656.0, ans=0.0 2023-06-25 12:53:25,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2023-06-25 12:53:31,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1990716.0, ans=0.07 2023-06-25 12:53:38,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1990716.0, ans=0.1 2023-06-25 12:53:42,516 INFO [train.py:996] (0/4) Epoch 11, batch 26850, loss[loss=0.2299, simple_loss=0.2953, pruned_loss=0.08222, over 21808.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3079, pruned_loss=0.078, over 4260972.08 frames. ], batch size: 352, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:53:46,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1990776.0, ans=0.07 2023-06-25 12:54:28,606 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.68 vs. 
limit=6.0 2023-06-25 12:55:26,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1991076.0, ans=0.0 2023-06-25 12:55:27,819 INFO [train.py:996] (0/4) Epoch 11, batch 26900, loss[loss=0.2246, simple_loss=0.2743, pruned_loss=0.08742, over 21352.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2989, pruned_loss=0.07683, over 4263901.63 frames. ], batch size: 160, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:55:47,033 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.238e+02 7.282e+02 8.869e+02 1.344e+03 2.683e+03, threshold=1.774e+03, percent-clipped=1.0 2023-06-25 12:55:52,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1991136.0, ans=0.125 2023-06-25 12:57:07,831 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=15.0 2023-06-25 12:57:13,479 INFO [train.py:996] (0/4) Epoch 11, batch 26950, loss[loss=0.2327, simple_loss=0.3058, pruned_loss=0.07978, over 21448.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2998, pruned_loss=0.07783, over 4261561.24 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:57:15,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1991376.0, ans=0.0 2023-06-25 12:57:22,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1991376.0, ans=0.125 2023-06-25 12:58:59,446 INFO [train.py:996] (0/4) Epoch 11, batch 27000, loss[loss=0.2006, simple_loss=0.3089, pruned_loss=0.04616, over 19799.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2998, pruned_loss=0.07526, over 4258617.67 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 12:58:59,447 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 12:59:12,075 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.8462, 3.3198, 3.2564, 1.7915], device='cuda:0') 2023-06-25 12:59:16,968 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.235, simple_loss=0.334, pruned_loss=0.06803, over 1796401.00 frames. 2023-06-25 12:59:16,970 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 12:59:55,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1991736.0, ans=0.125 2023-06-25 12:59:55,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.981e+02 9.015e+02 1.282e+03 1.827e+03 4.662e+03, threshold=2.565e+03, percent-clipped=27.0 2023-06-25 13:00:27,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1991856.0, ans=0.1 2023-06-25 13:00:28,146 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0 2023-06-25 13:00:28,391 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. 
limit=6.0 2023-06-25 13:01:03,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1991916.0, ans=0.125 2023-06-25 13:01:06,448 INFO [train.py:996] (0/4) Epoch 11, batch 27050, loss[loss=0.2232, simple_loss=0.3033, pruned_loss=0.07152, over 21590.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3034, pruned_loss=0.0724, over 4261881.23 frames. ], batch size: 263, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:01:11,838 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-332000.pt 2023-06-25 13:02:53,336 INFO [train.py:996] (0/4) Epoch 11, batch 27100, loss[loss=0.1929, simple_loss=0.2663, pruned_loss=0.05977, over 21205.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3038, pruned_loss=0.07357, over 4271258.40 frames. ], batch size: 607, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:02:58,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1992276.0, ans=0.125 2023-06-25 13:03:02,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1992276.0, ans=0.0 2023-06-25 13:03:27,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.097e+02 9.568e+02 1.359e+03 2.016e+03 3.804e+03, threshold=2.717e+03, percent-clipped=7.0 2023-06-25 13:03:40,847 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.30 vs. limit=22.5 2023-06-25 13:03:41,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1992336.0, ans=0.125 2023-06-25 13:03:45,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1992396.0, ans=0.0 2023-06-25 13:04:17,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1992516.0, ans=0.125 2023-06-25 13:04:29,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-25 13:04:41,030 INFO [train.py:996] (0/4) Epoch 11, batch 27150, loss[loss=0.2427, simple_loss=0.3595, pruned_loss=0.06301, over 19877.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3177, pruned_loss=0.07811, over 4272479.35 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:04:44,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1992576.0, ans=0.2 2023-06-25 13:05:04,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1992576.0, ans=0.0 2023-06-25 13:05:30,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1992696.0, ans=0.0 2023-06-25 13:06:10,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1992816.0, ans=0.0 2023-06-25 13:06:16,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-25 13:06:27,160 INFO [train.py:996] (0/4) Epoch 11, batch 27200, loss[loss=0.2335, simple_loss=0.304, pruned_loss=0.08153, over 21623.00 frames. 
], tot_loss[loss=0.2431, simple_loss=0.3252, pruned_loss=0.08053, over 4272505.53 frames. ], batch size: 112, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:07:01,202 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 8.854e+02 1.107e+03 1.912e+03 4.473e+03, threshold=2.214e+03, percent-clipped=8.0 2023-06-25 13:07:08,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1992936.0, ans=0.125 2023-06-25 13:07:11,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=22.5 2023-06-25 13:07:11,278 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-25 13:07:12,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1992996.0, ans=0.1 2023-06-25 13:07:14,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1992996.0, ans=0.2 2023-06-25 13:07:35,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1993056.0, ans=0.125 2023-06-25 13:08:20,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1993116.0, ans=0.1 2023-06-25 13:08:27,032 INFO [train.py:996] (0/4) Epoch 11, batch 27250, loss[loss=0.3436, simple_loss=0.3893, pruned_loss=0.1489, over 21383.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3274, pruned_loss=0.08442, over 4265505.59 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:08:46,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-25 13:08:51,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1993236.0, ans=0.0 2023-06-25 13:08:54,760 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=15.0 2023-06-25 13:09:12,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-25 13:10:13,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1993416.0, ans=0.125 2023-06-25 13:10:18,718 INFO [train.py:996] (0/4) Epoch 11, batch 27300, loss[loss=0.2632, simple_loss=0.3519, pruned_loss=0.08727, over 21309.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.329, pruned_loss=0.08535, over 4270615.60 frames. 
], batch size: 549, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:10:32,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1993476.0, ans=0.125 2023-06-25 13:10:46,370 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.500e+02 8.064e+02 1.048e+03 1.560e+03 3.072e+03, threshold=2.097e+03, percent-clipped=8.0 2023-06-25 13:10:46,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1993536.0, ans=0.0 2023-06-25 13:11:39,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1993656.0, ans=0.2 2023-06-25 13:12:04,752 INFO [train.py:996] (0/4) Epoch 11, batch 27350, loss[loss=0.2448, simple_loss=0.3219, pruned_loss=0.08388, over 21476.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3301, pruned_loss=0.08598, over 4271595.42 frames. ], batch size: 194, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:12:10,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1993776.0, ans=0.0 2023-06-25 13:12:26,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-25 13:12:37,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.90 vs. limit=15.0 2023-06-25 13:12:53,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1993896.0, ans=0.015 2023-06-25 13:13:50,534 INFO [train.py:996] (0/4) Epoch 11, batch 27400, loss[loss=0.2314, simple_loss=0.2977, pruned_loss=0.08253, over 21512.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3244, pruned_loss=0.08467, over 4280118.32 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:14:13,834 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=22.5 2023-06-25 13:14:17,630 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.960e+02 7.617e+02 1.033e+03 1.386e+03 3.217e+03, threshold=2.066e+03, percent-clipped=9.0 2023-06-25 13:15:02,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1994256.0, ans=0.1 2023-06-25 13:15:05,134 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-25 13:15:37,987 INFO [train.py:996] (0/4) Epoch 11, batch 27450, loss[loss=0.2237, simple_loss=0.3141, pruned_loss=0.06665, over 21749.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3188, pruned_loss=0.08299, over 4279384.14 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:15:54,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.57 vs. limit=6.0 2023-06-25 13:16:30,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1994496.0, ans=0.1 2023-06-25 13:16:55,574 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.54 vs. 
limit=15.0 2023-06-25 13:17:07,637 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=12.0 2023-06-25 13:17:23,737 INFO [train.py:996] (0/4) Epoch 11, batch 27500, loss[loss=0.2259, simple_loss=0.2938, pruned_loss=0.07902, over 21562.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3187, pruned_loss=0.08393, over 4289536.76 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:17:36,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1994676.0, ans=0.125 2023-06-25 13:17:51,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 7.259e+02 1.005e+03 1.389e+03 2.816e+03, threshold=2.010e+03, percent-clipped=4.0 2023-06-25 13:18:44,048 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.26 vs. limit=10.0 2023-06-25 13:18:56,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1994916.0, ans=0.125 2023-06-25 13:19:07,767 INFO [train.py:996] (0/4) Epoch 11, batch 27550, loss[loss=0.2105, simple_loss=0.2782, pruned_loss=0.07138, over 21252.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3121, pruned_loss=0.08, over 4293643.71 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:19:17,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.36 vs. limit=10.0 2023-06-25 13:19:26,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1995036.0, ans=0.1 2023-06-25 13:19:28,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1995036.0, ans=22.5 2023-06-25 13:20:01,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1995096.0, ans=0.125 2023-06-25 13:20:06,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1995096.0, ans=0.0 2023-06-25 13:20:13,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-25 13:20:40,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1995216.0, ans=0.125 2023-06-25 13:20:54,531 INFO [train.py:996] (0/4) Epoch 11, batch 27600, loss[loss=0.2037, simple_loss=0.2644, pruned_loss=0.07149, over 21574.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3052, pruned_loss=0.07904, over 4284833.57 frames. 
], batch size: 247, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:21:01,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1995276.0, ans=0.1 2023-06-25 13:21:17,412 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.292e+02 9.329e+02 1.469e+03 1.993e+03 3.791e+03, threshold=2.938e+03, percent-clipped=25.0 2023-06-25 13:22:22,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1995516.0, ans=0.2 2023-06-25 13:22:27,930 INFO [train.py:996] (0/4) Epoch 11, batch 27650, loss[loss=0.2383, simple_loss=0.3212, pruned_loss=0.07769, over 19982.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2997, pruned_loss=0.07834, over 4280206.64 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:22:46,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1995576.0, ans=0.0 2023-06-25 13:22:53,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1995636.0, ans=0.125 2023-06-25 13:23:00,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1995636.0, ans=0.0 2023-06-25 13:23:12,385 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.66 vs. limit=15.0 2023-06-25 13:23:21,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1995696.0, ans=0.2 2023-06-25 13:23:24,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1995696.0, ans=0.125 2023-06-25 13:23:59,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-25 13:24:10,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1995816.0, ans=0.125 2023-06-25 13:24:11,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1995816.0, ans=0.125 2023-06-25 13:24:19,728 INFO [train.py:996] (0/4) Epoch 11, batch 27700, loss[loss=0.2608, simple_loss=0.3447, pruned_loss=0.08847, over 21665.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2997, pruned_loss=0.07671, over 4280610.86 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:24:22,708 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5 2023-06-25 13:24:43,463 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.775e+02 8.277e+02 1.271e+03 1.738e+03 3.564e+03, threshold=2.542e+03, percent-clipped=2.0 2023-06-25 13:25:17,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-25 13:26:05,580 INFO [train.py:996] (0/4) Epoch 11, batch 27750, loss[loss=0.1743, simple_loss=0.2618, pruned_loss=0.04337, over 21319.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3052, pruned_loss=0.07703, over 4275990.15 frames. 
], batch size: 159, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:26:25,866 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-25 13:27:09,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-25 13:27:17,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1996356.0, ans=0.2 2023-06-25 13:27:42,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1996476.0, ans=0.0 2023-06-25 13:27:43,143 INFO [train.py:996] (0/4) Epoch 11, batch 27800, loss[loss=0.2613, simple_loss=0.324, pruned_loss=0.09932, over 21727.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3035, pruned_loss=0.07708, over 4284029.33 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:27:43,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1996476.0, ans=0.0 2023-06-25 13:28:03,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1996536.0, ans=0.0 2023-06-25 13:28:10,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.247e+02 7.249e+02 9.541e+02 1.506e+03 2.955e+03, threshold=1.908e+03, percent-clipped=10.0 2023-06-25 13:28:43,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-25 13:29:11,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1996716.0, ans=0.2 2023-06-25 13:29:13,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1996716.0, ans=0.0 2023-06-25 13:29:27,217 INFO [train.py:996] (0/4) Epoch 11, batch 27850, loss[loss=0.2344, simple_loss=0.3261, pruned_loss=0.07133, over 21858.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3027, pruned_loss=0.07793, over 4294083.76 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:30:22,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1996896.0, ans=0.125 2023-06-25 13:30:38,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1996956.0, ans=0.125 2023-06-25 13:30:48,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1996956.0, ans=0.2 2023-06-25 13:30:59,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1997016.0, ans=0.1 2023-06-25 13:31:17,256 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=22.5 2023-06-25 13:31:17,910 INFO [train.py:996] (0/4) Epoch 11, batch 27900, loss[loss=0.1983, simple_loss=0.2922, pruned_loss=0.05217, over 21394.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3114, pruned_loss=0.079, over 4296117.77 frames. 
], batch size: 211, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:31:53,159 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.447e+02 7.594e+02 1.073e+03 1.549e+03 3.110e+03, threshold=2.145e+03, percent-clipped=9.0 2023-06-25 13:31:59,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-25 13:32:17,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1997196.0, ans=0.0 2023-06-25 13:32:17,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1997196.0, ans=0.0 2023-06-25 13:32:34,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1997256.0, ans=0.0 2023-06-25 13:33:11,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1997376.0, ans=0.125 2023-06-25 13:33:12,656 INFO [train.py:996] (0/4) Epoch 11, batch 27950, loss[loss=0.2629, simple_loss=0.3833, pruned_loss=0.07128, over 19899.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3108, pruned_loss=0.07525, over 4281477.20 frames. ], batch size: 703, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:33:35,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1997436.0, ans=0.125 2023-06-25 13:33:51,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2023-06-25 13:33:57,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1997496.0, ans=0.0 2023-06-25 13:34:17,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1997556.0, ans=0.1 2023-06-25 13:34:27,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1997556.0, ans=0.125 2023-06-25 13:34:57,437 INFO [train.py:996] (0/4) Epoch 11, batch 28000, loss[loss=0.218, simple_loss=0.2874, pruned_loss=0.07428, over 21813.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3078, pruned_loss=0.07289, over 4281262.89 frames. ], batch size: 247, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 13:34:58,074 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:35:11,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1997676.0, ans=0.1 2023-06-25 13:35:13,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1997676.0, ans=0.125 2023-06-25 13:35:22,754 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:35:25,421 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.615e+02 8.690e+02 1.335e+03 1.864e+03 4.176e+03, threshold=2.670e+03, percent-clipped=16.0 2023-06-25 13:36:49,674 INFO [train.py:996] (0/4) Epoch 11, batch 28050, loss[loss=0.2165, simple_loss=0.2955, pruned_loss=0.06872, over 21797.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.306, pruned_loss=0.0749, over 4283485.70 frames. 
], batch size: 332, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 13:36:50,334 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:37:21,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-25 13:37:33,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1998096.0, ans=0.0 2023-06-25 13:37:37,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1998096.0, ans=0.0 2023-06-25 13:37:40,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1998096.0, ans=0.125 2023-06-25 13:38:02,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1998156.0, ans=0.07 2023-06-25 13:38:33,509 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0 2023-06-25 13:38:37,708 INFO [train.py:996] (0/4) Epoch 11, batch 28100, loss[loss=0.2152, simple_loss=0.2867, pruned_loss=0.07187, over 21735.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3041, pruned_loss=0.07482, over 4279244.94 frames. ], batch size: 371, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:39:01,517 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.345e+02 8.283e+02 1.257e+03 1.912e+03 3.792e+03, threshold=2.513e+03, percent-clipped=5.0 2023-06-25 13:39:14,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1998396.0, ans=0.2 2023-06-25 13:39:28,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1998396.0, ans=0.125 2023-06-25 13:40:22,787 INFO [train.py:996] (0/4) Epoch 11, batch 28150, loss[loss=0.2317, simple_loss=0.2848, pruned_loss=0.08928, over 21226.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.298, pruned_loss=0.07548, over 4263258.50 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:41:58,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1998816.0, ans=0.1 2023-06-25 13:42:10,821 INFO [train.py:996] (0/4) Epoch 11, batch 28200, loss[loss=0.2725, simple_loss=0.4109, pruned_loss=0.06702, over 19889.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2969, pruned_loss=0.07629, over 4265998.18 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:42:42,182 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.056e+02 7.864e+02 1.049e+03 1.647e+03 3.891e+03, threshold=2.099e+03, percent-clipped=11.0 2023-06-25 13:43:57,362 INFO [train.py:996] (0/4) Epoch 11, batch 28250, loss[loss=0.2231, simple_loss=0.2902, pruned_loss=0.07802, over 21159.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3006, pruned_loss=0.07897, over 4269006.45 frames. 
], batch size: 143, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:44:30,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1999236.0, ans=0.07 2023-06-25 13:45:45,822 INFO [train.py:996] (0/4) Epoch 11, batch 28300, loss[loss=0.1937, simple_loss=0.2961, pruned_loss=0.04563, over 21624.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2979, pruned_loss=0.07648, over 4275515.83 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:46:14,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-25 13:46:24,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.207e+02 7.571e+02 1.027e+03 1.599e+03 2.949e+03, threshold=2.054e+03, percent-clipped=6.0 2023-06-25 13:46:44,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1999596.0, ans=0.125 2023-06-25 13:47:11,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1999656.0, ans=0.125 2023-06-25 13:47:38,024 INFO [train.py:996] (0/4) Epoch 11, batch 28350, loss[loss=0.1601, simple_loss=0.2497, pruned_loss=0.03525, over 21366.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2956, pruned_loss=0.07161, over 4271074.71 frames. ], batch size: 211, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:48:42,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1999896.0, ans=0.125 2023-06-25 13:48:55,194 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-25 13:49:24,432 INFO [train.py:996] (0/4) Epoch 11, batch 28400, loss[loss=0.2306, simple_loss=0.2948, pruned_loss=0.08317, over 21798.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2928, pruned_loss=0.07131, over 4255294.74 frames. ], batch size: 372, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:49:27,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=12.0 2023-06-25 13:49:57,011 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.013e+02 9.043e+02 1.474e+03 1.977e+03 3.910e+03, threshold=2.949e+03, percent-clipped=21.0 2023-06-25 13:50:32,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2000256.0, ans=0.125 2023-06-25 13:50:45,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-25 13:51:09,819 INFO [train.py:996] (0/4) Epoch 11, batch 28450, loss[loss=0.249, simple_loss=0.3223, pruned_loss=0.08788, over 21740.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2993, pruned_loss=0.0754, over 4263249.59 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:51:59,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2000496.0, ans=0.0 2023-06-25 13:52:18,449 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.98 vs. 
limit=10.0 2023-06-25 13:52:45,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-25 13:53:03,187 INFO [train.py:996] (0/4) Epoch 11, batch 28500, loss[loss=0.2565, simple_loss=0.3309, pruned_loss=0.09105, over 21682.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3017, pruned_loss=0.07792, over 4272886.41 frames. ], batch size: 415, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:53:38,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2000736.0, ans=0.125 2023-06-25 13:53:39,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0 2023-06-25 13:53:45,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2000736.0, ans=0.1 2023-06-25 13:53:46,212 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.310e+02 7.652e+02 9.900e+02 1.430e+03 3.378e+03, threshold=1.980e+03, percent-clipped=2.0 2023-06-25 13:54:51,346 INFO [train.py:996] (0/4) Epoch 11, batch 28550, loss[loss=0.3463, simple_loss=0.4357, pruned_loss=0.1284, over 21527.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3091, pruned_loss=0.08021, over 4276730.36 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 4.0 2023-06-25 13:55:26,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2001036.0, ans=0.0 2023-06-25 13:56:44,027 INFO [train.py:996] (0/4) Epoch 11, batch 28600, loss[loss=0.2676, simple_loss=0.3461, pruned_loss=0.09459, over 21557.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3169, pruned_loss=0.083, over 4276610.00 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:57:18,473 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.357e+02 7.619e+02 9.941e+02 1.518e+03 3.528e+03, threshold=1.988e+03, percent-clipped=12.0 2023-06-25 13:57:19,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2001336.0, ans=0.1 2023-06-25 13:57:29,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2001396.0, ans=0.07 2023-06-25 13:57:32,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2001396.0, ans=0.125 2023-06-25 13:57:39,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2001396.0, ans=0.125 2023-06-25 13:58:13,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=12.0 2023-06-25 13:58:14,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2001516.0, ans=0.125 2023-06-25 13:58:28,365 INFO [train.py:996] (0/4) Epoch 11, batch 28650, loss[loss=0.2212, simple_loss=0.2748, pruned_loss=0.08384, over 21135.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3107, pruned_loss=0.08217, over 4277781.06 frames. 
], batch size: 176, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:58:28,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2001576.0, ans=0.125 2023-06-25 13:58:29,619 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.91 vs. limit=15.0 2023-06-25 13:59:15,068 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-06-25 13:59:36,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2001756.0, ans=0.07 2023-06-25 13:59:50,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2001756.0, ans=0.2 2023-06-25 13:59:52,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2001816.0, ans=0.2 2023-06-25 14:00:20,289 INFO [train.py:996] (0/4) Epoch 11, batch 28700, loss[loss=0.2273, simple_loss=0.3013, pruned_loss=0.07658, over 20661.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3088, pruned_loss=0.08268, over 4267791.31 frames. ], batch size: 607, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 14:00:42,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2001936.0, ans=0.0 2023-06-25 14:00:55,778 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.441e+02 7.291e+02 9.817e+02 1.860e+03 4.444e+03, threshold=1.963e+03, percent-clipped=16.0 2023-06-25 14:01:32,042 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:01:32,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-06-25 14:02:03,144 INFO [train.py:996] (0/4) Epoch 11, batch 28750, loss[loss=0.2396, simple_loss=0.3249, pruned_loss=0.07718, over 21725.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3092, pruned_loss=0.08318, over 4268688.21 frames. ], batch size: 441, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 14:02:57,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2002296.0, ans=0.0 2023-06-25 14:03:33,570 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-25 14:03:48,799 INFO [train.py:996] (0/4) Epoch 11, batch 28800, loss[loss=0.2383, simple_loss=0.3139, pruned_loss=0.08139, over 21757.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3133, pruned_loss=0.08334, over 4273297.46 frames. 
], batch size: 298, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:04:29,430 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.183e+02 7.612e+02 1.083e+03 1.492e+03 3.378e+03, threshold=2.166e+03, percent-clipped=11.0 2023-06-25 14:04:34,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2002596.0, ans=0.025 2023-06-25 14:05:14,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2002716.0, ans=0.1 2023-06-25 14:05:23,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2002716.0, ans=0.125 2023-06-25 14:05:29,414 INFO [train.py:996] (0/4) Epoch 11, batch 28850, loss[loss=0.2465, simple_loss=0.3053, pruned_loss=0.09384, over 21488.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.314, pruned_loss=0.08455, over 4282658.13 frames. ], batch size: 144, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:05:36,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2002776.0, ans=0.015 2023-06-25 14:05:47,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2002776.0, ans=0.09899494936611666 2023-06-25 14:06:13,696 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:07:22,647 INFO [train.py:996] (0/4) Epoch 11, batch 28900, loss[loss=0.2385, simple_loss=0.3095, pruned_loss=0.08377, over 21691.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3169, pruned_loss=0.08581, over 4282340.82 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:08:00,107 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.503e+02 7.472e+02 9.917e+02 1.436e+03 2.913e+03, threshold=1.983e+03, percent-clipped=5.0 2023-06-25 14:08:35,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2003256.0, ans=0.2 2023-06-25 14:09:08,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2003316.0, ans=0.125 2023-06-25 14:09:17,154 INFO [train.py:996] (0/4) Epoch 11, batch 28950, loss[loss=0.3458, simple_loss=0.4191, pruned_loss=0.1362, over 21480.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3195, pruned_loss=0.08548, over 4277265.90 frames. ], batch size: 507, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:10:41,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2003616.0, ans=0.0 2023-06-25 14:11:06,038 INFO [train.py:996] (0/4) Epoch 11, batch 29000, loss[loss=0.2483, simple_loss=0.3309, pruned_loss=0.08288, over 21432.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3214, pruned_loss=0.08437, over 4276210.68 frames. 
], batch size: 131, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:11:11,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2003676.0, ans=0.0 2023-06-25 14:11:48,365 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.370e+02 8.594e+02 1.350e+03 2.116e+03 4.440e+03, threshold=2.700e+03, percent-clipped=27.0 2023-06-25 14:12:03,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2003796.0, ans=0.125 2023-06-25 14:12:03,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2003796.0, ans=0.125 2023-06-25 14:12:21,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2003856.0, ans=0.125 2023-06-25 14:12:25,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2003856.0, ans=0.0 2023-06-25 14:12:27,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2003856.0, ans=0.125 2023-06-25 14:12:52,696 INFO [train.py:996] (0/4) Epoch 11, batch 29050, loss[loss=0.2096, simple_loss=0.2763, pruned_loss=0.07142, over 21515.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3199, pruned_loss=0.08535, over 4278397.72 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:12:58,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2003976.0, ans=0.125 2023-06-25 14:13:25,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2004036.0, ans=0.125 2023-06-25 14:13:32,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=12.0 2023-06-25 14:13:33,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2004036.0, ans=0.0 2023-06-25 14:13:40,161 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.64 vs. limit=22.5 2023-06-25 14:13:43,299 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-25 14:14:05,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2004156.0, ans=0.125 2023-06-25 14:14:07,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2004156.0, ans=0.0 2023-06-25 14:14:37,991 INFO [train.py:996] (0/4) Epoch 11, batch 29100, loss[loss=0.19, simple_loss=0.2605, pruned_loss=0.05975, over 21667.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3121, pruned_loss=0.08277, over 4270099.10 frames. 
], batch size: 282, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:14:48,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2004276.0, ans=0.0 2023-06-25 14:15:19,697 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.995e+02 7.507e+02 9.912e+02 1.574e+03 3.418e+03, threshold=1.982e+03, percent-clipped=5.0 2023-06-25 14:16:24,471 INFO [train.py:996] (0/4) Epoch 11, batch 29150, loss[loss=0.239, simple_loss=0.3161, pruned_loss=0.08091, over 21678.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3109, pruned_loss=0.08111, over 4266072.66 frames. ], batch size: 247, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:16:47,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2004576.0, ans=0.125 2023-06-25 14:17:18,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2004696.0, ans=0.1 2023-06-25 14:17:56,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2004816.0, ans=0.0 2023-06-25 14:18:08,698 INFO [train.py:996] (0/4) Epoch 11, batch 29200, loss[loss=0.2212, simple_loss=0.2815, pruned_loss=0.08042, over 21179.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3062, pruned_loss=0.08003, over 4268677.10 frames. ], batch size: 176, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 14:18:18,333 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-25 14:18:49,095 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.775e+02 8.207e+02 1.113e+03 1.658e+03 3.096e+03, threshold=2.226e+03, percent-clipped=9.0 2023-06-25 14:19:24,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-25 14:19:58,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2005116.0, ans=0.0 2023-06-25 14:20:00,633 INFO [train.py:996] (0/4) Epoch 11, batch 29250, loss[loss=0.2138, simple_loss=0.2914, pruned_loss=0.06806, over 21096.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3043, pruned_loss=0.07748, over 4271029.44 frames. ], batch size: 143, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 14:20:19,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2005176.0, ans=0.125 2023-06-25 14:20:50,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2005296.0, ans=0.125 2023-06-25 14:21:06,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2005356.0, ans=0.0 2023-06-25 14:21:42,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2005416.0, ans=0.125 2023-06-25 14:21:47,250 INFO [train.py:996] (0/4) Epoch 11, batch 29300, loss[loss=0.268, simple_loss=0.3253, pruned_loss=0.1054, over 21463.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3071, pruned_loss=0.0775, over 4272846.42 frames. 
], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:22:25,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.878e+02 9.346e+02 1.272e+03 1.765e+03 3.710e+03, threshold=2.544e+03, percent-clipped=11.0 2023-06-25 14:22:27,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2023-06-25 14:22:58,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2005656.0, ans=0.125 2023-06-25 14:23:00,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.94 vs. limit=10.0 2023-06-25 14:23:04,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2005656.0, ans=0.125 2023-06-25 14:23:20,852 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-25 14:23:38,597 INFO [train.py:996] (0/4) Epoch 11, batch 29350, loss[loss=0.2197, simple_loss=0.3004, pruned_loss=0.06949, over 21549.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3026, pruned_loss=0.07672, over 4268807.73 frames. ], batch size: 230, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:24:43,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2005956.0, ans=0.0 2023-06-25 14:25:26,792 INFO [train.py:996] (0/4) Epoch 11, batch 29400, loss[loss=0.1851, simple_loss=0.2602, pruned_loss=0.05504, over 21780.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3027, pruned_loss=0.07467, over 4264704.53 frames. ], batch size: 282, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:25:28,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2006076.0, ans=0.125 2023-06-25 14:26:04,476 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.130e+02 8.606e+02 1.280e+03 1.886e+03 3.409e+03, threshold=2.560e+03, percent-clipped=11.0 2023-06-25 14:26:58,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2006316.0, ans=0.125 2023-06-25 14:27:00,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. limit=10.0 2023-06-25 14:27:06,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2006316.0, ans=0.125 2023-06-25 14:27:15,011 INFO [train.py:996] (0/4) Epoch 11, batch 29450, loss[loss=0.2259, simple_loss=0.3123, pruned_loss=0.06974, over 21413.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3006, pruned_loss=0.07353, over 4265445.84 frames. ], batch size: 131, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:27:17,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2006376.0, ans=0.125 2023-06-25 14:28:07,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2006496.0, ans=0.1 2023-06-25 14:29:00,501 INFO [train.py:996] (0/4) Epoch 11, batch 29500, loss[loss=0.241, simple_loss=0.3064, pruned_loss=0.08778, over 21845.00 frames. 
], tot_loss[loss=0.23, simple_loss=0.3052, pruned_loss=0.07738, over 4272217.35 frames. ], batch size: 441, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:29:30,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2006736.0, ans=0.1 2023-06-25 14:29:44,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 8.728e+02 1.187e+03 1.757e+03 3.879e+03, threshold=2.373e+03, percent-clipped=3.0 2023-06-25 14:30:10,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2006856.0, ans=0.125 2023-06-25 14:30:32,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2006916.0, ans=0.125 2023-06-25 14:30:33,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2006916.0, ans=0.125 2023-06-25 14:30:34,362 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-25 14:30:37,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=15.0 2023-06-25 14:30:48,743 INFO [train.py:996] (0/4) Epoch 11, batch 29550, loss[loss=0.2326, simple_loss=0.2998, pruned_loss=0.08268, over 21662.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3047, pruned_loss=0.07893, over 4283412.08 frames. ], batch size: 263, lr: 2.59e-03, grad_scale: 16.0 2023-06-25 14:31:17,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2007036.0, ans=0.0 2023-06-25 14:31:48,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=2007096.0, ans=0.02 2023-06-25 14:31:54,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2007156.0, ans=0.125 2023-06-25 14:31:58,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2007156.0, ans=0.2 2023-06-25 14:32:42,228 INFO [train.py:996] (0/4) Epoch 11, batch 29600, loss[loss=0.2635, simple_loss=0.3506, pruned_loss=0.08823, over 21636.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3118, pruned_loss=0.08172, over 4286064.12 frames. 
], batch size: 263, lr: 2.59e-03, grad_scale: 32.0 2023-06-25 14:33:22,207 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.467e+02 8.361e+02 1.294e+03 2.319e+03 6.850e+03, threshold=2.587e+03, percent-clipped=23.0 2023-06-25 14:33:34,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2007396.0, ans=0.125 2023-06-25 14:33:44,324 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:34:02,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2007456.0, ans=0.125 2023-06-25 14:34:24,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=2007516.0, ans=22.5 2023-06-25 14:34:27,156 INFO [train.py:996] (0/4) Epoch 11, batch 29650, loss[loss=0.2289, simple_loss=0.297, pruned_loss=0.08036, over 21456.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.308, pruned_loss=0.07778, over 4282351.36 frames. ], batch size: 131, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:34:34,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2007576.0, ans=0.125 2023-06-25 14:34:56,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2007636.0, ans=0.2 2023-06-25 14:35:09,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=15.0 2023-06-25 14:35:40,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2007756.0, ans=0.0 2023-06-25 14:36:06,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2007816.0, ans=0.0 2023-06-25 14:36:16,273 INFO [train.py:996] (0/4) Epoch 11, batch 29700, loss[loss=0.2326, simple_loss=0.3498, pruned_loss=0.05777, over 19815.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3082, pruned_loss=0.07768, over 4271561.59 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:36:17,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-25 14:36:20,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2007876.0, ans=0.125 2023-06-25 14:36:25,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2007876.0, ans=0.125 2023-06-25 14:37:02,471 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.011e+02 9.154e+02 1.304e+03 2.529e+03 6.535e+03, threshold=2.607e+03, percent-clipped=22.0 2023-06-25 14:37:42,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2008116.0, ans=0.1 2023-06-25 14:38:01,629 INFO [train.py:996] (0/4) Epoch 11, batch 29750, loss[loss=0.2138, simple_loss=0.2893, pruned_loss=0.06913, over 21226.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3149, pruned_loss=0.0779, over 4280543.01 frames. 
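
The recurring optim.py lines report quartiles of recent gradient norms, the clipping threshold currently in force, and the share of recent batches that were clipped. The sketch below shows one way such statistics could be maintained; the window size and the median-based threshold rule are assumptions for illustration, not the ScaledAdam clipping logic in optim.py.

```python
# Hedged sketch: keep a window of recent gradient norms, report five quantiles
# (min/25%/50%/75%/max), and clip against clipping_scale * median. The window
# size and threshold rule are illustrative assumptions, not the optim.py code.
from collections import deque

import torch


class GradNormClipper:
    def __init__(self, model: torch.nn.Module, clipping_scale: float = 2.0, window: int = 200):
        self.model = model
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.batches = 0
        self.clipped = 0

    def clip(self) -> None:
        params = [p for p in self.model.parameters() if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms.append(norm)
        self.batches += 1
        median = sorted(self.norms)[len(self.norms) // 2]
        threshold = self.clipping_scale * median
        if norm > threshold:
            self.clipped += 1
        torch.nn.utils.clip_grad_norm_(params, max_norm=threshold)

    def report(self) -> str:
        q = torch.quantile(torch.tensor(list(self.norms)),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        quartiles = " ".join(f"{v:.3e}" for v in q.tolist())
        pct = 100.0 * self.clipped / max(self.batches, 1)
        return f"grad-norm quartiles {quartiles}, percent-clipped={pct:.1f}"
```
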
], batch size: 143, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:38:02,194 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:38:03,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2008176.0, ans=0.2 2023-06-25 14:38:09,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=2008176.0, ans=15.0 2023-06-25 14:38:45,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2008296.0, ans=0.0 2023-06-25 14:38:50,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2008296.0, ans=0.1 2023-06-25 14:39:24,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2008416.0, ans=0.09899494936611666 2023-06-25 14:39:34,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=2008416.0, ans=10.0 2023-06-25 14:39:42,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2008416.0, ans=0.0 2023-06-25 14:39:45,424 INFO [train.py:996] (0/4) Epoch 11, batch 29800, loss[loss=0.2311, simple_loss=0.3017, pruned_loss=0.08023, over 21328.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3154, pruned_loss=0.07805, over 4279079.07 frames. ], batch size: 144, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:39:47,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2008476.0, ans=0.125 2023-06-25 14:40:28,309 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:40:30,964 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.324e+02 8.347e+02 1.266e+03 1.868e+03 3.431e+03, threshold=2.532e+03, percent-clipped=7.0 2023-06-25 14:41:18,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-25 14:41:30,274 INFO [train.py:996] (0/4) Epoch 11, batch 29850, loss[loss=0.2205, simple_loss=0.3003, pruned_loss=0.07031, over 21542.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3113, pruned_loss=0.07643, over 4283141.18 frames. ], batch size: 131, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:41:50,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2008776.0, ans=0.125 2023-06-25 14:42:36,610 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.58 vs. 
limit=22.5 2023-06-25 14:42:44,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2008956.0, ans=0.125 2023-06-25 14:42:44,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2008956.0, ans=0.125 2023-06-25 14:43:01,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=2009016.0, ans=0.5 2023-06-25 14:43:06,333 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:43:16,128 INFO [train.py:996] (0/4) Epoch 11, batch 29900, loss[loss=0.2377, simple_loss=0.3614, pruned_loss=0.05701, over 19805.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3091, pruned_loss=0.0772, over 4285618.68 frames. ], batch size: 703, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:43:19,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2009076.0, ans=0.125 2023-06-25 14:43:56,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2009136.0, ans=0.1 2023-06-25 14:44:02,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.086e+02 7.684e+02 1.155e+03 1.725e+03 4.466e+03, threshold=2.311e+03, percent-clipped=10.0 2023-06-25 14:44:51,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2009316.0, ans=0.125 2023-06-25 14:44:52,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2009316.0, ans=0.0 2023-06-25 14:45:08,297 INFO [train.py:996] (0/4) Epoch 11, batch 29950, loss[loss=0.278, simple_loss=0.3459, pruned_loss=0.105, over 21347.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.313, pruned_loss=0.08073, over 4277830.12 frames. ], batch size: 176, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:45:18,355 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-25 14:46:17,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-25 14:46:22,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2009556.0, ans=0.0 2023-06-25 14:46:49,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2009616.0, ans=0.035 2023-06-25 14:46:51,625 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=22.5 2023-06-25 14:46:54,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2009676.0, ans=0.125 2023-06-25 14:46:55,291 INFO [train.py:996] (0/4) Epoch 11, batch 30000, loss[loss=0.2155, simple_loss=0.3041, pruned_loss=0.06347, over 21651.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3147, pruned_loss=0.08116, over 4269600.26 frames. 
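
The scaling.py "Whitening" lines compare a per-module statistic ("metric") against a limit, and a penalty only applies when the metric exceeds that limit. As a rough intuition, the sketch below computes one plausible whiteness proxy, mean(λ²)/mean(λ)² over the eigenvalues of the channel covariance, which equals 1 for perfectly white activations and grows as variance concentrates in a few directions. The actual Whiten module computes its statistic differently; this only makes the metric-vs-limit comparison concrete.

```python
# Illustrative "whiteness" diagnostic (an assumption, not the scaling.py Whiten
# module): mean of squared covariance eigenvalues over the squared mean eigenvalue.
# Equals 1.0 for perfectly white activations; larger means less white.
import torch


def whiteness_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations from one module.
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)          # real, non-negative eigenvalues
    return float(eigs.pow(2).mean() / eigs.mean() ** 2)


x = torch.randn(1000, 256) * torch.linspace(0.1, 3.0, 256)  # deliberately non-white
print(f"metric={whiteness_metric(x):.2f} vs. limit=15.0")   # penalised only above the limit
```
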
], batch size: 263, lr: 2.59e-03, grad_scale: 16.0 2023-06-25 14:46:55,293 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 14:47:14,793 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2475, simple_loss=0.3451, pruned_loss=0.07497, over 1796401.00 frames. 2023-06-25 14:47:14,794 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 14:47:32,132 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-25 14:48:03,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.762e+02 8.420e+02 1.340e+03 1.867e+03 3.638e+03, threshold=2.681e+03, percent-clipped=9.0 2023-06-25 14:48:04,516 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=15.0 2023-06-25 14:48:57,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2009916.0, ans=0.125 2023-06-25 14:49:16,228 INFO [train.py:996] (0/4) Epoch 11, batch 30050, loss[loss=0.2395, simple_loss=0.3781, pruned_loss=0.05047, over 20788.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3181, pruned_loss=0.07792, over 4272010.64 frames. ], batch size: 607, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:49:55,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2010096.0, ans=0.125 2023-06-25 14:49:57,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2010096.0, ans=0.125 2023-06-25 14:50:28,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2010156.0, ans=0.125 2023-06-25 14:50:43,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2010216.0, ans=0.1 2023-06-25 14:51:01,291 INFO [train.py:996] (0/4) Epoch 11, batch 30100, loss[loss=0.2237, simple_loss=0.2909, pruned_loss=0.07824, over 21781.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3171, pruned_loss=0.07733, over 4265888.47 frames. ], batch size: 351, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:51:18,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2010336.0, ans=0.125 2023-06-25 14:51:44,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.182e+02 9.788e+02 1.555e+03 2.396e+03 5.388e+03, threshold=3.111e+03, percent-clipped=17.0 2023-06-25 14:51:53,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2010396.0, ans=0.0 2023-06-25 14:52:37,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2010516.0, ans=0.2 2023-06-25 14:52:49,219 INFO [train.py:996] (0/4) Epoch 11, batch 30150, loss[loss=0.2088, simple_loss=0.2759, pruned_loss=0.07083, over 21614.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3119, pruned_loss=0.07842, over 4266742.91 frames. 
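
Just above, training pauses at batch 30000 to compute a validation loss over a fixed pool of frames and then logs the peak GPU memory. Below is a hedged sketch of such a periodic validation pass; the interval, helper names, and bookkeeping are illustrative assumptions rather than the train.py implementation.

```python
# Hedged sketch of a periodic validation pass behind the "Computing validation
# loss" / "validation: loss=..." lines. Interval and helper names are assumptions.
import torch


def maybe_validate(model, valid_dl, compute_loss, batch_idx_train: int,
                   valid_interval: int = 3000) -> None:
    if batch_idx_train % valid_interval != 0:
        return
    model.eval()
    loss_sum, frame_sum = 0.0, 0.0
    with torch.no_grad():                      # no gradients during validation
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch)
            loss_sum += float(loss) * num_frames
            frame_sum += num_frames
    model.train()                              # back to training mode
    print(f"validation: loss={loss_sum / frame_sum:.4g}, over {frame_sum:.2f} frames.")
    if torch.cuda.is_available():
        mb = torch.cuda.max_memory_allocated() // 2**20
        print(f"Maximum memory allocated so far is {mb}MB")
```
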
], batch size: 298, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:53:06,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2010576.0, ans=0.0 2023-06-25 14:53:40,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2010696.0, ans=0.1 2023-06-25 14:54:15,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2010756.0, ans=0.1 2023-06-25 14:54:18,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2010756.0, ans=0.04949747468305833 2023-06-25 14:54:28,392 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=15.0 2023-06-25 14:54:46,759 INFO [train.py:996] (0/4) Epoch 11, batch 30200, loss[loss=0.3152, simple_loss=0.4001, pruned_loss=0.1152, over 21408.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3153, pruned_loss=0.07843, over 4270515.74 frames. ], batch size: 507, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:54:58,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2010876.0, ans=0.125 2023-06-25 14:55:12,891 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.38 vs. limit=10.0 2023-06-25 14:55:33,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2010996.0, ans=0.125 2023-06-25 14:55:34,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.364e+02 8.137e+02 1.157e+03 1.769e+03 3.974e+03, threshold=2.314e+03, percent-clipped=2.0 2023-06-25 14:55:42,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2010996.0, ans=0.07 2023-06-25 14:55:58,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2011056.0, ans=0.0 2023-06-25 14:56:06,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2011056.0, ans=0.1 2023-06-25 14:56:34,989 INFO [train.py:996] (0/4) Epoch 11, batch 30250, loss[loss=0.3263, simple_loss=0.4285, pruned_loss=0.112, over 21255.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3232, pruned_loss=0.08097, over 4269760.37 frames. ], batch size: 549, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:57:37,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2011296.0, ans=0.0 2023-06-25 14:58:02,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2011416.0, ans=0.125 2023-06-25 14:58:20,285 INFO [train.py:996] (0/4) Epoch 11, batch 30300, loss[loss=0.2344, simple_loss=0.2936, pruned_loss=0.08762, over 22017.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3214, pruned_loss=0.08164, over 4270692.60 frames. 
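
The many scaling.py ScheduledFloat lines record module hyperparameters (dropout probabilities, skip rates, balancer limits) whose values are looked up from a schedule keyed on the global batch_count. The sketch below shows a generic piecewise-linear schedule of that kind; the breakpoints are made-up examples, not the values this run actually used.

```python
# Generic piecewise-linear schedule keyed on batch_count, in the spirit of the
# ScheduledFloat values logged above. Breakpoints here are illustrative only.
import bisect


class PiecewiseLinearSchedule:
    def __init__(self, points):
        # points: [(batch_count, value), ...], sorted by batch_count.
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)


# Example: a dropout probability that anneals early in training, then stays flat.
dropout_p = PiecewiseLinearSchedule([(0.0, 0.3), (20000.0, 0.1)])
print(dropout_p(0.0), dropout_p(2010696.0))  # 0.3 at the start, 0.1 by this point in the log
```
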
], batch size: 103, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:58:34,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2011476.0, ans=0.0 2023-06-25 14:59:10,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2011596.0, ans=0.2 2023-06-25 14:59:11,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2011596.0, ans=0.2 2023-06-25 14:59:14,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.082e+02 1.032e+03 1.380e+03 1.875e+03 4.556e+03, threshold=2.761e+03, percent-clipped=17.0 2023-06-25 14:59:28,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=2011656.0, ans=0.05 2023-06-25 14:59:41,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2011656.0, ans=0.0 2023-06-25 15:00:09,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2011716.0, ans=0.125 2023-06-25 15:00:21,512 INFO [train.py:996] (0/4) Epoch 11, batch 30350, loss[loss=0.3428, simple_loss=0.4322, pruned_loss=0.1267, over 21456.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3205, pruned_loss=0.08251, over 4269474.69 frames. ], batch size: 471, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 15:00:40,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-25 15:01:01,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2011896.0, ans=0.125 2023-06-25 15:01:09,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2011896.0, ans=0.125 2023-06-25 15:01:13,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2011956.0, ans=0.0 2023-06-25 15:01:18,896 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:01:29,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2012016.0, ans=0.0 2023-06-25 15:01:43,071 INFO [train.py:996] (0/4) Epoch 11, batch 30400, loss[loss=0.2152, simple_loss=0.2733, pruned_loss=0.07857, over 20263.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3147, pruned_loss=0.08018, over 4260592.29 frames. 
], batch size: 703, lr: 2.59e-03, grad_scale: 16.0 2023-06-25 15:02:23,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2012196.0, ans=0.2 2023-06-25 15:02:24,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.345e+02 1.116e+03 1.633e+03 2.614e+03 1.022e+04, threshold=3.266e+03, percent-clipped=19.0 2023-06-25 15:02:37,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2012256.0, ans=0.2 2023-06-25 15:03:00,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2012316.0, ans=0.05 2023-06-25 15:03:03,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2012316.0, ans=0.0 2023-06-25 15:03:11,945 INFO [train.py:996] (0/4) Epoch 11, batch 30450, loss[loss=0.2735, simple_loss=0.3633, pruned_loss=0.09187, over 19977.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3156, pruned_loss=0.0793, over 4201611.04 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 15:03:33,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2012436.0, ans=0.2 2023-06-25 15:03:43,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2012436.0, ans=0.125 2023-06-25 15:04:09,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2012556.0, ans=0.0 2023-06-25 15:04:27,244 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/epoch-11.pt 2023-06-25 15:06:14,790 INFO [train.py:996] (0/4) Epoch 12, batch 0, loss[loss=0.224, simple_loss=0.2849, pruned_loss=0.08155, over 21538.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2849, pruned_loss=0.08155, over 21538.00 frames. ], batch size: 263, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:06:14,792 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 15:06:38,468 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.246, simple_loss=0.3509, pruned_loss=0.07057, over 1796401.00 frames. 2023-06-25 15:06:38,469 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 15:06:45,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2012646.0, ans=0.125 2023-06-25 15:06:58,271 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-25 15:07:29,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.737e+02 2.108e+03 3.291e+03 4.750e+03 1.246e+04, threshold=6.583e+03, percent-clipped=51.0 2023-06-25 15:07:40,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2012826.0, ans=0.125 2023-06-25 15:07:44,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2012826.0, ans=0.0 2023-06-25 15:08:24,010 INFO [train.py:996] (0/4) Epoch 12, batch 50, loss[loss=0.2623, simple_loss=0.3326, pruned_loss=0.09604, over 21406.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.324, pruned_loss=0.08309, over 961914.56 frames. 
], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:08:34,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2012946.0, ans=0.125 2023-06-25 15:09:23,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2013126.0, ans=0.125 2023-06-25 15:09:37,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-25 15:09:38,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2013186.0, ans=0.125 2023-06-25 15:10:07,041 INFO [train.py:996] (0/4) Epoch 12, batch 100, loss[loss=0.3172, simple_loss=0.4023, pruned_loss=0.116, over 21547.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.34, pruned_loss=0.08514, over 1691813.59 frames. ], batch size: 471, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:10:15,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2013246.0, ans=0.125 2023-06-25 15:10:16,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.22 vs. limit=15.0 2023-06-25 15:10:37,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2013306.0, ans=0.125 2023-06-25 15:10:45,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2013366.0, ans=0.1 2023-06-25 15:11:03,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.950e+02 8.744e+02 1.296e+03 2.082e+03 4.002e+03, threshold=2.593e+03, percent-clipped=0.0 2023-06-25 15:11:43,853 INFO [train.py:996] (0/4) Epoch 12, batch 150, loss[loss=0.2338, simple_loss=0.3346, pruned_loss=0.06653, over 21399.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3395, pruned_loss=0.08477, over 2257920.52 frames. ], batch size: 211, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:11:53,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2013546.0, ans=0.2 2023-06-25 15:12:01,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2013546.0, ans=0.025 2023-06-25 15:13:09,328 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=12.0 2023-06-25 15:13:32,400 INFO [train.py:996] (0/4) Epoch 12, batch 200, loss[loss=0.1924, simple_loss=0.2777, pruned_loss=0.05352, over 21404.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3363, pruned_loss=0.08477, over 2704725.98 frames. ], batch size: 194, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:13:48,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2013846.0, ans=0.2 2023-06-25 15:13:50,137 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. 
limit=15.0 2023-06-25 15:14:31,670 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.459e+02 8.933e+02 1.301e+03 1.792e+03 3.949e+03, threshold=2.602e+03, percent-clipped=5.0 2023-06-25 15:15:00,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2014086.0, ans=0.125 2023-06-25 15:15:07,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2014086.0, ans=0.125 2023-06-25 15:15:20,334 INFO [train.py:996] (0/4) Epoch 12, batch 250, loss[loss=0.2103, simple_loss=0.2852, pruned_loss=0.06768, over 21883.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3307, pruned_loss=0.08333, over 3061485.31 frames. ], batch size: 124, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:15:30,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2014146.0, ans=0.1 2023-06-25 15:17:00,579 INFO [train.py:996] (0/4) Epoch 12, batch 300, loss[loss=0.1775, simple_loss=0.2401, pruned_loss=0.05745, over 21602.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3261, pruned_loss=0.08299, over 3331945.91 frames. ], batch size: 231, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:17:20,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-25 15:17:37,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2014506.0, ans=0.0 2023-06-25 15:18:01,871 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.795e+02 8.285e+02 1.102e+03 1.636e+03 4.756e+03, threshold=2.203e+03, percent-clipped=8.0 2023-06-25 15:18:12,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-25 15:18:16,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2014626.0, ans=0.0 2023-06-25 15:18:49,641 INFO [train.py:996] (0/4) Epoch 12, batch 350, loss[loss=0.2209, simple_loss=0.2881, pruned_loss=0.07684, over 20091.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3169, pruned_loss=0.08028, over 3534625.33 frames. ], batch size: 704, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:19:01,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0 2023-06-25 15:19:03,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. 
limit=15.0 2023-06-25 15:19:43,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2014866.0, ans=0.2 2023-06-25 15:19:45,183 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:20:06,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2014926.0, ans=0.2 2023-06-25 15:20:08,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2014926.0, ans=0.125 2023-06-25 15:20:17,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2014986.0, ans=0.0 2023-06-25 15:20:37,694 INFO [train.py:996] (0/4) Epoch 12, batch 400, loss[loss=0.2168, simple_loss=0.2803, pruned_loss=0.07662, over 21308.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3105, pruned_loss=0.07863, over 3697506.38 frames. ], batch size: 131, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:20:53,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2015046.0, ans=0.0 2023-06-25 15:21:05,998 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=8.0 2023-06-25 15:21:37,414 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.810e+02 9.305e+02 1.280e+03 1.857e+03 4.239e+03, threshold=2.560e+03, percent-clipped=17.0 2023-06-25 15:22:04,606 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-25 15:22:23,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2015346.0, ans=0.2 2023-06-25 15:22:24,716 INFO [train.py:996] (0/4) Epoch 12, batch 450, loss[loss=0.2833, simple_loss=0.3861, pruned_loss=0.09021, over 21686.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3056, pruned_loss=0.07747, over 3832609.88 frames. ], batch size: 414, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:22:42,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2015346.0, ans=0.05 2023-06-25 15:22:45,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2015406.0, ans=0.125 2023-06-25 15:22:56,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2015406.0, ans=0.2 2023-06-25 15:23:33,988 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.19 vs. limit=12.0 2023-06-25 15:23:40,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2015526.0, ans=0.1 2023-06-25 15:24:15,400 INFO [train.py:996] (0/4) Epoch 12, batch 500, loss[loss=0.2432, simple_loss=0.3436, pruned_loss=0.07144, over 21742.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3134, pruned_loss=0.07713, over 3934537.76 frames. 
], batch size: 332, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:24:56,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2015766.0, ans=0.125 2023-06-25 15:25:14,861 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.486e+02 9.539e+02 1.426e+03 2.120e+03 6.298e+03, threshold=2.852e+03, percent-clipped=19.0 2023-06-25 15:26:02,525 INFO [train.py:996] (0/4) Epoch 12, batch 550, loss[loss=0.2468, simple_loss=0.3676, pruned_loss=0.06295, over 19898.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3186, pruned_loss=0.07675, over 4016666.95 frames. ], batch size: 703, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:26:16,043 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-336000.pt 2023-06-25 15:27:04,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-25 15:27:39,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2016186.0, ans=0.125 2023-06-25 15:27:48,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2023-06-25 15:27:49,317 INFO [train.py:996] (0/4) Epoch 12, batch 600, loss[loss=0.2614, simple_loss=0.3795, pruned_loss=0.07159, over 21239.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3213, pruned_loss=0.07686, over 4072396.16 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:28:49,901 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.802e+02 1.034e+03 1.598e+03 2.241e+03 5.970e+03, threshold=3.196e+03, percent-clipped=11.0 2023-06-25 15:28:54,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2016426.0, ans=0.04949747468305833 2023-06-25 15:29:38,766 INFO [train.py:996] (0/4) Epoch 12, batch 650, loss[loss=0.2081, simple_loss=0.283, pruned_loss=0.06664, over 19916.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3185, pruned_loss=0.07754, over 4116980.45 frames. ], batch size: 704, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:31:28,088 INFO [train.py:996] (0/4) Epoch 12, batch 700, loss[loss=0.2538, simple_loss=0.328, pruned_loss=0.08973, over 21773.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3159, pruned_loss=0.07852, over 4148512.73 frames. ], batch size: 112, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:32:28,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.353e+02 8.196e+02 1.234e+03 2.056e+03 5.759e+03, threshold=2.467e+03, percent-clipped=11.0 2023-06-25 15:32:29,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=12.0 2023-06-25 15:32:46,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2017086.0, ans=0.07 2023-06-25 15:32:50,848 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-25 15:33:16,922 INFO [train.py:996] (0/4) Epoch 12, batch 750, loss[loss=0.2435, simple_loss=0.2996, pruned_loss=0.09368, over 21729.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3146, pruned_loss=0.07956, over 4179381.55 frames. 
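
Two checkpoint flavours appear in this log: an end-of-epoch file (zipformer/exp_L_small_causal/epoch-11.pt, saved a little earlier) and a periodic file keyed on the global batch index (checkpoint-336000.pt just above). A hedged sketch of that cadence follows; the saved fields and the helper name are illustrative, not the icefall checkpoint.py API.

```python
# Hedged sketch of the two checkpoint cadences visible in the log: epoch-<n>.pt at
# the end of each epoch, checkpoint-<batch_idx>.pt every save_every_n batches.
# Field names and the helper itself are illustrative assumptions.
from pathlib import Path

import torch


def maybe_save_checkpoint(model, optimizer, exp_dir: Path, epoch: int,
                          batch_idx_train: int, save_every_n: int,
                          end_of_epoch: bool) -> None:
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
        "batch_idx_train": batch_idx_train,
    }
    if end_of_epoch:
        torch.save(state, exp_dir / f"epoch-{epoch}.pt")                 # e.g. epoch-11.pt
    elif batch_idx_train > 0 and batch_idx_train % save_every_n == 0:
        torch.save(state, exp_dir / f"checkpoint-{batch_idx_train}.pt")  # e.g. checkpoint-336000.pt
```
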
], batch size: 298, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:34:14,547 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-25 15:34:29,779 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0 2023-06-25 15:35:08,461 INFO [train.py:996] (0/4) Epoch 12, batch 800, loss[loss=0.2096, simple_loss=0.2779, pruned_loss=0.07059, over 21871.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3092, pruned_loss=0.07885, over 4209128.22 frames. ], batch size: 283, lr: 2.47e-03, grad_scale: 32.0 2023-06-25 15:35:42,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2017506.0, ans=0.1 2023-06-25 15:35:51,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=2017566.0, ans=0.5 2023-06-25 15:35:51,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2017566.0, ans=0.0 2023-06-25 15:36:10,840 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.338e+02 9.265e+02 1.322e+03 1.960e+03 3.991e+03, threshold=2.645e+03, percent-clipped=16.0 2023-06-25 15:36:13,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-25 15:36:45,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2017686.0, ans=0.125 2023-06-25 15:36:58,023 INFO [train.py:996] (0/4) Epoch 12, batch 850, loss[loss=0.2188, simple_loss=0.304, pruned_loss=0.06681, over 21651.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3073, pruned_loss=0.07801, over 4232255.92 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:37:32,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2017806.0, ans=0.125 2023-06-25 15:37:46,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2017866.0, ans=0.125 2023-06-25 15:37:56,155 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.13 vs. limit=15.0 2023-06-25 15:38:38,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=2017986.0, ans=0.2 2023-06-25 15:38:46,361 INFO [train.py:996] (0/4) Epoch 12, batch 900, loss[loss=0.2093, simple_loss=0.307, pruned_loss=0.0558, over 21829.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3048, pruned_loss=0.07803, over 4248890.03 frames. ], batch size: 372, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:38:46,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2018046.0, ans=0.2 2023-06-25 15:38:49,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. 
limit=15.0 2023-06-25 15:39:03,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2018046.0, ans=0.04949747468305833 2023-06-25 15:39:12,888 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-25 15:39:35,045 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:39:49,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.898e+02 1.048e+03 1.588e+03 2.681e+03 4.714e+03, threshold=3.177e+03, percent-clipped=25.0 2023-06-25 15:40:23,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2018286.0, ans=0.125 2023-06-25 15:40:32,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2018286.0, ans=0.125 2023-06-25 15:40:37,413 INFO [train.py:996] (0/4) Epoch 12, batch 950, loss[loss=0.2342, simple_loss=0.3111, pruned_loss=0.07861, over 21864.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3039, pruned_loss=0.07735, over 4258397.85 frames. ], batch size: 107, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:40:46,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2023-06-25 15:40:50,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2018346.0, ans=0.035 2023-06-25 15:40:52,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2018346.0, ans=0.125 2023-06-25 15:42:25,967 INFO [train.py:996] (0/4) Epoch 12, batch 1000, loss[loss=0.2666, simple_loss=0.3448, pruned_loss=0.0942, over 21376.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3038, pruned_loss=0.07614, over 4262917.06 frames. ], batch size: 131, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:42:46,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-25 15:43:29,123 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.398e+02 9.331e+02 1.352e+03 1.940e+03 3.326e+03, threshold=2.703e+03, percent-clipped=1.0 2023-06-25 15:43:42,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2018826.0, ans=0.04949747468305833 2023-06-25 15:44:04,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2018886.0, ans=0.07 2023-06-25 15:44:21,949 INFO [train.py:996] (0/4) Epoch 12, batch 1050, loss[loss=0.2351, simple_loss=0.3024, pruned_loss=0.08388, over 21327.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3042, pruned_loss=0.07713, over 4276274.58 frames. ], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:44:47,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2019006.0, ans=0.1 2023-06-25 15:44:51,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.32 vs. 
limit=15.0 2023-06-25 15:44:59,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2019006.0, ans=0.125 2023-06-25 15:45:02,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=2019066.0, ans=0.05 2023-06-25 15:45:35,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2019126.0, ans=0.1 2023-06-25 15:46:05,027 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:46:11,233 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-25 15:46:13,563 INFO [train.py:996] (0/4) Epoch 12, batch 1100, loss[loss=0.2171, simple_loss=0.3176, pruned_loss=0.05828, over 21700.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3068, pruned_loss=0.07695, over 4275669.81 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:46:17,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2019246.0, ans=0.1 2023-06-25 15:46:26,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2019246.0, ans=0.07 2023-06-25 15:46:38,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2019306.0, ans=0.1 2023-06-25 15:46:43,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2019306.0, ans=0.125 2023-06-25 15:47:07,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2019366.0, ans=0.07 2023-06-25 15:47:12,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.833e+02 8.143e+02 1.241e+03 1.792e+03 5.093e+03, threshold=2.482e+03, percent-clipped=8.0 2023-06-25 15:47:24,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-25 15:47:59,974 INFO [train.py:996] (0/4) Epoch 12, batch 1150, loss[loss=0.2389, simple_loss=0.3248, pruned_loss=0.07648, over 21747.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3088, pruned_loss=0.07654, over 4279191.29 frames. ], batch size: 414, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:48:38,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2019666.0, ans=0.2 2023-06-25 15:48:50,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-25 15:49:37,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2019786.0, ans=0.0 2023-06-25 15:49:56,078 INFO [train.py:996] (0/4) Epoch 12, batch 1200, loss[loss=0.198, simple_loss=0.2661, pruned_loss=0.06495, over 21266.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3094, pruned_loss=0.0772, over 4283418.96 frames. 
], batch size: 608, lr: 2.47e-03, grad_scale: 32.0 2023-06-25 15:50:23,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2019906.0, ans=0.125 2023-06-25 15:50:59,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2020026.0, ans=0.0 2023-06-25 15:51:00,437 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.676e+02 8.295e+02 1.207e+03 1.708e+03 3.534e+03, threshold=2.414e+03, percent-clipped=4.0 2023-06-25 15:51:47,250 INFO [train.py:996] (0/4) Epoch 12, batch 1250, loss[loss=0.2245, simple_loss=0.3105, pruned_loss=0.06923, over 16654.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3095, pruned_loss=0.07813, over 4273309.16 frames. ], batch size: 61, lr: 2.47e-03, grad_scale: 32.0 2023-06-25 15:52:05,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2020206.0, ans=0.125 2023-06-25 15:52:16,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2020206.0, ans=0.04949747468305833 2023-06-25 15:52:29,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2020266.0, ans=0.125 2023-06-25 15:52:57,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2020326.0, ans=0.04949747468305833 2023-06-25 15:53:41,565 INFO [train.py:996] (0/4) Epoch 12, batch 1300, loss[loss=0.2385, simple_loss=0.3246, pruned_loss=0.07615, over 21756.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3107, pruned_loss=0.07918, over 4280242.69 frames. ], batch size: 351, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:53:48,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2020446.0, ans=0.1 2023-06-25 15:53:58,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2020506.0, ans=0.125 2023-06-25 15:54:46,925 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.616e+02 1.419e+03 1.872e+03 2.553e+03 5.619e+03, threshold=3.744e+03, percent-clipped=29.0 2023-06-25 15:54:59,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2020626.0, ans=0.1 2023-06-25 15:55:25,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2020686.0, ans=0.125 2023-06-25 15:55:30,095 INFO [train.py:996] (0/4) Epoch 12, batch 1350, loss[loss=0.26, simple_loss=0.335, pruned_loss=0.09247, over 21690.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3123, pruned_loss=0.07949, over 4284094.71 frames. ], batch size: 351, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:55:46,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2020806.0, ans=0.0 2023-06-25 15:56:42,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.12 vs. 
limit=10.0 2023-06-25 15:57:04,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2020986.0, ans=0.1 2023-06-25 15:57:12,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2020986.0, ans=0.0 2023-06-25 15:57:18,671 INFO [train.py:996] (0/4) Epoch 12, batch 1400, loss[loss=0.2111, simple_loss=0.2689, pruned_loss=0.07668, over 21559.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3102, pruned_loss=0.0795, over 4281439.25 frames. ], batch size: 196, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:58:30,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2021226.0, ans=0.0 2023-06-25 15:58:31,004 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.714e+02 6.812e+02 9.835e+02 1.527e+03 2.832e+03, threshold=1.967e+03, percent-clipped=0.0 2023-06-25 15:58:38,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2021226.0, ans=0.2 2023-06-25 15:59:07,938 INFO [train.py:996] (0/4) Epoch 12, batch 1450, loss[loss=0.2706, simple_loss=0.339, pruned_loss=0.1011, over 21433.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3089, pruned_loss=0.0792, over 4281965.62 frames. ], batch size: 131, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:00:01,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2021466.0, ans=0.1 2023-06-25 16:00:25,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2021526.0, ans=0.125 2023-06-25 16:00:54,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2021586.0, ans=0.125 2023-06-25 16:00:56,917 INFO [train.py:996] (0/4) Epoch 12, batch 1500, loss[loss=0.2263, simple_loss=0.3036, pruned_loss=0.07452, over 21888.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.309, pruned_loss=0.0804, over 4286330.36 frames. ], batch size: 118, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:01:04,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.92 vs. limit=10.0 2023-06-25 16:02:04,720 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.651e+02 8.922e+02 1.288e+03 1.827e+03 4.851e+03, threshold=2.577e+03, percent-clipped=21.0 2023-06-25 16:02:43,233 INFO [train.py:996] (0/4) Epoch 12, batch 1550, loss[loss=0.2133, simple_loss=0.2886, pruned_loss=0.06904, over 21492.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3088, pruned_loss=0.08036, over 4277471.07 frames. 
], batch size: 211, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:03:16,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2022006.0, ans=0.0 2023-06-25 16:03:18,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2022006.0, ans=0.0 2023-06-25 16:03:38,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2022066.0, ans=0.2 2023-06-25 16:03:52,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=2022066.0, ans=15.0 2023-06-25 16:04:05,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2022126.0, ans=0.0 2023-06-25 16:04:16,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2022126.0, ans=0.125 2023-06-25 16:04:36,329 INFO [train.py:996] (0/4) Epoch 12, batch 1600, loss[loss=0.1966, simple_loss=0.263, pruned_loss=0.06507, over 21204.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3082, pruned_loss=0.08002, over 4284036.91 frames. ], batch size: 159, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:05:10,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-25 16:05:31,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2022366.0, ans=0.125 2023-06-25 16:05:47,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2022366.0, ans=0.09899494936611666 2023-06-25 16:06:00,529 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.153e+02 9.608e+02 1.362e+03 1.874e+03 5.231e+03, threshold=2.724e+03, percent-clipped=11.0 2023-06-25 16:06:38,386 INFO [train.py:996] (0/4) Epoch 12, batch 1650, loss[loss=0.1984, simple_loss=0.2911, pruned_loss=0.05285, over 21771.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3072, pruned_loss=0.07929, over 4282438.35 frames. ], batch size: 351, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:07:26,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2022666.0, ans=0.0 2023-06-25 16:07:56,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2022726.0, ans=0.125 2023-06-25 16:08:05,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2022726.0, ans=0.0 2023-06-25 16:08:10,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2022786.0, ans=0.1 2023-06-25 16:08:31,818 INFO [train.py:996] (0/4) Epoch 12, batch 1700, loss[loss=0.2401, simple_loss=0.3155, pruned_loss=0.08234, over 21162.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3099, pruned_loss=0.0803, over 4279918.16 frames. 
], batch size: 143, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:08:43,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2022846.0, ans=0.125 2023-06-25 16:08:50,746 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:09:45,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=2023026.0, ans=10.0 2023-06-25 16:09:45,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2023026.0, ans=0.125 2023-06-25 16:09:48,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.619e+02 9.466e+02 1.193e+03 1.792e+03 3.263e+03, threshold=2.387e+03, percent-clipped=3.0 2023-06-25 16:10:32,432 INFO [train.py:996] (0/4) Epoch 12, batch 1750, loss[loss=0.1344, simple_loss=0.188, pruned_loss=0.04038, over 17033.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.31, pruned_loss=0.07941, over 4277930.07 frames. ], batch size: 60, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:11:53,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2023326.0, ans=0.125 2023-06-25 16:12:16,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-25 16:12:29,661 INFO [train.py:996] (0/4) Epoch 12, batch 1800, loss[loss=0.2088, simple_loss=0.309, pruned_loss=0.05429, over 21628.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3077, pruned_loss=0.07597, over 4277598.56 frames. ], batch size: 230, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:12:52,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-25 16:13:31,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2023566.0, ans=0.1 2023-06-25 16:13:40,960 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.675e+02 9.606e+02 1.409e+03 2.077e+03 5.009e+03, threshold=2.818e+03, percent-clipped=17.0 2023-06-25 16:13:54,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2023626.0, ans=0.0 2023-06-25 16:14:21,237 INFO [train.py:996] (0/4) Epoch 12, batch 1850, loss[loss=0.2284, simple_loss=0.308, pruned_loss=0.0744, over 21557.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.307, pruned_loss=0.07325, over 4279268.32 frames. ], batch size: 441, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:14:23,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2023746.0, ans=0.2 2023-06-25 16:14:55,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-25 16:16:21,237 INFO [train.py:996] (0/4) Epoch 12, batch 1900, loss[loss=0.2064, simple_loss=0.2881, pruned_loss=0.06239, over 21764.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3097, pruned_loss=0.07444, over 4277621.49 frames. 
], batch size: 247, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:16:22,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-25 16:16:23,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2024046.0, ans=0.2 2023-06-25 16:16:48,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2024106.0, ans=0.1 2023-06-25 16:17:08,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2024166.0, ans=0.0 2023-06-25 16:17:30,698 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.998e+02 9.175e+02 1.448e+03 2.003e+03 3.751e+03, threshold=2.896e+03, percent-clipped=10.0 2023-06-25 16:18:08,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2024286.0, ans=0.125 2023-06-25 16:18:12,823 INFO [train.py:996] (0/4) Epoch 12, batch 1950, loss[loss=0.2213, simple_loss=0.2989, pruned_loss=0.07184, over 21546.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3064, pruned_loss=0.07456, over 4269874.63 frames. ], batch size: 212, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:18:16,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2024346.0, ans=0.125 2023-06-25 16:18:39,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2024406.0, ans=0.0 2023-06-25 16:18:59,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2024466.0, ans=0.0 2023-06-25 16:19:01,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2024466.0, ans=0.0 2023-06-25 16:20:01,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2024586.0, ans=0.2 2023-06-25 16:20:06,020 INFO [train.py:996] (0/4) Epoch 12, batch 2000, loss[loss=0.217, simple_loss=0.2825, pruned_loss=0.07572, over 21714.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3006, pruned_loss=0.07351, over 4275491.65 frames. ], batch size: 299, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:20:08,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2024646.0, ans=0.1 2023-06-25 16:20:22,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2024706.0, ans=0.0 2023-06-25 16:21:05,823 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:21:14,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2024826.0, ans=0.125 2023-06-25 16:21:15,641 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 9.911e+02 1.522e+03 2.173e+03 4.229e+03, threshold=3.044e+03, percent-clipped=10.0 2023-06-25 16:21:56,980 INFO [train.py:996] (0/4) Epoch 12, batch 2050, loss[loss=0.2378, simple_loss=0.312, pruned_loss=0.08176, over 21882.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2999, pruned_loss=0.0726, over 4272862.21 frames. 
], batch size: 118, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:22:11,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2024946.0, ans=0.125 2023-06-25 16:22:11,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-06-25 16:23:07,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2025126.0, ans=0.125 2023-06-25 16:23:34,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2025186.0, ans=0.125 2023-06-25 16:23:44,375 INFO [train.py:996] (0/4) Epoch 12, batch 2100, loss[loss=0.2456, simple_loss=0.3242, pruned_loss=0.08345, over 21727.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3032, pruned_loss=0.07393, over 4278213.35 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:23:50,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2025246.0, ans=0.125 2023-06-25 16:24:23,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2025306.0, ans=0.1 2023-06-25 16:24:25,872 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:24:29,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2025366.0, ans=0.125 2023-06-25 16:24:57,642 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.106e+02 8.786e+02 1.413e+03 2.170e+03 3.783e+03, threshold=2.827e+03, percent-clipped=9.0 2023-06-25 16:25:38,362 INFO [train.py:996] (0/4) Epoch 12, batch 2150, loss[loss=0.2163, simple_loss=0.2858, pruned_loss=0.07339, over 21321.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.304, pruned_loss=0.07574, over 4275389.24 frames. ], batch size: 131, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:25:46,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2025546.0, ans=0.125 2023-06-25 16:25:56,594 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0 2023-06-25 16:26:18,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2025606.0, ans=0.0 2023-06-25 16:26:51,767 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.37 vs. limit=10.0 2023-06-25 16:27:16,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2025786.0, ans=0.2 2023-06-25 16:27:31,534 INFO [train.py:996] (0/4) Epoch 12, batch 2200, loss[loss=0.245, simple_loss=0.3159, pruned_loss=0.08705, over 21792.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.304, pruned_loss=0.07636, over 4277667.61 frames. 
], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:27:37,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2025846.0, ans=0.0 2023-06-25 16:28:36,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2026026.0, ans=0.125 2023-06-25 16:28:46,653 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.633e+02 9.347e+02 1.379e+03 2.152e+03 4.543e+03, threshold=2.758e+03, percent-clipped=14.0 2023-06-25 16:28:49,821 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=12.0 2023-06-25 16:29:23,301 INFO [train.py:996] (0/4) Epoch 12, batch 2250, loss[loss=0.1777, simple_loss=0.2467, pruned_loss=0.05431, over 21756.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3026, pruned_loss=0.07488, over 4283535.54 frames. ], batch size: 118, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:29:32,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2026146.0, ans=0.0 2023-06-25 16:29:32,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2026146.0, ans=0.2 2023-06-25 16:29:44,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2026206.0, ans=0.0 2023-06-25 16:29:53,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2026206.0, ans=0.0 2023-06-25 16:30:09,801 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-25 16:30:53,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2026326.0, ans=0.125 2023-06-25 16:31:12,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2026386.0, ans=0.125 2023-06-25 16:31:12,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2026386.0, ans=0.2 2023-06-25 16:31:14,984 INFO [train.py:996] (0/4) Epoch 12, batch 2300, loss[loss=0.2051, simple_loss=0.2714, pruned_loss=0.06943, over 21615.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.298, pruned_loss=0.07455, over 4274712.43 frames. 
], batch size: 298, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:31:22,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2026446.0, ans=0.95 2023-06-25 16:31:25,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2026446.0, ans=0.0 2023-06-25 16:32:07,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2026566.0, ans=0.125 2023-06-25 16:32:31,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.590e+02 8.645e+02 1.249e+03 1.812e+03 4.519e+03, threshold=2.497e+03, percent-clipped=11.0 2023-06-25 16:32:54,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2026686.0, ans=0.2 2023-06-25 16:33:01,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2026686.0, ans=0.125 2023-06-25 16:33:07,261 INFO [train.py:996] (0/4) Epoch 12, batch 2350, loss[loss=0.2834, simple_loss=0.3417, pruned_loss=0.1126, over 21827.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2983, pruned_loss=0.07616, over 4276729.81 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:34:49,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2026986.0, ans=0.0 2023-06-25 16:34:59,632 INFO [train.py:996] (0/4) Epoch 12, batch 2400, loss[loss=0.2507, simple_loss=0.3484, pruned_loss=0.07648, over 17337.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3043, pruned_loss=0.07801, over 4274664.99 frames. ], batch size: 60, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 16:36:28,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.619e+02 8.583e+02 1.300e+03 1.930e+03 5.128e+03, threshold=2.600e+03, percent-clipped=13.0 2023-06-25 16:36:34,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2027226.0, ans=0.125 2023-06-25 16:36:43,739 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-25 16:37:04,199 INFO [train.py:996] (0/4) Epoch 12, batch 2450, loss[loss=0.2507, simple_loss=0.3042, pruned_loss=0.09857, over 21880.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3088, pruned_loss=0.07952, over 4278280.92 frames. ], batch size: 98, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:37:14,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2027346.0, ans=0.2 2023-06-25 16:37:20,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2027406.0, ans=0.125 2023-06-25 16:37:37,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2027406.0, ans=0.125 2023-06-25 16:37:41,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2027406.0, ans=0.0 2023-06-25 16:38:13,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.10 vs. 
limit=15.0 2023-06-25 16:38:53,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2027646.0, ans=0.125 2023-06-25 16:38:53,697 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-25 16:38:54,299 INFO [train.py:996] (0/4) Epoch 12, batch 2500, loss[loss=0.2464, simple_loss=0.358, pruned_loss=0.06736, over 20887.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3083, pruned_loss=0.07828, over 4276421.59 frames. ], batch size: 609, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:38:54,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2027646.0, ans=0.125 2023-06-25 16:39:30,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2027706.0, ans=0.125 2023-06-25 16:40:01,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.23 vs. limit=12.0 2023-06-25 16:40:11,316 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.584e+02 1.091e+03 1.592e+03 2.289e+03 5.240e+03, threshold=3.184e+03, percent-clipped=19.0 2023-06-25 16:40:22,428 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:40:29,359 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=22.5 2023-06-25 16:40:44,924 INFO [train.py:996] (0/4) Epoch 12, batch 2550, loss[loss=0.2263, simple_loss=0.3055, pruned_loss=0.07359, over 21804.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3048, pruned_loss=0.07688, over 4275427.02 frames. ], batch size: 124, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:41:00,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2028006.0, ans=0.125 2023-06-25 16:41:06,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.23 vs. limit=15.0 2023-06-25 16:42:05,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2028126.0, ans=0.0 2023-06-25 16:42:26,421 INFO [train.py:996] (0/4) Epoch 12, batch 2600, loss[loss=0.2427, simple_loss=0.3185, pruned_loss=0.08345, over 21734.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3052, pruned_loss=0.07809, over 4265982.03 frames. 
], batch size: 332, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:43:00,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2028306.0, ans=0.5 2023-06-25 16:43:41,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2028426.0, ans=0.125 2023-06-25 16:43:49,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.618e+02 9.833e+02 1.370e+03 2.330e+03 4.697e+03, threshold=2.739e+03, percent-clipped=12.0 2023-06-25 16:44:04,800 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:44:09,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2028486.0, ans=0.0 2023-06-25 16:44:24,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2028546.0, ans=0.125 2023-06-25 16:44:25,808 INFO [train.py:996] (0/4) Epoch 12, batch 2650, loss[loss=0.2503, simple_loss=0.318, pruned_loss=0.09133, over 21723.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3085, pruned_loss=0.08042, over 4268912.69 frames. ], batch size: 389, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:44:30,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-06-25 16:44:49,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2028606.0, ans=0.0 2023-06-25 16:45:31,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2028666.0, ans=0.0 2023-06-25 16:46:18,774 INFO [train.py:996] (0/4) Epoch 12, batch 2700, loss[loss=0.1813, simple_loss=0.235, pruned_loss=0.06374, over 20734.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3054, pruned_loss=0.07945, over 4260429.08 frames. ], batch size: 607, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:46:36,739 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-25 16:47:19,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2028966.0, ans=0.125 2023-06-25 16:47:35,851 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.499e+02 8.174e+02 1.316e+03 1.880e+03 3.948e+03, threshold=2.631e+03, percent-clipped=11.0 2023-06-25 16:47:36,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2029026.0, ans=0.0 2023-06-25 16:47:46,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2029086.0, ans=0.125 2023-06-25 16:48:09,371 INFO [train.py:996] (0/4) Epoch 12, batch 2750, loss[loss=0.245, simple_loss=0.3085, pruned_loss=0.0908, over 21539.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3044, pruned_loss=0.07958, over 4261210.87 frames. 
], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:49:18,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2029326.0, ans=0.0 2023-06-25 16:49:35,017 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.45 vs. limit=22.5 2023-06-25 16:49:55,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2029386.0, ans=0.2 2023-06-25 16:50:00,335 INFO [train.py:996] (0/4) Epoch 12, batch 2800, loss[loss=0.2586, simple_loss=0.325, pruned_loss=0.09616, over 21238.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3106, pruned_loss=0.08071, over 4268061.53 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 16:51:01,282 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-25 16:51:26,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.572e+02 1.005e+03 1.362e+03 2.253e+03 4.999e+03, threshold=2.724e+03, percent-clipped=18.0 2023-06-25 16:51:34,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2029686.0, ans=0.125 2023-06-25 16:51:46,871 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.77 vs. limit=15.0 2023-06-25 16:51:53,152 INFO [train.py:996] (0/4) Epoch 12, batch 2850, loss[loss=0.1852, simple_loss=0.2591, pruned_loss=0.05558, over 21677.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3102, pruned_loss=0.08054, over 4263808.76 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:52:00,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2029746.0, ans=0.1 2023-06-25 16:52:23,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2029806.0, ans=0.0 2023-06-25 16:52:30,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2029806.0, ans=0.1 2023-06-25 16:53:04,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-25 16:53:38,285 INFO [train.py:996] (0/4) Epoch 12, batch 2900, loss[loss=0.2347, simple_loss=0.3212, pruned_loss=0.07415, over 21430.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3114, pruned_loss=0.08165, over 4267798.89 frames. 
], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:53:51,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2030046.0, ans=0.1 2023-06-25 16:54:52,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2030226.0, ans=0.125 2023-06-25 16:54:54,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2030226.0, ans=0.2 2023-06-25 16:54:56,898 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.061e+02 9.423e+02 1.341e+03 2.226e+03 4.607e+03, threshold=2.681e+03, percent-clipped=12.0 2023-06-25 16:55:05,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2030226.0, ans=0.0 2023-06-25 16:55:28,752 INFO [train.py:996] (0/4) Epoch 12, batch 2950, loss[loss=0.3044, simple_loss=0.3599, pruned_loss=0.1245, over 21787.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3132, pruned_loss=0.08216, over 4272861.03 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:55:50,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2030406.0, ans=0.2 2023-06-25 16:55:50,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2030406.0, ans=0.125 2023-06-25 16:55:52,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2030406.0, ans=0.0 2023-06-25 16:56:07,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2030406.0, ans=0.0 2023-06-25 16:56:08,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2030406.0, ans=0.125 2023-06-25 16:56:26,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.89 vs. limit=12.0 2023-06-25 16:57:14,412 INFO [train.py:996] (0/4) Epoch 12, batch 3000, loss[loss=0.2621, simple_loss=0.3294, pruned_loss=0.09741, over 21505.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3166, pruned_loss=0.08185, over 4279111.44 frames. ], batch size: 194, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:57:14,414 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 16:57:41,082 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2513, simple_loss=0.3439, pruned_loss=0.07939, over 1796401.00 frames. 
2023-06-25 16:57:41,083 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 16:57:54,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2030646.0, ans=0.1 2023-06-25 16:58:22,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2030766.0, ans=0.125 2023-06-25 16:58:52,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.014e+02 9.167e+02 1.270e+03 1.811e+03 4.329e+03, threshold=2.541e+03, percent-clipped=6.0 2023-06-25 16:58:53,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2030826.0, ans=0.1 2023-06-25 16:59:07,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2030886.0, ans=0.0 2023-06-25 16:59:25,825 INFO [train.py:996] (0/4) Epoch 12, batch 3050, loss[loss=0.2664, simple_loss=0.3638, pruned_loss=0.08454, over 21342.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3172, pruned_loss=0.08054, over 4279003.68 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:00:01,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2031006.0, ans=0.125 2023-06-25 17:00:49,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2031186.0, ans=0.125 2023-06-25 17:01:18,220 INFO [train.py:996] (0/4) Epoch 12, batch 3100, loss[loss=0.2181, simple_loss=0.3216, pruned_loss=0.05731, over 21239.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3166, pruned_loss=0.07985, over 4278829.98 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:02:01,793 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:02:09,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2031366.0, ans=0.1 2023-06-25 17:02:15,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2031426.0, ans=0.125 2023-06-25 17:02:25,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.476e+02 8.628e+02 1.556e+03 2.270e+03 3.749e+03, threshold=3.112e+03, percent-clipped=16.0 2023-06-25 17:03:06,724 INFO [train.py:996] (0/4) Epoch 12, batch 3150, loss[loss=0.2553, simple_loss=0.3521, pruned_loss=0.07928, over 21662.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3171, pruned_loss=0.07949, over 4277336.53 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:03:07,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2031546.0, ans=0.1 2023-06-25 17:03:12,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2031546.0, ans=0.125 2023-06-25 17:03:37,112 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.62 vs. 
limit=10.0 2023-06-25 17:03:56,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2031666.0, ans=0.125 2023-06-25 17:04:42,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2031786.0, ans=0.125 2023-06-25 17:05:01,277 INFO [train.py:996] (0/4) Epoch 12, batch 3200, loss[loss=0.2314, simple_loss=0.3339, pruned_loss=0.06441, over 21227.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3161, pruned_loss=0.07919, over 4270593.58 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 17:05:20,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-25 17:05:35,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2031906.0, ans=0.95 2023-06-25 17:06:07,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-25 17:06:21,717 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.588e+02 9.417e+02 1.305e+03 2.002e+03 3.314e+03, threshold=2.610e+03, percent-clipped=4.0 2023-06-25 17:06:43,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2032146.0, ans=0.2 2023-06-25 17:06:45,612 INFO [train.py:996] (0/4) Epoch 12, batch 3250, loss[loss=0.2199, simple_loss=0.2857, pruned_loss=0.07705, over 21990.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3162, pruned_loss=0.08, over 4273785.19 frames. ], batch size: 103, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:07:17,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2032206.0, ans=0.0 2023-06-25 17:08:13,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2032386.0, ans=0.0 2023-06-25 17:08:28,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2032386.0, ans=0.125 2023-06-25 17:08:38,530 INFO [train.py:996] (0/4) Epoch 12, batch 3300, loss[loss=0.268, simple_loss=0.3463, pruned_loss=0.09487, over 21364.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3109, pruned_loss=0.07925, over 4264128.39 frames. ], batch size: 549, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:08:50,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2032446.0, ans=0.1 2023-06-25 17:08:50,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2032446.0, ans=0.125 2023-06-25 17:09:06,809 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.84 vs. 
limit=22.5 2023-06-25 17:09:21,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2032506.0, ans=0.2 2023-06-25 17:09:27,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2032566.0, ans=0.125 2023-06-25 17:09:42,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=12.0 2023-06-25 17:09:55,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2032626.0, ans=0.2 2023-06-25 17:09:56,060 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.590e+02 8.857e+02 1.384e+03 2.131e+03 4.581e+03, threshold=2.768e+03, percent-clipped=14.0 2023-06-25 17:10:25,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2032686.0, ans=0.2 2023-06-25 17:10:28,070 INFO [train.py:996] (0/4) Epoch 12, batch 3350, loss[loss=0.2293, simple_loss=0.3162, pruned_loss=0.07121, over 21729.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3128, pruned_loss=0.07917, over 4272688.09 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:10:43,736 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.66 vs. limit=15.0 2023-06-25 17:12:15,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.70 vs. limit=15.0 2023-06-25 17:12:17,392 INFO [train.py:996] (0/4) Epoch 12, batch 3400, loss[loss=0.2229, simple_loss=0.2988, pruned_loss=0.07346, over 21373.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3124, pruned_loss=0.0797, over 4280605.45 frames. ], batch size: 471, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:13:48,686 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.332e+02 9.099e+02 1.349e+03 1.796e+03 3.997e+03, threshold=2.698e+03, percent-clipped=5.0 2023-06-25 17:14:00,162 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=15.0 2023-06-25 17:14:13,576 INFO [train.py:996] (0/4) Epoch 12, batch 3450, loss[loss=0.22, simple_loss=0.288, pruned_loss=0.07597, over 21147.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3078, pruned_loss=0.07885, over 4279578.40 frames. ], batch size: 608, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:14:22,299 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. 
limit=22.5 2023-06-25 17:14:51,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2033406.0, ans=0.125 2023-06-25 17:14:59,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2033406.0, ans=0.0 2023-06-25 17:15:47,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2033586.0, ans=0.1 2023-06-25 17:15:54,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2033586.0, ans=0.2 2023-06-25 17:16:10,328 INFO [train.py:996] (0/4) Epoch 12, batch 3500, loss[loss=0.2729, simple_loss=0.3472, pruned_loss=0.09928, over 21775.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3185, pruned_loss=0.08334, over 4281783.30 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:16:53,581 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:16:53,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2033706.0, ans=0.125 2023-06-25 17:17:04,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=2033766.0, ans=0.95 2023-06-25 17:17:32,086 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.249e+02 1.085e+03 1.477e+03 2.105e+03 4.608e+03, threshold=2.953e+03, percent-clipped=10.0 2023-06-25 17:17:36,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=2033826.0, ans=0.5 2023-06-25 17:17:53,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2033886.0, ans=0.1 2023-06-25 17:18:17,200 INFO [train.py:996] (0/4) Epoch 12, batch 3550, loss[loss=0.276, simple_loss=0.3171, pruned_loss=0.1174, over 21295.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3204, pruned_loss=0.08467, over 4274040.94 frames. ], batch size: 507, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:18:20,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2033946.0, ans=0.1 2023-06-25 17:18:25,439 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-25 17:19:21,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2034126.0, ans=0.1 2023-06-25 17:19:28,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2034126.0, ans=0.0 2023-06-25 17:20:12,184 INFO [train.py:996] (0/4) Epoch 12, batch 3600, loss[loss=0.2619, simple_loss=0.3098, pruned_loss=0.107, over 21223.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3148, pruned_loss=0.08369, over 4266550.93 frames. 
], batch size: 471, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:20:39,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2034306.0, ans=0.0 2023-06-25 17:20:41,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2034306.0, ans=0.0 2023-06-25 17:20:49,094 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2023-06-25 17:21:00,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2034366.0, ans=0.1 2023-06-25 17:21:27,851 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.714e+02 9.639e+02 1.612e+03 2.381e+03 4.879e+03, threshold=3.225e+03, percent-clipped=14.0 2023-06-25 17:21:58,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2034486.0, ans=0.125 2023-06-25 17:21:59,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2034486.0, ans=0.2 2023-06-25 17:22:02,917 INFO [train.py:996] (0/4) Epoch 12, batch 3650, loss[loss=0.2304, simple_loss=0.2922, pruned_loss=0.08424, over 21504.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3136, pruned_loss=0.08352, over 4268238.02 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:22:27,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2034606.0, ans=0.2 2023-06-25 17:23:18,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2034726.0, ans=10.0 2023-06-25 17:23:51,421 INFO [train.py:996] (0/4) Epoch 12, batch 3700, loss[loss=0.2324, simple_loss=0.3168, pruned_loss=0.074, over 21625.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3128, pruned_loss=0.08278, over 4270762.74 frames. ], batch size: 230, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:24:07,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2034906.0, ans=0.95 2023-06-25 17:24:10,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=2034906.0, ans=10.0 2023-06-25 17:24:15,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2034906.0, ans=0.0 2023-06-25 17:25:05,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.252e+02 8.433e+02 1.159e+03 1.645e+03 2.871e+03, threshold=2.319e+03, percent-clipped=0.0 2023-06-25 17:25:31,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2035086.0, ans=0.0 2023-06-25 17:25:39,738 INFO [train.py:996] (0/4) Epoch 12, batch 3750, loss[loss=0.1958, simple_loss=0.2759, pruned_loss=0.05781, over 21755.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3118, pruned_loss=0.08223, over 4278899.90 frames. 
], batch size: 298, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:25:58,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2035206.0, ans=0.125 2023-06-25 17:27:11,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2035326.0, ans=0.125 2023-06-25 17:27:30,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2035446.0, ans=0.125 2023-06-25 17:27:31,421 INFO [train.py:996] (0/4) Epoch 12, batch 3800, loss[loss=0.2148, simple_loss=0.2892, pruned_loss=0.07024, over 21630.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3097, pruned_loss=0.08068, over 4279350.80 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:27:39,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2035446.0, ans=0.125 2023-06-25 17:28:19,765 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.55 vs. limit=10.0 2023-06-25 17:29:02,410 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.463e+02 9.532e+02 1.366e+03 2.202e+03 4.372e+03, threshold=2.732e+03, percent-clipped=24.0 2023-06-25 17:29:07,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2035686.0, ans=0.125 2023-06-25 17:29:17,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.64 vs. limit=15.0 2023-06-25 17:29:24,677 INFO [train.py:996] (0/4) Epoch 12, batch 3850, loss[loss=0.1874, simple_loss=0.2503, pruned_loss=0.0623, over 21320.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.308, pruned_loss=0.08068, over 4277761.71 frames. ], batch size: 144, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:29:32,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2035746.0, ans=0.125 2023-06-25 17:29:39,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2035746.0, ans=0.0 2023-06-25 17:29:43,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0 2023-06-25 17:30:59,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2035986.0, ans=0.125 2023-06-25 17:31:15,973 INFO [train.py:996] (0/4) Epoch 12, batch 3900, loss[loss=0.2212, simple_loss=0.293, pruned_loss=0.07466, over 21646.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3057, pruned_loss=0.08096, over 4277365.11 frames. 
], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:31:16,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2036046.0, ans=0.1 2023-06-25 17:31:32,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2036106.0, ans=0.035 2023-06-25 17:31:44,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2036106.0, ans=0.125 2023-06-25 17:32:38,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.93 vs. limit=15.0 2023-06-25 17:32:41,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.422e+02 7.358e+02 1.058e+03 1.629e+03 3.913e+03, threshold=2.115e+03, percent-clipped=2.0 2023-06-25 17:32:58,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2036286.0, ans=0.0 2023-06-25 17:33:04,660 INFO [train.py:996] (0/4) Epoch 12, batch 3950, loss[loss=0.1722, simple_loss=0.248, pruned_loss=0.04819, over 21160.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3066, pruned_loss=0.08023, over 4280988.10 frames. ], batch size: 143, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:33:05,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2036346.0, ans=0.125 2023-06-25 17:33:40,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2036406.0, ans=0.125 2023-06-25 17:34:56,637 INFO [train.py:996] (0/4) Epoch 12, batch 4000, loss[loss=0.1826, simple_loss=0.2465, pruned_loss=0.05938, over 21292.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3, pruned_loss=0.07701, over 4280776.56 frames. 
], batch size: 160, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 17:35:38,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2036766.0, ans=0.125 2023-06-25 17:35:43,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2036766.0, ans=0.07 2023-06-25 17:35:53,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2036766.0, ans=0.125 2023-06-25 17:36:21,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2036826.0, ans=0.125 2023-06-25 17:36:22,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2036826.0, ans=0.125 2023-06-25 17:36:26,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2036826.0, ans=0.0 2023-06-25 17:36:31,383 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.824e+02 9.183e+02 1.353e+03 2.384e+03 4.707e+03, threshold=2.707e+03, percent-clipped=29.0 2023-06-25 17:36:35,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2036886.0, ans=0.0 2023-06-25 17:36:50,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2036946.0, ans=0.0 2023-06-25 17:36:51,632 INFO [train.py:996] (0/4) Epoch 12, batch 4050, loss[loss=0.2075, simple_loss=0.2933, pruned_loss=0.06086, over 21819.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3018, pruned_loss=0.07588, over 4287793.35 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:37:15,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=22.5 2023-06-25 17:37:47,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2037066.0, ans=0.0 2023-06-25 17:38:34,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-25 17:38:44,044 INFO [train.py:996] (0/4) Epoch 12, batch 4100, loss[loss=0.2266, simple_loss=0.3054, pruned_loss=0.07386, over 21819.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3031, pruned_loss=0.07615, over 4294839.00 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:39:16,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2037306.0, ans=0.125 2023-06-25 17:39:22,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-25 17:39:56,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2037366.0, ans=0.0 2023-06-25 17:40:04,411 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.88 vs. 
limit=15.0 2023-06-25 17:40:14,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2037426.0, ans=0.0 2023-06-25 17:40:15,663 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.136e+02 8.322e+02 1.179e+03 1.731e+03 4.243e+03, threshold=2.358e+03, percent-clipped=9.0 2023-06-25 17:40:37,189 INFO [train.py:996] (0/4) Epoch 12, batch 4150, loss[loss=0.1725, simple_loss=0.2632, pruned_loss=0.04095, over 21611.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3027, pruned_loss=0.07323, over 4296001.02 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:41:12,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2037606.0, ans=0.0 2023-06-25 17:41:14,494 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-25 17:42:14,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2037786.0, ans=0.125 2023-06-25 17:42:39,768 INFO [train.py:996] (0/4) Epoch 12, batch 4200, loss[loss=0.1892, simple_loss=0.2729, pruned_loss=0.05279, over 21505.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3063, pruned_loss=0.07415, over 4286366.59 frames. ], batch size: 195, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:43:24,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2037906.0, ans=0.1 2023-06-25 17:43:39,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2037966.0, ans=0.125 2023-06-25 17:43:45,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2038026.0, ans=0.0 2023-06-25 17:43:58,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.587e+02 1.298e+03 1.935e+03 2.585e+03 6.035e+03, threshold=3.870e+03, percent-clipped=37.0 2023-06-25 17:44:26,800 INFO [train.py:996] (0/4) Epoch 12, batch 4250, loss[loss=0.2548, simple_loss=0.3449, pruned_loss=0.0824, over 21587.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3108, pruned_loss=0.07522, over 4286956.26 frames. 
], batch size: 414, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:44:47,569 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:45:08,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2038206.0, ans=0.125 2023-06-25 17:45:32,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2038326.0, ans=0.1 2023-06-25 17:45:44,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2038326.0, ans=0.2 2023-06-25 17:45:45,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2038326.0, ans=0.95 2023-06-25 17:45:53,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=2038326.0, ans=22.5 2023-06-25 17:46:02,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2038386.0, ans=0.125 2023-06-25 17:46:20,950 INFO [train.py:996] (0/4) Epoch 12, batch 4300, loss[loss=0.23, simple_loss=0.3208, pruned_loss=0.0696, over 21227.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3152, pruned_loss=0.0766, over 4287033.68 frames. ], batch size: 549, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:46:34,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=2038446.0, ans=0.05 2023-06-25 17:47:33,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-25 17:47:38,153 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:47:48,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2038626.0, ans=0.125 2023-06-25 17:47:49,728 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.415e+02 1.013e+03 1.560e+03 2.387e+03 5.571e+03, threshold=3.121e+03, percent-clipped=6.0 2023-06-25 17:47:52,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2038686.0, ans=0.0 2023-06-25 17:48:07,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=2038686.0, ans=0.1 2023-06-25 17:48:21,412 INFO [train.py:996] (0/4) Epoch 12, batch 4350, loss[loss=0.2052, simple_loss=0.2676, pruned_loss=0.07141, over 21460.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3152, pruned_loss=0.07575, over 4280855.87 frames. 
], batch size: 230, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:48:32,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2038746.0, ans=0.125 2023-06-25 17:48:41,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2038746.0, ans=0.0 2023-06-25 17:48:41,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2038746.0, ans=0.125 2023-06-25 17:48:41,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2038746.0, ans=0.0 2023-06-25 17:49:08,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2038866.0, ans=0.125 2023-06-25 17:49:27,341 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.17 vs. limit=10.0 2023-06-25 17:50:21,661 INFO [train.py:996] (0/4) Epoch 12, batch 4400, loss[loss=0.2001, simple_loss=0.2871, pruned_loss=0.05658, over 21270.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3108, pruned_loss=0.07514, over 4284967.40 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 17:51:15,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2039166.0, ans=0.125 2023-06-25 17:51:49,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.648e+02 9.673e+02 1.616e+03 2.348e+03 4.188e+03, threshold=3.231e+03, percent-clipped=7.0 2023-06-25 17:52:17,333 INFO [train.py:996] (0/4) Epoch 12, batch 4450, loss[loss=0.335, simple_loss=0.4122, pruned_loss=0.1289, over 21513.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3196, pruned_loss=0.07807, over 4283838.37 frames. ], batch size: 507, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:54:08,249 INFO [train.py:996] (0/4) Epoch 12, batch 4500, loss[loss=0.2275, simple_loss=0.295, pruned_loss=0.07996, over 21282.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.32, pruned_loss=0.07979, over 4289490.54 frames. ], batch size: 143, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:54:32,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-25 17:55:41,741 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.020e+02 9.450e+02 1.326e+03 2.168e+03 4.753e+03, threshold=2.653e+03, percent-clipped=7.0 2023-06-25 17:55:44,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2039886.0, ans=0.0 2023-06-25 17:56:06,469 INFO [train.py:996] (0/4) Epoch 12, batch 4550, loss[loss=0.2682, simple_loss=0.3435, pruned_loss=0.09648, over 21786.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3216, pruned_loss=0.07971, over 4289500.65 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:56:21,070 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-340000.pt 2023-06-25 17:56:55,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. 
limit=15.0 2023-06-25 17:57:59,514 INFO [train.py:996] (0/4) Epoch 12, batch 4600, loss[loss=0.1959, simple_loss=0.2764, pruned_loss=0.05772, over 21756.00 frames. ], tot_loss[loss=0.243, simple_loss=0.323, pruned_loss=0.08152, over 4289695.24 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:58:18,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=15.0 2023-06-25 17:58:57,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2040366.0, ans=0.1 2023-06-25 17:59:34,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.831e+02 1.036e+03 1.391e+03 1.906e+03 3.939e+03, threshold=2.783e+03, percent-clipped=3.0 2023-06-25 17:59:37,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2040486.0, ans=0.0 2023-06-25 17:59:53,363 INFO [train.py:996] (0/4) Epoch 12, batch 4650, loss[loss=0.181, simple_loss=0.2594, pruned_loss=0.05125, over 21771.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3168, pruned_loss=0.07943, over 4288962.80 frames. ], batch size: 371, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:00:15,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2040606.0, ans=0.0 2023-06-25 18:01:16,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2040726.0, ans=0.125 2023-06-25 18:01:47,033 INFO [train.py:996] (0/4) Epoch 12, batch 4700, loss[loss=0.1973, simple_loss=0.2703, pruned_loss=0.06218, over 21803.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3089, pruned_loss=0.07799, over 4290429.06 frames. ], batch size: 118, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:02:21,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2040906.0, ans=0.0 2023-06-25 18:02:42,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2040966.0, ans=0.125 2023-06-25 18:03:19,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.793e+02 9.702e+02 1.366e+03 2.356e+03 4.995e+03, threshold=2.732e+03, percent-clipped=18.0 2023-06-25 18:03:39,173 INFO [train.py:996] (0/4) Epoch 12, batch 4750, loss[loss=0.1952, simple_loss=0.2612, pruned_loss=0.06461, over 21674.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3067, pruned_loss=0.07864, over 4278889.64 frames. ], batch size: 264, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:04:08,452 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-25 18:04:20,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=22.5 2023-06-25 18:04:30,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2041266.0, ans=0.125 2023-06-25 18:04:40,978 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. 
limit=22.5 2023-06-25 18:04:41,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.40 vs. limit=15.0 2023-06-25 18:05:26,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2041386.0, ans=0.0 2023-06-25 18:05:28,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2041446.0, ans=0.0 2023-06-25 18:05:29,027 INFO [train.py:996] (0/4) Epoch 12, batch 4800, loss[loss=0.1778, simple_loss=0.2416, pruned_loss=0.05699, over 21579.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3053, pruned_loss=0.07872, over 4278959.20 frames. ], batch size: 213, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 18:05:36,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2041446.0, ans=0.125 2023-06-25 18:06:52,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2041626.0, ans=0.125 2023-06-25 18:06:56,311 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.862e+02 9.240e+02 1.237e+03 1.888e+03 3.806e+03, threshold=2.475e+03, percent-clipped=7.0 2023-06-25 18:07:13,715 INFO [train.py:996] (0/4) Epoch 12, batch 4850, loss[loss=0.272, simple_loss=0.3339, pruned_loss=0.105, over 21717.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3036, pruned_loss=0.07785, over 4283846.38 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:07:29,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.40 vs. limit=15.0 2023-06-25 18:07:40,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2041806.0, ans=0.125 2023-06-25 18:08:21,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2041866.0, ans=0.125 2023-06-25 18:08:46,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2041926.0, ans=0.05 2023-06-25 18:09:06,959 INFO [train.py:996] (0/4) Epoch 12, batch 4900, loss[loss=0.2494, simple_loss=0.324, pruned_loss=0.08737, over 21329.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3044, pruned_loss=0.07823, over 4285224.13 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:09:24,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2042046.0, ans=0.0 2023-06-25 18:09:27,031 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=15.0 2023-06-25 18:09:57,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. 
limit=10.0 2023-06-25 18:10:36,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.253e+02 9.394e+02 1.371e+03 2.232e+03 4.474e+03, threshold=2.741e+03, percent-clipped=21.0 2023-06-25 18:10:41,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2042286.0, ans=0.125 2023-06-25 18:10:55,749 INFO [train.py:996] (0/4) Epoch 12, batch 4950, loss[loss=0.196, simple_loss=0.3116, pruned_loss=0.04025, over 21167.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3075, pruned_loss=0.0768, over 4282753.79 frames. ], batch size: 548, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:11:04,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2042346.0, ans=0.125 2023-06-25 18:11:18,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.96 vs. limit=10.0 2023-06-25 18:11:33,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2042406.0, ans=0.125 2023-06-25 18:12:17,029 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.61 vs. limit=10.0 2023-06-25 18:12:45,012 INFO [train.py:996] (0/4) Epoch 12, batch 5000, loss[loss=0.2224, simple_loss=0.2976, pruned_loss=0.07357, over 20146.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3053, pruned_loss=0.07326, over 4271329.75 frames. ], batch size: 703, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:13:28,577 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=22.5 2023-06-25 18:13:38,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2042766.0, ans=0.0 2023-06-25 18:13:38,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2042766.0, ans=0.2 2023-06-25 18:13:52,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2042826.0, ans=0.125 2023-06-25 18:14:08,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2042826.0, ans=0.125 2023-06-25 18:14:19,308 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.147e+02 9.354e+02 1.620e+03 2.183e+03 4.157e+03, threshold=3.240e+03, percent-clipped=13.0 2023-06-25 18:14:34,564 INFO [train.py:996] (0/4) Epoch 12, batch 5050, loss[loss=0.2782, simple_loss=0.3351, pruned_loss=0.1107, over 21716.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3059, pruned_loss=0.07484, over 4277272.09 frames. ], batch size: 473, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:14:45,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2042946.0, ans=0.125 2023-06-25 18:16:11,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-06-25 18:16:24,387 INFO [train.py:996] (0/4) Epoch 12, batch 5100, loss[loss=0.2184, simple_loss=0.2883, pruned_loss=0.07423, over 21805.00 frames. 
], tot_loss[loss=0.2271, simple_loss=0.3038, pruned_loss=0.07514, over 4287515.51 frames. ], batch size: 102, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:16:43,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2043246.0, ans=0.125 2023-06-25 18:16:56,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0 2023-06-25 18:17:04,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2043366.0, ans=0.125 2023-06-25 18:17:39,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2043426.0, ans=0.125 2023-06-25 18:18:00,376 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.858e+02 8.212e+02 1.178e+03 1.481e+03 2.753e+03, threshold=2.355e+03, percent-clipped=0.0 2023-06-25 18:18:11,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2043486.0, ans=0.1 2023-06-25 18:18:15,864 INFO [train.py:996] (0/4) Epoch 12, batch 5150, loss[loss=0.2194, simple_loss=0.297, pruned_loss=0.07095, over 17166.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3035, pruned_loss=0.07628, over 4289729.25 frames. ], batch size: 60, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:18:47,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2043606.0, ans=0.2 2023-06-25 18:18:51,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2043606.0, ans=0.125 2023-06-25 18:19:26,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2043726.0, ans=0.125 2023-06-25 18:19:27,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2043726.0, ans=0.0 2023-06-25 18:19:48,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2043726.0, ans=0.025 2023-06-25 18:20:04,568 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-25 18:20:10,116 INFO [train.py:996] (0/4) Epoch 12, batch 5200, loss[loss=0.2338, simple_loss=0.3318, pruned_loss=0.06789, over 21644.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3047, pruned_loss=0.07671, over 4288167.14 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:20:29,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2043846.0, ans=0.125 2023-06-25 18:21:13,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2043966.0, ans=0.0 2023-06-25 18:21:18,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.41 vs. 
limit=15.0 2023-06-25 18:21:44,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.871e+02 1.084e+03 1.828e+03 2.790e+03 5.969e+03, threshold=3.657e+03, percent-clipped=36.0 2023-06-25 18:22:00,090 INFO [train.py:996] (0/4) Epoch 12, batch 5250, loss[loss=0.2596, simple_loss=0.3411, pruned_loss=0.08908, over 21697.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3085, pruned_loss=0.07576, over 4290004.61 frames. ], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:22:14,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2044146.0, ans=0.1 2023-06-25 18:22:38,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2044206.0, ans=0.2 2023-06-25 18:23:04,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.94 vs. limit=15.0 2023-06-25 18:23:12,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2044326.0, ans=0.0 2023-06-25 18:23:39,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2044386.0, ans=0.0 2023-06-25 18:23:52,674 INFO [train.py:996] (0/4) Epoch 12, batch 5300, loss[loss=0.2628, simple_loss=0.3171, pruned_loss=0.1043, over 21796.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3076, pruned_loss=0.07676, over 4298231.32 frames. ], batch size: 508, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:24:05,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-25 18:24:17,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2044506.0, ans=0.035 2023-06-25 18:24:19,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-25 18:24:26,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2044506.0, ans=0.1 2023-06-25 18:24:42,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2044566.0, ans=0.125 2023-06-25 18:24:50,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2044566.0, ans=0.125 2023-06-25 18:24:59,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2044626.0, ans=0.04949747468305833 2023-06-25 18:24:59,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2044626.0, ans=0.1 2023-06-25 18:25:20,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.228e+02 8.444e+02 1.453e+03 2.194e+03 4.490e+03, threshold=2.906e+03, percent-clipped=2.0 2023-06-25 18:25:39,052 INFO [train.py:996] (0/4) Epoch 12, batch 5350, loss[loss=0.2606, simple_loss=0.3313, pruned_loss=0.0949, over 21722.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3074, pruned_loss=0.07905, over 4308318.42 frames. 
], batch size: 112, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:25:49,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2044746.0, ans=0.0 2023-06-25 18:26:47,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-25 18:26:57,714 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:26:57,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2044926.0, ans=0.0 2023-06-25 18:27:05,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-25 18:27:09,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2044986.0, ans=0.125 2023-06-25 18:27:28,473 INFO [train.py:996] (0/4) Epoch 12, batch 5400, loss[loss=0.2257, simple_loss=0.2934, pruned_loss=0.07895, over 21875.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3052, pruned_loss=0.07999, over 4310846.76 frames. ], batch size: 332, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:27:58,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2045106.0, ans=0.0 2023-06-25 18:28:08,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2045106.0, ans=0.125 2023-06-25 18:28:49,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2045226.0, ans=0.125 2023-06-25 18:28:56,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2045286.0, ans=0.1 2023-06-25 18:29:03,890 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.530e+02 9.524e+02 1.484e+03 2.155e+03 3.165e+03, threshold=2.968e+03, percent-clipped=5.0 2023-06-25 18:29:17,838 INFO [train.py:996] (0/4) Epoch 12, batch 5450, loss[loss=0.1986, simple_loss=0.2903, pruned_loss=0.0534, over 21796.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3069, pruned_loss=0.07853, over 4307182.01 frames. ], batch size: 124, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:29:32,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2045346.0, ans=0.125 2023-06-25 18:29:35,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2045346.0, ans=0.0 2023-06-25 18:30:27,556 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-25 18:31:07,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-25 18:31:20,615 INFO [train.py:996] (0/4) Epoch 12, batch 5500, loss[loss=0.186, simple_loss=0.2771, pruned_loss=0.04747, over 21422.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3118, pruned_loss=0.07477, over 4301904.41 frames. 
], batch size: 211, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:32:54,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.103e+02 8.816e+02 1.255e+03 2.207e+03 4.493e+03, threshold=2.511e+03, percent-clipped=9.0 2023-06-25 18:33:15,533 INFO [train.py:996] (0/4) Epoch 12, batch 5550, loss[loss=0.2048, simple_loss=0.3181, pruned_loss=0.04578, over 21163.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3136, pruned_loss=0.07236, over 4296428.28 frames. ], batch size: 548, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:33:57,859 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=15.0 2023-06-25 18:34:49,424 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:35:15,797 INFO [train.py:996] (0/4) Epoch 12, batch 5600, loss[loss=0.2711, simple_loss=0.3658, pruned_loss=0.08823, over 21774.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3115, pruned_loss=0.06994, over 4290851.72 frames. ], batch size: 351, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:36:42,944 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=12.0 2023-06-25 18:36:43,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.414e+02 9.096e+02 1.468e+03 2.250e+03 5.132e+03, threshold=2.936e+03, percent-clipped=21.0 2023-06-25 18:36:47,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2046486.0, ans=0.2 2023-06-25 18:36:52,178 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-06-25 18:36:59,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2046486.0, ans=0.1 2023-06-25 18:37:00,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2046486.0, ans=0.125 2023-06-25 18:37:03,773 INFO [train.py:996] (0/4) Epoch 12, batch 5650, loss[loss=0.2655, simple_loss=0.3346, pruned_loss=0.09822, over 21721.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3154, pruned_loss=0.07254, over 4288335.95 frames. ], batch size: 389, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:37:39,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-25 18:37:53,218 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-25 18:38:23,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2046726.0, ans=0.0 2023-06-25 18:38:53,833 INFO [train.py:996] (0/4) Epoch 12, batch 5700, loss[loss=0.2584, simple_loss=0.3443, pruned_loss=0.08626, over 21499.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3139, pruned_loss=0.07443, over 4292569.05 frames. 
], batch size: 508, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:39:03,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2046846.0, ans=0.2 2023-06-25 18:39:27,420 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.67 vs. limit=15.0 2023-06-25 18:40:20,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2047026.0, ans=0.0 2023-06-25 18:40:34,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.374e+02 7.754e+02 1.090e+03 1.653e+03 4.716e+03, threshold=2.180e+03, percent-clipped=6.0 2023-06-25 18:40:48,829 INFO [train.py:996] (0/4) Epoch 12, batch 5750, loss[loss=0.2518, simple_loss=0.3623, pruned_loss=0.0707, over 19779.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.313, pruned_loss=0.07139, over 4280999.12 frames. ], batch size: 702, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:41:37,238 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=15.0 2023-06-25 18:41:43,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2047266.0, ans=0.125 2023-06-25 18:42:45,737 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-25 18:42:46,113 INFO [train.py:996] (0/4) Epoch 12, batch 5800, loss[loss=0.2392, simple_loss=0.3373, pruned_loss=0.07057, over 21592.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3097, pruned_loss=0.06928, over 4274831.09 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:43:24,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2047506.0, ans=0.0 2023-06-25 18:43:41,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2047566.0, ans=0.125 2023-06-25 18:44:17,450 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.217e+02 1.046e+03 1.663e+03 2.546e+03 5.272e+03, threshold=3.326e+03, percent-clipped=34.0 2023-06-25 18:44:31,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2047686.0, ans=0.125 2023-06-25 18:44:42,294 INFO [train.py:996] (0/4) Epoch 12, batch 5850, loss[loss=0.2566, simple_loss=0.3502, pruned_loss=0.08149, over 20111.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3101, pruned_loss=0.06652, over 4263816.74 frames. 
], batch size: 702, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:44:58,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2047746.0, ans=0.125 2023-06-25 18:45:01,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2047746.0, ans=0.125 2023-06-25 18:45:13,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2047806.0, ans=0.0 2023-06-25 18:45:42,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2047866.0, ans=0.125 2023-06-25 18:45:53,174 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.32 vs. limit=10.0 2023-06-25 18:46:17,736 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:46:34,310 INFO [train.py:996] (0/4) Epoch 12, batch 5900, loss[loss=0.2039, simple_loss=0.2788, pruned_loss=0.0645, over 21306.00 frames. ], tot_loss[loss=0.212, simple_loss=0.3022, pruned_loss=0.06085, over 4263721.97 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:46:43,403 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:46:48,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2048046.0, ans=0.025 2023-06-25 18:46:58,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2048106.0, ans=0.0 2023-06-25 18:47:05,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2048106.0, ans=0.0 2023-06-25 18:47:05,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=2048106.0, ans=15.0 2023-06-25 18:48:00,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 8.415e+02 1.286e+03 1.722e+03 5.284e+03, threshold=2.571e+03, percent-clipped=3.0 2023-06-25 18:48:16,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2048286.0, ans=0.1 2023-06-25 18:48:16,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2048286.0, ans=0.1 2023-06-25 18:48:16,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2048286.0, ans=0.05 2023-06-25 18:48:22,677 INFO [train.py:996] (0/4) Epoch 12, batch 5950, loss[loss=0.2239, simple_loss=0.2888, pruned_loss=0.0795, over 21919.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.3005, pruned_loss=0.06363, over 4263285.69 frames. ], batch size: 373, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:49:02,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.85 vs. limit=10.0 2023-06-25 18:49:44,738 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.56 vs. 
limit=15.0 2023-06-25 18:50:16,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2048586.0, ans=0.0 2023-06-25 18:50:19,489 INFO [train.py:996] (0/4) Epoch 12, batch 6000, loss[loss=0.2245, simple_loss=0.282, pruned_loss=0.08352, over 21267.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2977, pruned_loss=0.06648, over 4257183.76 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 32.0 2023-06-25 18:50:19,502 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 18:50:35,678 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.2103, 2.6174, 2.5196, 2.3497], device='cuda:0') 2023-06-25 18:50:36,761 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2561, simple_loss=0.3516, pruned_loss=0.08031, over 1796401.00 frames. 2023-06-25 18:50:36,762 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 18:50:58,459 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=22.5 2023-06-25 18:51:02,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.79 vs. limit=15.0 2023-06-25 18:51:52,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2048826.0, ans=0.015 2023-06-25 18:52:13,649 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.382e+02 1.134e+03 1.580e+03 2.203e+03 4.281e+03, threshold=3.160e+03, percent-clipped=13.0 2023-06-25 18:52:17,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2048886.0, ans=0.125 2023-06-25 18:52:25,714 INFO [train.py:996] (0/4) Epoch 12, batch 6050, loss[loss=0.1922, simple_loss=0.2692, pruned_loss=0.05765, over 21607.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2929, pruned_loss=0.06847, over 4247203.25 frames. ], batch size: 415, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:52:38,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2048946.0, ans=0.0 2023-06-25 18:52:49,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2049006.0, ans=0.1 2023-06-25 18:54:12,133 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.15 vs. limit=6.0 2023-06-25 18:54:14,013 INFO [train.py:996] (0/4) Epoch 12, batch 6100, loss[loss=0.2009, simple_loss=0.2844, pruned_loss=0.05866, over 21871.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2919, pruned_loss=0.06722, over 4260240.83 frames. 
], batch size: 332, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:54:40,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2049306.0, ans=0.0 2023-06-25 18:55:44,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2049486.0, ans=0.125 2023-06-25 18:55:51,222 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.240e+02 1.094e+03 1.641e+03 2.565e+03 7.490e+03, threshold=3.281e+03, percent-clipped=16.0 2023-06-25 18:56:00,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2049486.0, ans=0.1 2023-06-25 18:56:03,317 INFO [train.py:996] (0/4) Epoch 12, batch 6150, loss[loss=0.2244, simple_loss=0.2909, pruned_loss=0.07892, over 21261.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2936, pruned_loss=0.06984, over 4265769.48 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:56:24,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2049606.0, ans=0.0 2023-06-25 18:56:31,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2049606.0, ans=0.125 2023-06-25 18:56:34,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2049606.0, ans=0.07 2023-06-25 18:56:57,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2049666.0, ans=0.0 2023-06-25 18:57:11,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.88 vs. limit=15.0 2023-06-25 18:57:17,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2049726.0, ans=0.0 2023-06-25 18:57:56,203 INFO [train.py:996] (0/4) Epoch 12, batch 6200, loss[loss=0.3536, simple_loss=0.4256, pruned_loss=0.1408, over 21652.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2968, pruned_loss=0.0706, over 4265754.38 frames. ], batch size: 509, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:58:03,656 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.58 vs. limit=15.0 2023-06-25 18:59:12,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.82 vs. limit=15.0 2023-06-25 18:59:26,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-25 18:59:28,921 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 1.039e+03 1.350e+03 2.072e+03 4.321e+03, threshold=2.700e+03, percent-clipped=2.0 2023-06-25 18:59:45,880 INFO [train.py:996] (0/4) Epoch 12, batch 6250, loss[loss=0.2272, simple_loss=0.3371, pruned_loss=0.05862, over 21674.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3034, pruned_loss=0.07086, over 4274791.09 frames. 
], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:00:10,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2050206.0, ans=0.0 2023-06-25 19:01:34,300 INFO [train.py:996] (0/4) Epoch 12, batch 6300, loss[loss=0.212, simple_loss=0.3359, pruned_loss=0.04411, over 20784.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3057, pruned_loss=0.0693, over 4270439.79 frames. ], batch size: 608, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:01:40,482 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-06-25 19:01:40,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-25 19:01:48,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2050446.0, ans=0.125 2023-06-25 19:03:03,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-25 19:03:09,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2050686.0, ans=0.125 2023-06-25 19:03:12,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.168e+02 9.742e+02 1.547e+03 2.074e+03 3.988e+03, threshold=3.093e+03, percent-clipped=9.0 2023-06-25 19:03:19,708 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-25 19:03:22,592 INFO [train.py:996] (0/4) Epoch 12, batch 6350, loss[loss=0.2671, simple_loss=0.3429, pruned_loss=0.09563, over 21423.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3081, pruned_loss=0.07279, over 4272320.71 frames. ], batch size: 131, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:03:40,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2050746.0, ans=0.125 2023-06-25 19:04:04,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-25 19:04:49,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2050926.0, ans=0.1 2023-06-25 19:05:02,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2050986.0, ans=0.125 2023-06-25 19:05:15,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.79 vs. limit=22.5 2023-06-25 19:05:15,856 INFO [train.py:996] (0/4) Epoch 12, batch 6400, loss[loss=0.2892, simple_loss=0.355, pruned_loss=0.1117, over 21458.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3145, pruned_loss=0.07772, over 4271449.85 frames. 
], batch size: 211, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:06:54,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.078e+02 9.080e+02 1.264e+03 1.614e+03 4.055e+03, threshold=2.529e+03, percent-clipped=6.0 2023-06-25 19:07:09,042 INFO [train.py:996] (0/4) Epoch 12, batch 6450, loss[loss=0.2166, simple_loss=0.3017, pruned_loss=0.06577, over 21827.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3152, pruned_loss=0.0767, over 4275870.91 frames. ], batch size: 102, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:07:25,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2051346.0, ans=0.5 2023-06-25 19:08:05,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2051466.0, ans=0.125 2023-06-25 19:08:11,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2051466.0, ans=0.0 2023-06-25 19:08:20,271 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=15.0 2023-06-25 19:08:35,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2051586.0, ans=0.125 2023-06-25 19:08:59,164 INFO [train.py:996] (0/4) Epoch 12, batch 6500, loss[loss=0.2632, simple_loss=0.3161, pruned_loss=0.1051, over 21836.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3081, pruned_loss=0.07615, over 4279996.09 frames. ], batch size: 98, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:09:46,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-25 19:09:52,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2051766.0, ans=0.2 2023-06-25 19:10:00,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2051766.0, ans=0.1 2023-06-25 19:10:05,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2051826.0, ans=0.0 2023-06-25 19:10:35,311 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.489e+02 7.750e+02 1.055e+03 1.738e+03 4.061e+03, threshold=2.109e+03, percent-clipped=10.0 2023-06-25 19:10:52,971 INFO [train.py:996] (0/4) Epoch 12, batch 6550, loss[loss=0.214, simple_loss=0.2899, pruned_loss=0.0691, over 21733.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3054, pruned_loss=0.07464, over 4279631.62 frames. 
], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:10:56,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2051946.0, ans=0.125 2023-06-25 19:11:40,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2052066.0, ans=0.0 2023-06-25 19:11:42,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2052066.0, ans=0.025 2023-06-25 19:11:48,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2052066.0, ans=0.025 2023-06-25 19:12:09,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2052126.0, ans=0.125 2023-06-25 19:12:41,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2052246.0, ans=0.125 2023-06-25 19:12:42,077 INFO [train.py:996] (0/4) Epoch 12, batch 6600, loss[loss=0.2256, simple_loss=0.2849, pruned_loss=0.0832, over 21745.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3004, pruned_loss=0.07452, over 4280686.33 frames. ], batch size: 300, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:14:22,500 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.983e+02 6.750e+02 1.048e+03 1.422e+03 4.566e+03, threshold=2.096e+03, percent-clipped=9.0 2023-06-25 19:14:36,291 INFO [train.py:996] (0/4) Epoch 12, batch 6650, loss[loss=0.1878, simple_loss=0.2638, pruned_loss=0.05585, over 21671.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.295, pruned_loss=0.07222, over 4270709.21 frames. ], batch size: 298, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:14:49,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2052546.0, ans=0.0 2023-06-25 19:14:58,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=22.5 2023-06-25 19:15:33,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2052726.0, ans=0.125 2023-06-25 19:15:33,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2052726.0, ans=0.09899494936611666 2023-06-25 19:15:49,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-25 19:16:05,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2052786.0, ans=0.2 2023-06-25 19:16:24,126 INFO [train.py:996] (0/4) Epoch 12, batch 6700, loss[loss=0.2398, simple_loss=0.3069, pruned_loss=0.0864, over 21554.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2917, pruned_loss=0.07203, over 4268425.01 frames. 
], batch size: 442, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:16:24,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2052846.0, ans=0.0 2023-06-25 19:17:08,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2052966.0, ans=0.125 2023-06-25 19:17:59,537 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.379e+02 9.118e+02 1.454e+03 2.169e+03 5.767e+03, threshold=2.907e+03, percent-clipped=27.0 2023-06-25 19:18:13,648 INFO [train.py:996] (0/4) Epoch 12, batch 6750, loss[loss=0.2133, simple_loss=0.2807, pruned_loss=0.07293, over 21455.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2908, pruned_loss=0.07279, over 4274897.69 frames. ], batch size: 212, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:18:17,734 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-25 19:18:21,432 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0 2023-06-25 19:18:30,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2053206.0, ans=0.2 2023-06-25 19:19:00,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2053266.0, ans=0.1 2023-06-25 19:19:00,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2053266.0, ans=0.125 2023-06-25 19:19:15,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=2053326.0, ans=0.05 2023-06-25 19:19:15,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2053326.0, ans=0.2 2023-06-25 19:19:29,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2053326.0, ans=0.125 2023-06-25 19:19:29,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2053326.0, ans=0.0 2023-06-25 19:19:57,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-25 19:20:02,662 INFO [train.py:996] (0/4) Epoch 12, batch 6800, loss[loss=0.2154, simple_loss=0.284, pruned_loss=0.07341, over 21474.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2941, pruned_loss=0.07463, over 4276667.51 frames. ], batch size: 212, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:20:10,453 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-25 19:20:45,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2053566.0, ans=0.025 2023-06-25 19:21:29,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.142e+02 1.053e+03 1.480e+03 2.142e+03 3.474e+03, threshold=2.960e+03, percent-clipped=10.0 2023-06-25 19:21:42,862 INFO [train.py:996] (0/4) Epoch 12, batch 6850, loss[loss=0.2663, simple_loss=0.3126, pruned_loss=0.11, over 21722.00 frames. 
], tot_loss[loss=0.2218, simple_loss=0.2918, pruned_loss=0.07591, over 4272403.99 frames. ], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:22:07,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2053806.0, ans=0.035 2023-06-25 19:22:11,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2053806.0, ans=0.125 2023-06-25 19:23:06,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2053926.0, ans=0.04949747468305833 2023-06-25 19:23:27,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2053986.0, ans=0.2 2023-06-25 19:23:39,668 INFO [train.py:996] (0/4) Epoch 12, batch 6900, loss[loss=0.2048, simple_loss=0.2755, pruned_loss=0.06703, over 21340.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2921, pruned_loss=0.07606, over 4280141.20 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:24:04,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2054106.0, ans=0.125 2023-06-25 19:25:20,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.318e+02 8.019e+02 1.199e+03 1.523e+03 3.693e+03, threshold=2.398e+03, percent-clipped=1.0 2023-06-25 19:25:21,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2054286.0, ans=0.035 2023-06-25 19:25:28,994 INFO [train.py:996] (0/4) Epoch 12, batch 6950, loss[loss=0.2726, simple_loss=0.3425, pruned_loss=0.1014, over 21434.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2942, pruned_loss=0.07268, over 4281202.74 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:26:05,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2054406.0, ans=0.2 2023-06-25 19:26:57,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2054526.0, ans=0.125 2023-06-25 19:27:17,512 INFO [train.py:996] (0/4) Epoch 12, batch 7000, loss[loss=0.2492, simple_loss=0.3127, pruned_loss=0.09287, over 21725.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2971, pruned_loss=0.0755, over 4282095.49 frames. ], batch size: 351, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:27:40,765 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-25 19:27:41,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-06-25 19:27:46,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2054706.0, ans=0.125 2023-06-25 19:28:56,688 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.600e+02 9.157e+02 1.451e+03 1.908e+03 5.399e+03, threshold=2.901e+03, percent-clipped=14.0 2023-06-25 19:29:05,349 INFO [train.py:996] (0/4) Epoch 12, batch 7050, loss[loss=0.1987, simple_loss=0.2852, pruned_loss=0.05608, over 21748.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2935, pruned_loss=0.07336, over 4278135.45 frames. 
], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:29:27,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2055006.0, ans=0.2 2023-06-25 19:29:36,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2055006.0, ans=0.1 2023-06-25 19:30:58,066 INFO [train.py:996] (0/4) Epoch 12, batch 7100, loss[loss=0.163, simple_loss=0.241, pruned_loss=0.04248, over 16348.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3012, pruned_loss=0.07637, over 4271429.15 frames. ], batch size: 61, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:31:15,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2055306.0, ans=0.0 2023-06-25 19:31:22,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2055306.0, ans=0.0 2023-06-25 19:31:43,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2055366.0, ans=0.125 2023-06-25 19:31:49,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2055366.0, ans=0.0 2023-06-25 19:32:05,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2055426.0, ans=0.125 2023-06-25 19:32:13,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0 2023-06-25 19:32:23,677 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-25 19:32:32,643 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.001e+02 9.554e+02 1.238e+03 1.810e+03 4.265e+03, threshold=2.476e+03, percent-clipped=5.0 2023-06-25 19:32:44,678 INFO [train.py:996] (0/4) Epoch 12, batch 7150, loss[loss=0.3049, simple_loss=0.3666, pruned_loss=0.1216, over 21381.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2969, pruned_loss=0.07299, over 4262971.88 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:33:57,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2055726.0, ans=0.025 2023-06-25 19:34:31,904 INFO [train.py:996] (0/4) Epoch 12, batch 7200, loss[loss=0.2384, simple_loss=0.3157, pruned_loss=0.08054, over 21875.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2995, pruned_loss=0.07519, over 4263135.05 frames. ], batch size: 107, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:34:44,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2055846.0, ans=0.035 2023-06-25 19:36:05,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2056086.0, ans=0.125 2023-06-25 19:36:17,045 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.460e+02 1.008e+03 1.554e+03 2.511e+03 5.348e+03, threshold=3.107e+03, percent-clipped=25.0 2023-06-25 19:36:21,721 INFO [train.py:996] (0/4) Epoch 12, batch 7250, loss[loss=0.2266, simple_loss=0.2856, pruned_loss=0.0838, over 21488.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2966, pruned_loss=0.07572, over 4267585.49 frames. 
], batch size: 441, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:36:29,783 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.36 vs. limit=10.0 2023-06-25 19:37:56,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2056386.0, ans=0.0 2023-06-25 19:38:09,785 INFO [train.py:996] (0/4) Epoch 12, batch 7300, loss[loss=0.2448, simple_loss=0.2901, pruned_loss=0.09975, over 21305.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2907, pruned_loss=0.07544, over 4265143.31 frames. ], batch size: 473, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:38:45,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2056506.0, ans=0.0 2023-06-25 19:39:02,633 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:39:15,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2056626.0, ans=0.125 2023-06-25 19:39:19,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2056626.0, ans=0.0 2023-06-25 19:39:46,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2056686.0, ans=0.2 2023-06-25 19:39:54,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.105e+02 8.292e+02 1.311e+03 1.863e+03 3.675e+03, threshold=2.622e+03, percent-clipped=4.0 2023-06-25 19:39:58,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2056746.0, ans=0.0 2023-06-25 19:39:59,544 INFO [train.py:996] (0/4) Epoch 12, batch 7350, loss[loss=0.2311, simple_loss=0.2997, pruned_loss=0.08128, over 21790.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2888, pruned_loss=0.07617, over 4260288.94 frames. ], batch size: 247, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:40:47,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2056806.0, ans=0.125 2023-06-25 19:40:48,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.45 vs. limit=22.5 2023-06-25 19:40:54,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2056866.0, ans=0.125 2023-06-25 19:41:21,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2056926.0, ans=0.125 2023-06-25 19:41:25,137 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:42:01,535 INFO [train.py:996] (0/4) Epoch 12, batch 7400, loss[loss=0.2035, simple_loss=0.2963, pruned_loss=0.05541, over 21732.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2964, pruned_loss=0.07763, over 4261702.52 frames. ], batch size: 332, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:42:59,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=2057166.0, ans=22.5 2023-06-25 19:42:59,784 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.44 vs. 
limit=22.5 2023-06-25 19:43:43,652 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 8.354e+02 1.413e+03 2.372e+03 4.608e+03, threshold=2.826e+03, percent-clipped=17.0 2023-06-25 19:43:45,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2057286.0, ans=0.125 2023-06-25 19:43:48,739 INFO [train.py:996] (0/4) Epoch 12, batch 7450, loss[loss=0.1947, simple_loss=0.2684, pruned_loss=0.06044, over 21656.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2954, pruned_loss=0.07487, over 4262014.91 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:44:40,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2057466.0, ans=0.1 2023-06-25 19:45:39,199 INFO [train.py:996] (0/4) Epoch 12, batch 7500, loss[loss=0.2902, simple_loss=0.3901, pruned_loss=0.09518, over 21753.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3013, pruned_loss=0.07688, over 4266987.24 frames. ], batch size: 351, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:47:23,851 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.023e+02 9.552e+02 1.495e+03 2.068e+03 4.249e+03, threshold=2.990e+03, percent-clipped=12.0 2023-06-25 19:47:36,209 INFO [train.py:996] (0/4) Epoch 12, batch 7550, loss[loss=0.1854, simple_loss=0.282, pruned_loss=0.04439, over 21694.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3088, pruned_loss=0.07596, over 4271957.00 frames. ], batch size: 247, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:48:03,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2058006.0, ans=10.0 2023-06-25 19:48:22,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2058066.0, ans=0.125 2023-06-25 19:49:17,685 INFO [train.py:996] (0/4) Epoch 12, batch 7600, loss[loss=0.2247, simple_loss=0.3088, pruned_loss=0.07035, over 21685.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3067, pruned_loss=0.07406, over 4268706.45 frames. ], batch size: 389, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:49:32,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2058246.0, ans=0.2 2023-06-25 19:49:34,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=2058246.0, ans=22.5 2023-06-25 19:50:34,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2058426.0, ans=0.125 2023-06-25 19:51:01,885 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.519e+02 8.115e+02 1.091e+03 1.750e+03 3.922e+03, threshold=2.181e+03, percent-clipped=5.0 2023-06-25 19:51:12,221 INFO [train.py:996] (0/4) Epoch 12, batch 7650, loss[loss=0.1716, simple_loss=0.2764, pruned_loss=0.03341, over 20754.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3051, pruned_loss=0.07564, over 4278315.33 frames. ], batch size: 609, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:51:13,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.24 vs. 
limit=15.0 2023-06-25 19:52:23,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2058726.0, ans=0.125 2023-06-25 19:52:30,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2058726.0, ans=0.1 2023-06-25 19:52:57,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2058786.0, ans=0.1 2023-06-25 19:53:04,317 INFO [train.py:996] (0/4) Epoch 12, batch 7700, loss[loss=0.2499, simple_loss=0.3204, pruned_loss=0.08973, over 21832.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3087, pruned_loss=0.07938, over 4281376.07 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:53:32,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-25 19:53:44,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2058966.0, ans=0.125 2023-06-25 19:53:59,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2058966.0, ans=0.0 2023-06-25 19:54:41,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=22.5 2023-06-25 19:54:52,164 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.071e+02 8.849e+02 1.361e+03 2.010e+03 4.988e+03, threshold=2.722e+03, percent-clipped=19.0 2023-06-25 19:55:00,912 INFO [train.py:996] (0/4) Epoch 12, batch 7750, loss[loss=0.2908, simple_loss=0.4122, pruned_loss=0.08465, over 21228.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3165, pruned_loss=0.08052, over 4280862.38 frames. ], batch size: 549, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 19:56:26,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2059386.0, ans=0.0 2023-06-25 19:56:50,884 INFO [train.py:996] (0/4) Epoch 12, batch 7800, loss[loss=0.2225, simple_loss=0.2929, pruned_loss=0.0761, over 21589.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3179, pruned_loss=0.08095, over 4274826.62 frames. ], batch size: 263, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 19:57:37,207 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.52 vs. limit=15.0 2023-06-25 19:57:52,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2059626.0, ans=0.2 2023-06-25 19:58:08,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2059626.0, ans=0.125 2023-06-25 19:58:11,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2059626.0, ans=0.125 2023-06-25 19:58:38,501 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.766e+02 1.116e+03 1.579e+03 2.542e+03 4.639e+03, threshold=3.158e+03, percent-clipped=18.0 2023-06-25 19:58:42,201 INFO [train.py:996] (0/4) Epoch 12, batch 7850, loss[loss=0.2567, simple_loss=0.3496, pruned_loss=0.08193, over 20819.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3099, pruned_loss=0.07932, over 4266198.04 frames. 
], batch size: 609, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 19:58:47,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2059746.0, ans=0.0 2023-06-25 19:59:07,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2059806.0, ans=0.1 2023-06-25 19:59:08,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2059806.0, ans=0.125 2023-06-25 19:59:16,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2059806.0, ans=0.0 2023-06-25 19:59:46,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2059926.0, ans=0.125 2023-06-25 20:00:21,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2059986.0, ans=0.1 2023-06-25 20:00:23,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2059986.0, ans=0.2 2023-06-25 20:00:23,809 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-25 20:00:33,871 INFO [train.py:996] (0/4) Epoch 12, batch 7900, loss[loss=0.3373, simple_loss=0.4275, pruned_loss=0.1235, over 21413.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3053, pruned_loss=0.07735, over 4256536.98 frames. ], batch size: 507, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:00:38,781 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.67 vs. limit=15.0 2023-06-25 20:01:44,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.90 vs. limit=15.0 2023-06-25 20:02:07,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2060226.0, ans=0.125 2023-06-25 20:02:17,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2060286.0, ans=0.5 2023-06-25 20:02:23,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.116e+02 9.466e+02 1.589e+03 2.565e+03 7.022e+03, threshold=3.178e+03, percent-clipped=11.0 2023-06-25 20:02:27,445 INFO [train.py:996] (0/4) Epoch 12, batch 7950, loss[loss=0.1969, simple_loss=0.2797, pruned_loss=0.05702, over 21477.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3093, pruned_loss=0.07672, over 4261888.33 frames. 
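Most of the [scaling.py:182] entries track a ScheduledFloat value (dropout probabilities, skip rates, bypass scale minima) that changes with batch_count. A piecewise-linear schedule over the batch count, as sketched below, reproduces that qualitative behaviour; the breakpoints and the ScheduledFloatSketch name are illustrative assumptions rather than the real scaling.py class.

import bisect


class ScheduledFloatSketch:
    """A float hyperparameter interpolated piecewise-linearly over batch count."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs in increasing batch_count order,
        # e.g. (0.0, 0.3), (20000.0, 0.1) for a dropout that anneals early on.
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def value_at(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)


dropout_p = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.1))
print(dropout_p.value_at(2059746.0))  # far past the last breakpoint -> 0.1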
], batch size: 211, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:03:17,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2060466.0, ans=0.2 2023-06-25 20:03:35,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2060466.0, ans=0.1 2023-06-25 20:03:37,535 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:04:00,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2060586.0, ans=0.0 2023-06-25 20:04:21,432 INFO [train.py:996] (0/4) Epoch 12, batch 8000, loss[loss=0.2599, simple_loss=0.3765, pruned_loss=0.07166, over 19916.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.315, pruned_loss=0.07888, over 4262690.42 frames. ], batch size: 702, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:04:56,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2060706.0, ans=0.0 2023-06-25 20:05:10,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-06-25 20:05:57,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2060886.0, ans=0.125 2023-06-25 20:06:06,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2060886.0, ans=0.125 2023-06-25 20:06:08,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2060886.0, ans=0.1 2023-06-25 20:06:16,538 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.154e+02 1.106e+03 1.585e+03 2.816e+03 6.535e+03, threshold=3.170e+03, percent-clipped=17.0 2023-06-25 20:06:25,708 INFO [train.py:996] (0/4) Epoch 12, batch 8050, loss[loss=0.2375, simple_loss=0.3193, pruned_loss=0.07791, over 20059.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3191, pruned_loss=0.07961, over 4262353.57 frames. ], batch size: 702, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:06:41,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. 
limit=15.0 2023-06-25 20:07:07,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2061066.0, ans=0.1 2023-06-25 20:07:28,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2061126.0, ans=0.5 2023-06-25 20:07:37,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2061126.0, ans=0.2 2023-06-25 20:07:40,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2061126.0, ans=0.1 2023-06-25 20:07:56,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2061186.0, ans=0.125 2023-06-25 20:08:14,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2061246.0, ans=0.05 2023-06-25 20:08:15,654 INFO [train.py:996] (0/4) Epoch 12, batch 8100, loss[loss=0.2248, simple_loss=0.3044, pruned_loss=0.07262, over 21893.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3181, pruned_loss=0.08043, over 4271302.08 frames. ], batch size: 371, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:08:55,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.15 vs. limit=15.0 2023-06-25 20:09:01,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.85 vs. limit=6.0 2023-06-25 20:09:13,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2061366.0, ans=0.0 2023-06-25 20:09:35,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=22.5 2023-06-25 20:09:42,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2061426.0, ans=0.5 2023-06-25 20:10:04,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2061486.0, ans=0.125 2023-06-25 20:10:11,165 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.662e+02 1.217e+03 1.664e+03 2.613e+03 6.278e+03, threshold=3.327e+03, percent-clipped=14.0 2023-06-25 20:10:14,532 INFO [train.py:996] (0/4) Epoch 12, batch 8150, loss[loss=0.2215, simple_loss=0.3066, pruned_loss=0.06815, over 21468.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3259, pruned_loss=0.08179, over 4273343.91 frames. ], batch size: 212, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:10:31,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.40 vs. limit=15.0 2023-06-25 20:11:15,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2061726.0, ans=0.2 2023-06-25 20:12:03,282 INFO [train.py:996] (0/4) Epoch 12, batch 8200, loss[loss=0.1934, simple_loss=0.2425, pruned_loss=0.07214, over 20189.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.316, pruned_loss=0.07928, over 4268841.44 frames. 
], batch size: 703, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:13:50,888 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.075e+02 8.315e+02 1.225e+03 2.435e+03 5.897e+03, threshold=2.449e+03, percent-clipped=16.0 2023-06-25 20:13:54,419 INFO [train.py:996] (0/4) Epoch 12, batch 8250, loss[loss=0.2392, simple_loss=0.3387, pruned_loss=0.06983, over 21716.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3124, pruned_loss=0.07838, over 4260795.59 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:14:05,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2062146.0, ans=0.025 2023-06-25 20:15:02,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2062326.0, ans=0.125 2023-06-25 20:15:27,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2062386.0, ans=0.125 2023-06-25 20:15:31,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2062386.0, ans=0.2 2023-06-25 20:15:44,097 INFO [train.py:996] (0/4) Epoch 12, batch 8300, loss[loss=0.2694, simple_loss=0.353, pruned_loss=0.09293, over 21516.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3112, pruned_loss=0.07572, over 4266052.04 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:16:01,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2062506.0, ans=0.0 2023-06-25 20:17:19,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2062686.0, ans=0.0 2023-06-25 20:17:29,291 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.578e+02 9.193e+02 1.531e+03 2.133e+03 6.656e+03, threshold=3.063e+03, percent-clipped=18.0 2023-06-25 20:17:32,754 INFO [train.py:996] (0/4) Epoch 12, batch 8350, loss[loss=0.2077, simple_loss=0.2864, pruned_loss=0.06449, over 21698.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3086, pruned_loss=0.0732, over 4269863.32 frames. ], batch size: 282, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:17:42,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2062746.0, ans=0.125 2023-06-25 20:18:35,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2062866.0, ans=0.125 2023-06-25 20:19:26,210 INFO [train.py:996] (0/4) Epoch 12, batch 8400, loss[loss=0.2665, simple_loss=0.3899, pruned_loss=0.07152, over 20755.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3075, pruned_loss=0.07159, over 4270664.82 frames. ], batch size: 607, lr: 2.44e-03, grad_scale: 32.0 2023-06-25 20:19:36,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2063046.0, ans=0.125 2023-06-25 20:19:39,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.59 vs. 
limit=12.0 2023-06-25 20:21:02,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2063286.0, ans=0.2 2023-06-25 20:21:02,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2063286.0, ans=0.0 2023-06-25 20:21:14,095 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.481e+02 1.044e+03 1.681e+03 2.727e+03 5.790e+03, threshold=3.363e+03, percent-clipped=19.0 2023-06-25 20:21:14,126 INFO [train.py:996] (0/4) Epoch 12, batch 8450, loss[loss=0.2054, simple_loss=0.2805, pruned_loss=0.06511, over 21828.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3053, pruned_loss=0.0712, over 4282010.32 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:21:23,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2063346.0, ans=0.1 2023-06-25 20:22:54,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2063586.0, ans=0.125 2023-06-25 20:23:03,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2063646.0, ans=0.125 2023-06-25 20:23:04,276 INFO [train.py:996] (0/4) Epoch 12, batch 8500, loss[loss=0.2194, simple_loss=0.2795, pruned_loss=0.07963, over 21698.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3011, pruned_loss=0.07251, over 4272971.21 frames. ], batch size: 264, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:23:09,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2063646.0, ans=0.1 2023-06-25 20:23:24,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-25 20:24:38,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2063826.0, ans=0.0 2023-06-25 20:24:47,318 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.52 vs. limit=10.0 2023-06-25 20:24:58,605 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.271e+02 9.731e+02 1.377e+03 2.109e+03 5.965e+03, threshold=2.755e+03, percent-clipped=8.0 2023-06-25 20:24:58,638 INFO [train.py:996] (0/4) Epoch 12, batch 8550, loss[loss=0.2286, simple_loss=0.2994, pruned_loss=0.07885, over 21778.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3063, pruned_loss=0.07516, over 4266507.76 frames. ], batch size: 118, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:25:19,023 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-344000.pt 2023-06-25 20:25:46,578 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.68 vs. limit=6.0 2023-06-25 20:26:15,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.11 vs. 
limit=8.0 2023-06-25 20:26:19,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2064126.0, ans=0.125 2023-06-25 20:26:57,310 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. limit=10.0 2023-06-25 20:26:57,803 INFO [train.py:996] (0/4) Epoch 12, batch 8600, loss[loss=0.2532, simple_loss=0.3309, pruned_loss=0.08773, over 21743.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3123, pruned_loss=0.07767, over 4264291.86 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:27:40,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2064306.0, ans=0.125 2023-06-25 20:28:28,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2064486.0, ans=0.2 2023-06-25 20:28:30,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2064486.0, ans=0.125 2023-06-25 20:28:41,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2064486.0, ans=0.0 2023-06-25 20:28:50,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2064486.0, ans=0.0 2023-06-25 20:28:53,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.484e+02 8.709e+02 1.194e+03 1.940e+03 4.638e+03, threshold=2.389e+03, percent-clipped=12.0 2023-06-25 20:28:53,092 INFO [train.py:996] (0/4) Epoch 12, batch 8650, loss[loss=0.1931, simple_loss=0.2984, pruned_loss=0.04395, over 21778.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3186, pruned_loss=0.0798, over 4263992.59 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:30:10,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2064726.0, ans=0.0 2023-06-25 20:30:22,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2064786.0, ans=0.125 2023-06-25 20:30:26,098 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-25 20:30:40,450 INFO [train.py:996] (0/4) Epoch 12, batch 8700, loss[loss=0.2396, simple_loss=0.2904, pruned_loss=0.09443, over 21221.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3099, pruned_loss=0.07566, over 4266921.49 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:30:57,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2064906.0, ans=0.125 2023-06-25 20:31:52,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2065026.0, ans=0.0 2023-06-25 20:32:25,090 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.287e+02 9.432e+02 1.371e+03 2.158e+03 4.053e+03, threshold=2.743e+03, percent-clipped=19.0 2023-06-25 20:32:25,128 INFO [train.py:996] (0/4) Epoch 12, batch 8750, loss[loss=0.2524, simple_loss=0.3238, pruned_loss=0.09048, over 21899.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.306, pruned_loss=0.07607, over 4269553.65 frames. 
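The [checkpoint.py:75] line a little above records a checkpoint written to zipformer/exp_L_small_causal/checkpoint-344000.pt. A minimal sketch of batch-level checkpointing with a retention limit follows; maybe_save_checkpoint, its arguments, and the pruning rule are illustrative assumptions, and save_every_n / keep_last_k would come from the run's own configuration rather than from this code.

from pathlib import Path

import torch


def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                          exp_dir: Path, save_every_n: int,
                          keep_last_k: int) -> None:
    """Save checkpoint-<batch>.pt every save_every_n batches and prune old ones."""
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        path,
    )
    # Keep only the newest keep_last_k batch-level checkpoints.
    ckpts = sorted(
        exp_dir.glob("checkpoint-*.pt"),
        key=lambda p: int(p.stem.split("-")[1]),
    )
    for old in ckpts[:-keep_last_k]:
        old.unlink()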
], batch size: 107, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:32:46,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2065206.0, ans=0.2 2023-06-25 20:32:52,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2065206.0, ans=0.125 2023-06-25 20:32:52,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2065206.0, ans=0.07 2023-06-25 20:32:59,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2065206.0, ans=0.1 2023-06-25 20:33:29,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2065266.0, ans=0.0 2023-06-25 20:34:21,901 INFO [train.py:996] (0/4) Epoch 12, batch 8800, loss[loss=0.2738, simple_loss=0.3535, pruned_loss=0.09709, over 21286.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3157, pruned_loss=0.07909, over 4271538.40 frames. ], batch size: 548, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:35:04,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2065506.0, ans=0.1 2023-06-25 20:35:13,739 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-25 20:36:12,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.737e+02 1.070e+03 1.502e+03 2.029e+03 4.867e+03, threshold=3.004e+03, percent-clipped=9.0 2023-06-25 20:36:12,879 INFO [train.py:996] (0/4) Epoch 12, batch 8850, loss[loss=0.2794, simple_loss=0.3274, pruned_loss=0.1156, over 21391.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3211, pruned_loss=0.08035, over 4275263.72 frames. ], batch size: 508, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:36:36,167 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-25 20:36:40,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2065806.0, ans=0.0 2023-06-25 20:37:51,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2065986.0, ans=0.09899494936611666 2023-06-25 20:38:14,937 INFO [train.py:996] (0/4) Epoch 12, batch 8900, loss[loss=0.2019, simple_loss=0.2742, pruned_loss=0.06483, over 21542.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3153, pruned_loss=0.07921, over 4274334.88 frames. ], batch size: 230, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:38:26,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2066046.0, ans=0.2 2023-06-25 20:38:28,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-25 20:38:31,964 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. 
limit=15.0 2023-06-25 20:39:01,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2066166.0, ans=0.2 2023-06-25 20:39:10,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2066166.0, ans=0.125 2023-06-25 20:39:33,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2066226.0, ans=0.125 2023-06-25 20:39:48,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2066286.0, ans=0.125 2023-06-25 20:40:06,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2066286.0, ans=0.07 2023-06-25 20:40:07,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2066346.0, ans=0.0 2023-06-25 20:40:09,302 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.332e+02 9.490e+02 1.490e+03 1.972e+03 6.536e+03, threshold=2.979e+03, percent-clipped=12.0 2023-06-25 20:40:09,333 INFO [train.py:996] (0/4) Epoch 12, batch 8950, loss[loss=0.2415, simple_loss=0.3242, pruned_loss=0.0794, over 21857.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3173, pruned_loss=0.07839, over 4276748.62 frames. ], batch size: 317, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:41:31,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2066526.0, ans=0.025 2023-06-25 20:41:32,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.10 vs. limit=10.0 2023-06-25 20:41:57,053 INFO [train.py:996] (0/4) Epoch 12, batch 9000, loss[loss=0.2203, simple_loss=0.2986, pruned_loss=0.071, over 21579.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.311, pruned_loss=0.07831, over 4277220.38 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:41:57,055 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 20:42:15,055 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2658, simple_loss=0.3589, pruned_loss=0.08634, over 1796401.00 frames. 2023-06-25 20:42:15,056 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 20:42:19,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2066646.0, ans=0.125 2023-06-25 20:42:56,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.58 vs. limit=22.5 2023-06-25 20:43:19,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2066766.0, ans=0.2 2023-06-25 20:43:41,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.22 vs. limit=22.5 2023-06-25 20:44:02,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.409e+02 9.189e+02 1.168e+03 1.657e+03 3.184e+03, threshold=2.336e+03, percent-clipped=2.0 2023-06-25 20:44:02,576 INFO [train.py:996] (0/4) Epoch 12, batch 9050, loss[loss=0.2195, simple_loss=0.3007, pruned_loss=0.06916, over 21351.00 frames. 
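The "Computing validation loss" / "Epoch 12, validation: loss=..." lines above come from a periodic pass over the dev set. A sketch of such a pass is below; the batch layout (inputs plus supervision frame counts) and the model call returning a scalar per-frame loss are assumptions for illustration, and the peak-memory figure is presumably read from torch.cuda.max_memory_allocated() rather than from anything in this sketch.

import torch


@torch.no_grad()
def compute_validation_loss(model, valid_dl, device) -> float:
    """Average loss per frame over the validation dataloader."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in valid_dl:
        feats = batch["inputs"].to(device)                    # assumed batch layout
        num_frames = batch["supervisions"]["num_frames"].to(device)
        loss = model(feats, num_frames)                       # assumed scalar per-frame loss
        tot_loss += loss.item() * num_frames.sum().item()
        tot_frames += num_frames.sum().item()
    model.train()
    return tot_loss / max(tot_frames, 1.0)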
], tot_loss[loss=0.2278, simple_loss=0.3053, pruned_loss=0.07514, over 4276840.58 frames. ], batch size: 549, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:44:04,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2066946.0, ans=0.125 2023-06-25 20:45:59,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-25 20:45:59,857 INFO [train.py:996] (0/4) Epoch 12, batch 9100, loss[loss=0.2275, simple_loss=0.3261, pruned_loss=0.06447, over 21618.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3099, pruned_loss=0.07777, over 4275926.84 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:46:14,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2067246.0, ans=0.125 2023-06-25 20:46:55,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-06-25 20:47:02,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2067366.0, ans=0.125 2023-06-25 20:47:40,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2067486.0, ans=0.125 2023-06-25 20:47:40,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2067486.0, ans=0.125 2023-06-25 20:47:56,040 INFO [train.py:996] (0/4) Epoch 12, batch 9150, loss[loss=0.2028, simple_loss=0.292, pruned_loss=0.05679, over 21325.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3131, pruned_loss=0.07567, over 4275329.46 frames. ], batch size: 159, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:47:57,744 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.614e+02 8.747e+02 1.495e+03 2.212e+03 5.275e+03, threshold=2.990e+03, percent-clipped=21.0 2023-06-25 20:48:51,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2067666.0, ans=0.125 2023-06-25 20:49:44,679 INFO [train.py:996] (0/4) Epoch 12, batch 9200, loss[loss=0.2812, simple_loss=0.354, pruned_loss=0.1042, over 21802.00 frames. ], tot_loss[loss=0.234, simple_loss=0.316, pruned_loss=0.07598, over 4272608.47 frames. ], batch size: 124, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:50:23,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2067906.0, ans=0.125 2023-06-25 20:51:31,765 INFO [train.py:996] (0/4) Epoch 12, batch 9250, loss[loss=0.2717, simple_loss=0.337, pruned_loss=0.1032, over 21325.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3181, pruned_loss=0.07873, over 4277060.29 frames. 
], batch size: 548, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:51:33,453 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.008e+02 1.206e+03 1.769e+03 2.365e+03 5.380e+03, threshold=3.537e+03, percent-clipped=9.0 2023-06-25 20:51:37,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2068146.0, ans=0.125 2023-06-25 20:52:14,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2068206.0, ans=0.0 2023-06-25 20:52:23,652 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-25 20:52:29,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2068266.0, ans=0.1 2023-06-25 20:53:22,062 INFO [train.py:996] (0/4) Epoch 12, batch 9300, loss[loss=0.2117, simple_loss=0.2919, pruned_loss=0.06573, over 21520.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3129, pruned_loss=0.07842, over 4271710.42 frames. ], batch size: 195, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:53:39,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2068446.0, ans=0.125 2023-06-25 20:53:39,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2068446.0, ans=0.2 2023-06-25 20:54:04,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2068506.0, ans=0.125 2023-06-25 20:54:50,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2068626.0, ans=0.125 2023-06-25 20:55:18,078 INFO [train.py:996] (0/4) Epoch 12, batch 9350, loss[loss=0.287, simple_loss=0.3539, pruned_loss=0.1101, over 21260.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3189, pruned_loss=0.07924, over 4268957.88 frames. ], batch size: 143, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:55:19,921 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.869e+02 1.391e+03 2.110e+03 3.228e+03 6.570e+03, threshold=4.220e+03, percent-clipped=18.0 2023-06-25 20:55:57,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2068866.0, ans=0.125 2023-06-25 20:56:21,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2068926.0, ans=0.125 2023-06-25 20:56:23,689 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-25 20:57:06,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2068986.0, ans=0.0 2023-06-25 20:57:09,367 INFO [train.py:996] (0/4) Epoch 12, batch 9400, loss[loss=0.2163, simple_loss=0.2929, pruned_loss=0.06989, over 21508.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3203, pruned_loss=0.07993, over 4259107.08 frames. 
], batch size: 389, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:57:21,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2069046.0, ans=0.025 2023-06-25 20:57:50,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2069106.0, ans=0.2 2023-06-25 20:58:12,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=22.5 2023-06-25 20:58:39,637 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-25 20:58:57,587 INFO [train.py:996] (0/4) Epoch 12, batch 9450, loss[loss=0.3178, simple_loss=0.4377, pruned_loss=0.09895, over 19713.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3144, pruned_loss=0.07862, over 4262559.29 frames. ], batch size: 702, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:58:59,291 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.200e+02 9.646e+02 1.392e+03 2.055e+03 4.300e+03, threshold=2.785e+03, percent-clipped=2.0 2023-06-25 20:59:01,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2069346.0, ans=0.125 2023-06-25 20:59:53,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2069466.0, ans=0.125 2023-06-25 21:00:45,750 INFO [train.py:996] (0/4) Epoch 12, batch 9500, loss[loss=0.2092, simple_loss=0.2733, pruned_loss=0.0725, over 21137.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3069, pruned_loss=0.07731, over 4260634.53 frames. ], batch size: 159, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:01:54,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2069826.0, ans=0.125 2023-06-25 21:02:37,412 INFO [train.py:996] (0/4) Epoch 12, batch 9550, loss[loss=0.2566, simple_loss=0.335, pruned_loss=0.08907, over 21774.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3078, pruned_loss=0.07827, over 4264645.09 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:02:40,602 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.839e+02 1.229e+03 2.098e+03 2.991e+03 5.309e+03, threshold=4.197e+03, percent-clipped=32.0 2023-06-25 21:02:49,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2069946.0, ans=0.125 2023-06-25 21:03:35,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2070066.0, ans=0.0 2023-06-25 21:04:29,477 INFO [train.py:996] (0/4) Epoch 12, batch 9600, loss[loss=0.2192, simple_loss=0.2889, pruned_loss=0.0748, over 21783.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3127, pruned_loss=0.07985, over 4271366.94 frames. 
], batch size: 282, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:04:30,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2070246.0, ans=15.0 2023-06-25 21:04:47,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2070306.0, ans=0.0 2023-06-25 21:05:35,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2023-06-25 21:05:59,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2070426.0, ans=0.0 2023-06-25 21:06:19,450 INFO [train.py:996] (0/4) Epoch 12, batch 9650, loss[loss=0.2655, simple_loss=0.3402, pruned_loss=0.09546, over 21707.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3121, pruned_loss=0.07927, over 4272864.21 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:06:23,083 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.038e+02 1.081e+03 1.726e+03 2.542e+03 4.912e+03, threshold=3.453e+03, percent-clipped=2.0 2023-06-25 21:06:38,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2070546.0, ans=0.0 2023-06-25 21:07:07,344 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-25 21:07:45,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2070726.0, ans=0.0 2023-06-25 21:07:45,667 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5 2023-06-25 21:08:13,747 INFO [train.py:996] (0/4) Epoch 12, batch 9700, loss[loss=0.2793, simple_loss=0.353, pruned_loss=0.1028, over 21395.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3152, pruned_loss=0.0797, over 4277006.18 frames. ], batch size: 548, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:08:19,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2070846.0, ans=0.04949747468305833 2023-06-25 21:08:36,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2070906.0, ans=0.0 2023-06-25 21:09:11,141 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.79 vs. limit=6.0 2023-06-25 21:09:54,414 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-25 21:10:01,548 INFO [train.py:996] (0/4) Epoch 12, batch 9750, loss[loss=0.2107, simple_loss=0.2729, pruned_loss=0.07427, over 21577.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3107, pruned_loss=0.07895, over 4273972.91 frames. 
], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:10:04,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.823e+02 1.223e+03 1.776e+03 2.509e+03 4.467e+03, threshold=3.552e+03, percent-clipped=5.0 2023-06-25 21:10:10,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2071146.0, ans=0.125 2023-06-25 21:10:21,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2071146.0, ans=0.0 2023-06-25 21:11:05,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2071326.0, ans=0.5 2023-06-25 21:11:20,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2071326.0, ans=0.025 2023-06-25 21:11:32,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2071386.0, ans=0.0 2023-06-25 21:11:43,855 INFO [train.py:996] (0/4) Epoch 12, batch 9800, loss[loss=0.2339, simple_loss=0.3092, pruned_loss=0.07924, over 21908.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3091, pruned_loss=0.07896, over 4273628.14 frames. ], batch size: 333, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:12:26,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2071506.0, ans=0.07 2023-06-25 21:13:06,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2071626.0, ans=0.0 2023-06-25 21:13:29,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2071686.0, ans=0.125 2023-06-25 21:13:35,343 INFO [train.py:996] (0/4) Epoch 12, batch 9850, loss[loss=0.1829, simple_loss=0.2463, pruned_loss=0.05976, over 21618.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3049, pruned_loss=0.0784, over 4283305.09 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:13:44,003 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.466e+02 9.389e+02 1.315e+03 1.721e+03 3.595e+03, threshold=2.631e+03, percent-clipped=2.0 2023-06-25 21:13:44,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2071746.0, ans=0.07 2023-06-25 21:15:31,659 INFO [train.py:996] (0/4) Epoch 12, batch 9900, loss[loss=0.2662, simple_loss=0.3369, pruned_loss=0.0977, over 21833.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3024, pruned_loss=0.07837, over 4286063.14 frames. ], batch size: 371, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:16:12,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2072166.0, ans=0.0 2023-06-25 21:16:13,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2072166.0, ans=0.125 2023-06-25 21:17:16,886 INFO [train.py:996] (0/4) Epoch 12, batch 9950, loss[loss=0.2677, simple_loss=0.3448, pruned_loss=0.09531, over 21407.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.302, pruned_loss=0.07989, over 4278949.80 frames. 
], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:17:25,544 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.617e+02 8.355e+02 1.275e+03 2.214e+03 4.972e+03, threshold=2.550e+03, percent-clipped=15.0 2023-06-25 21:17:31,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2072346.0, ans=0.125 2023-06-25 21:19:16,258 INFO [train.py:996] (0/4) Epoch 12, batch 10000, loss[loss=0.2085, simple_loss=0.2879, pruned_loss=0.06458, over 21961.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2983, pruned_loss=0.07851, over 4271695.86 frames. ], batch size: 317, lr: 2.44e-03, grad_scale: 32.0 2023-06-25 21:19:36,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2072646.0, ans=0.2 2023-06-25 21:20:17,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2072766.0, ans=0.125 2023-06-25 21:20:19,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2072766.0, ans=0.125 2023-06-25 21:20:20,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2072766.0, ans=0.0 2023-06-25 21:21:00,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2072886.0, ans=0.1 2023-06-25 21:21:04,424 INFO [train.py:996] (0/4) Epoch 12, batch 10050, loss[loss=0.2032, simple_loss=0.2861, pruned_loss=0.06019, over 21066.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2982, pruned_loss=0.07791, over 4271034.59 frames. ], batch size: 607, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:21:16,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.942e+02 8.782e+02 1.169e+03 1.851e+03 4.390e+03, threshold=2.338e+03, percent-clipped=10.0 2023-06-25 21:21:26,633 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-06-25 21:21:35,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=2073006.0, ans=6.0 2023-06-25 21:22:05,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2073066.0, ans=0.125 2023-06-25 21:22:06,482 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.13 vs. limit=10.0 2023-06-25 21:22:55,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2073186.0, ans=0.125 2023-06-25 21:23:02,172 INFO [train.py:996] (0/4) Epoch 12, batch 10100, loss[loss=0.2312, simple_loss=0.3007, pruned_loss=0.08087, over 21821.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2965, pruned_loss=0.07636, over 4265535.32 frames. ], batch size: 124, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:23:19,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.76 vs. limit=15.0 2023-06-25 21:24:01,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.98 vs. 
limit=15.0 2023-06-25 21:24:11,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2073426.0, ans=0.0 2023-06-25 21:24:23,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2073426.0, ans=0.125 2023-06-25 21:24:50,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-25 21:24:52,537 INFO [train.py:996] (0/4) Epoch 12, batch 10150, loss[loss=0.2648, simple_loss=0.3461, pruned_loss=0.09175, over 21734.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3035, pruned_loss=0.07888, over 4266517.06 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:24:58,888 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.161e+02 9.851e+02 1.656e+03 2.564e+03 6.129e+03, threshold=3.312e+03, percent-clipped=27.0 2023-06-25 21:25:01,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2073546.0, ans=0.05 2023-06-25 21:25:10,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2073606.0, ans=0.0 2023-06-25 21:25:19,471 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-25 21:25:33,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2073606.0, ans=0.125 2023-06-25 21:25:51,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2073666.0, ans=0.2 2023-06-25 21:26:30,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2073786.0, ans=0.0 2023-06-25 21:26:41,529 INFO [train.py:996] (0/4) Epoch 12, batch 10200, loss[loss=0.2173, simple_loss=0.2865, pruned_loss=0.0741, over 21802.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3033, pruned_loss=0.07748, over 4265244.14 frames. ], batch size: 124, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:26:46,485 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=22.5 2023-06-25 21:27:07,694 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-06-25 21:27:32,143 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-25 21:27:35,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. 
limit=22.5 2023-06-25 21:27:38,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2073966.0, ans=0.2 2023-06-25 21:28:00,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2074026.0, ans=0.2 2023-06-25 21:28:12,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2074086.0, ans=0.0 2023-06-25 21:28:31,304 INFO [train.py:996] (0/4) Epoch 12, batch 10250, loss[loss=0.1513, simple_loss=0.2381, pruned_loss=0.03218, over 21627.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2972, pruned_loss=0.07113, over 4262980.78 frames. ], batch size: 195, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:28:38,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.344e+02 7.759e+02 1.163e+03 1.703e+03 3.224e+03, threshold=2.326e+03, percent-clipped=0.0 2023-06-25 21:29:08,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2074206.0, ans=0.2 2023-06-25 21:29:10,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-25 21:29:36,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=8.0 2023-06-25 21:30:01,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2074326.0, ans=0.125 2023-06-25 21:30:24,254 INFO [train.py:996] (0/4) Epoch 12, batch 10300, loss[loss=0.2221, simple_loss=0.3103, pruned_loss=0.06695, over 21442.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3029, pruned_loss=0.07387, over 4269260.67 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:30:45,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.00 vs. limit=15.0 2023-06-25 21:31:00,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2074506.0, ans=0.125 2023-06-25 21:32:03,155 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:32:16,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5 2023-06-25 21:32:24,459 INFO [train.py:996] (0/4) Epoch 12, batch 10350, loss[loss=0.1993, simple_loss=0.2626, pruned_loss=0.06802, over 21541.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3058, pruned_loss=0.07452, over 4264597.97 frames. ], batch size: 212, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:32:31,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.446e+02 9.833e+02 1.578e+03 2.772e+03 4.867e+03, threshold=3.157e+03, percent-clipped=30.0 2023-06-25 21:32:46,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2074806.0, ans=0.1 2023-06-25 21:32:53,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2074806.0, ans=0.0 2023-06-25 21:34:19,106 INFO [train.py:996] (0/4) Epoch 12, batch 10400, loss[loss=0.179, simple_loss=0.2412, pruned_loss=0.05847, over 21488.00 frames. 
], tot_loss[loss=0.223, simple_loss=0.2993, pruned_loss=0.07341, over 4261946.55 frames. ], batch size: 212, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:36:14,873 INFO [train.py:996] (0/4) Epoch 12, batch 10450, loss[loss=0.2263, simple_loss=0.3136, pruned_loss=0.06944, over 21848.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3032, pruned_loss=0.0757, over 4259129.03 frames. ], batch size: 316, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:36:17,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2075346.0, ans=0.125 2023-06-25 21:36:22,630 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.406e+02 1.055e+03 1.797e+03 3.015e+03 7.446e+03, threshold=3.594e+03, percent-clipped=22.0 2023-06-25 21:36:42,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2075406.0, ans=0.0 2023-06-25 21:36:50,031 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.79 vs. limit=22.5 2023-06-25 21:38:00,982 INFO [train.py:996] (0/4) Epoch 12, batch 10500, loss[loss=0.2548, simple_loss=0.3199, pruned_loss=0.09484, over 21307.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3033, pruned_loss=0.07443, over 4258936.62 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:38:03,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2075646.0, ans=0.125 2023-06-25 21:38:05,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-25 21:38:22,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2075706.0, ans=0.0 2023-06-25 21:38:50,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2075766.0, ans=0.125 2023-06-25 21:38:56,479 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.39 vs. limit=15.0 2023-06-25 21:39:00,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.68 vs. limit=22.5 2023-06-25 21:39:22,750 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:39:53,027 INFO [train.py:996] (0/4) Epoch 12, batch 10550, loss[loss=0.2076, simple_loss=0.2724, pruned_loss=0.07137, over 21666.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.298, pruned_loss=0.07368, over 4253837.69 frames. ], batch size: 282, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:40:05,047 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.958e+02 1.039e+03 1.480e+03 2.237e+03 5.696e+03, threshold=2.960e+03, percent-clipped=5.0 2023-06-25 21:40:41,824 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.69 vs. 
limit=15.0 2023-06-25 21:40:58,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2076126.0, ans=0.2 2023-06-25 21:41:08,706 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-25 21:41:27,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2076186.0, ans=0.125 2023-06-25 21:41:39,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2076186.0, ans=0.2 2023-06-25 21:41:44,556 INFO [train.py:996] (0/4) Epoch 12, batch 10600, loss[loss=0.2306, simple_loss=0.3481, pruned_loss=0.05656, over 20747.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2933, pruned_loss=0.07239, over 4257457.24 frames. ], batch size: 607, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:42:30,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2076366.0, ans=0.1 2023-06-25 21:43:04,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2076426.0, ans=0.125 2023-06-25 21:43:41,772 INFO [train.py:996] (0/4) Epoch 12, batch 10650, loss[loss=0.248, simple_loss=0.3282, pruned_loss=0.08393, over 21867.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2959, pruned_loss=0.07081, over 4248910.41 frames. ], batch size: 372, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:43:48,915 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.385e+02 8.599e+02 1.360e+03 2.205e+03 4.639e+03, threshold=2.719e+03, percent-clipped=13.0 2023-06-25 21:43:59,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2076546.0, ans=0.125 2023-06-25 21:44:02,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2076546.0, ans=0.0 2023-06-25 21:45:29,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2076786.0, ans=0.125 2023-06-25 21:45:32,598 INFO [train.py:996] (0/4) Epoch 12, batch 10700, loss[loss=0.1949, simple_loss=0.2588, pruned_loss=0.06548, over 16587.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2935, pruned_loss=0.07017, over 4244988.33 frames. ], batch size: 60, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:45:52,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2076846.0, ans=0.125 2023-06-25 21:46:43,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2077026.0, ans=0.125 2023-06-25 21:47:29,867 INFO [train.py:996] (0/4) Epoch 12, batch 10750, loss[loss=0.3494, simple_loss=0.4269, pruned_loss=0.1359, over 21607.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3058, pruned_loss=0.07481, over 4253333.07 frames. 
], batch size: 508, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:47:33,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2077146.0, ans=0.1 2023-06-25 21:47:36,397 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.154e+02 8.331e+02 1.141e+03 1.485e+03 5.663e+03, threshold=2.282e+03, percent-clipped=4.0 2023-06-25 21:47:37,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2077146.0, ans=0.125 2023-06-25 21:47:40,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2077146.0, ans=0.125 2023-06-25 21:47:47,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2077206.0, ans=0.125 2023-06-25 21:48:03,052 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-06-25 21:48:14,831 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.22 vs. limit=6.0 2023-06-25 21:48:21,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2077266.0, ans=0.1 2023-06-25 21:48:39,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2077326.0, ans=0.0 2023-06-25 21:48:42,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=15.0 2023-06-25 21:49:20,792 INFO [train.py:996] (0/4) Epoch 12, batch 10800, loss[loss=0.2863, simple_loss=0.3506, pruned_loss=0.111, over 21443.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3098, pruned_loss=0.07552, over 4259708.79 frames. ], batch size: 194, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 21:49:22,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=12.0 2023-06-25 21:50:26,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2077566.0, ans=0.2 2023-06-25 21:50:37,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2077626.0, ans=0.1 2023-06-25 21:51:00,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2077686.0, ans=0.125 2023-06-25 21:51:14,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2077746.0, ans=0.125 2023-06-25 21:51:15,770 INFO [train.py:996] (0/4) Epoch 12, batch 10850, loss[loss=0.1945, simple_loss=0.262, pruned_loss=0.06353, over 21601.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3112, pruned_loss=0.07599, over 4265731.00 frames. 
], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:51:31,524 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.495e+02 1.105e+03 1.666e+03 2.535e+03 5.598e+03, threshold=3.333e+03, percent-clipped=30.0 2023-06-25 21:51:51,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-25 21:52:17,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2077866.0, ans=0.035 2023-06-25 21:52:42,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.99 vs. limit=15.0 2023-06-25 21:53:16,021 INFO [train.py:996] (0/4) Epoch 12, batch 10900, loss[loss=0.2038, simple_loss=0.2639, pruned_loss=0.07185, over 21861.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3043, pruned_loss=0.07463, over 4264561.23 frames. ], batch size: 107, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:54:02,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2078166.0, ans=0.1 2023-06-25 21:54:30,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2078226.0, ans=0.125 2023-06-25 21:55:00,781 INFO [train.py:996] (0/4) Epoch 12, batch 10950, loss[loss=0.1857, simple_loss=0.2525, pruned_loss=0.05949, over 21298.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3008, pruned_loss=0.07316, over 4264799.23 frames. ], batch size: 144, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:55:18,497 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.435e+02 8.397e+02 1.288e+03 2.017e+03 5.203e+03, threshold=2.576e+03, percent-clipped=6.0 2023-06-25 21:55:21,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2078346.0, ans=0.125 2023-06-25 21:55:21,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2078346.0, ans=0.125 2023-06-25 21:55:54,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2078466.0, ans=0.0 2023-06-25 21:56:29,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2078586.0, ans=0.015 2023-06-25 21:56:47,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2078586.0, ans=0.125 2023-06-25 21:56:51,909 INFO [train.py:996] (0/4) Epoch 12, batch 11000, loss[loss=0.2009, simple_loss=0.2735, pruned_loss=0.06412, over 21610.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2997, pruned_loss=0.07344, over 4269555.27 frames. ], batch size: 212, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:57:25,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2078706.0, ans=0.125 2023-06-25 21:58:39,780 INFO [train.py:996] (0/4) Epoch 12, batch 11050, loss[loss=0.2091, simple_loss=0.2722, pruned_loss=0.07298, over 21800.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2972, pruned_loss=0.07462, over 4269118.85 frames. 
], batch size: 107, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:58:40,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2078946.0, ans=0.0 2023-06-25 21:58:57,594 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.227e+02 8.641e+02 1.192e+03 1.708e+03 4.413e+03, threshold=2.383e+03, percent-clipped=10.0 2023-06-25 21:59:28,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-25 21:59:54,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2079126.0, ans=0.0 2023-06-25 22:00:30,687 INFO [train.py:996] (0/4) Epoch 12, batch 11100, loss[loss=0.206, simple_loss=0.2683, pruned_loss=0.0719, over 21850.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2953, pruned_loss=0.07476, over 4270240.80 frames. ], batch size: 118, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:00:43,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2079246.0, ans=0.0 2023-06-25 22:00:49,081 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:01:45,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-25 22:02:21,206 INFO [train.py:996] (0/4) Epoch 12, batch 11150, loss[loss=0.2176, simple_loss=0.2874, pruned_loss=0.07384, over 21679.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2922, pruned_loss=0.07449, over 4270466.23 frames. ], batch size: 112, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:02:24,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2079546.0, ans=0.2 2023-06-25 22:02:35,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=2079546.0, ans=0.2 2023-06-25 22:02:38,386 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.021e+02 8.746e+02 1.220e+03 1.810e+03 4.073e+03, threshold=2.441e+03, percent-clipped=6.0 2023-06-25 22:03:11,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2079666.0, ans=0.125 2023-06-25 22:03:27,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=2079726.0, ans=10.0 2023-06-25 22:03:36,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5 2023-06-25 22:03:54,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2079786.0, ans=0.125 2023-06-25 22:04:09,074 INFO [train.py:996] (0/4) Epoch 12, batch 11200, loss[loss=0.2173, simple_loss=0.2797, pruned_loss=0.07745, over 21596.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2917, pruned_loss=0.07424, over 4272790.17 frames. 
], batch size: 332, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:04:42,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2079906.0, ans=0.0 2023-06-25 22:04:42,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2079906.0, ans=0.5 2023-06-25 22:05:22,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2080026.0, ans=0.0 2023-06-25 22:05:55,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2080146.0, ans=0.125 2023-06-25 22:05:56,529 INFO [train.py:996] (0/4) Epoch 12, batch 11250, loss[loss=0.1937, simple_loss=0.2874, pruned_loss=0.05003, over 21694.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2901, pruned_loss=0.0741, over 4260814.01 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:06:14,390 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.198e+02 9.018e+02 1.214e+03 1.830e+03 3.568e+03, threshold=2.429e+03, percent-clipped=9.0 2023-06-25 22:06:27,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2080206.0, ans=0.125 2023-06-25 22:06:49,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2080266.0, ans=0.0 2023-06-25 22:06:55,041 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0 2023-06-25 22:07:18,365 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=22.5 2023-06-25 22:07:45,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2080446.0, ans=0.125 2023-06-25 22:07:46,511 INFO [train.py:996] (0/4) Epoch 12, batch 11300, loss[loss=0.2129, simple_loss=0.2947, pruned_loss=0.06551, over 21872.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2934, pruned_loss=0.07504, over 4272795.25 frames. ], batch size: 316, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:08:06,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2080446.0, ans=0.125 2023-06-25 22:08:09,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2080506.0, ans=0.95 2023-06-25 22:08:39,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2080566.0, ans=0.2 2023-06-25 22:09:41,342 INFO [train.py:996] (0/4) Epoch 12, batch 11350, loss[loss=0.2609, simple_loss=0.34, pruned_loss=0.09092, over 21742.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.295, pruned_loss=0.07409, over 4277768.72 frames. ], batch size: 351, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:09:54,131 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.747e+02 9.372e+02 1.256e+03 1.751e+03 3.011e+03, threshold=2.512e+03, percent-clipped=9.0 2023-06-25 22:10:07,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. 
limit=10.0 2023-06-25 22:10:27,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2080866.0, ans=0.125 2023-06-25 22:10:37,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2080866.0, ans=0.125 2023-06-25 22:10:43,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2080866.0, ans=0.0 2023-06-25 22:11:23,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2080986.0, ans=0.125 2023-06-25 22:11:26,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2080986.0, ans=0.125 2023-06-25 22:11:36,921 INFO [train.py:996] (0/4) Epoch 12, batch 11400, loss[loss=0.2578, simple_loss=0.3515, pruned_loss=0.08205, over 21242.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2994, pruned_loss=0.07628, over 4275907.71 frames. ], batch size: 549, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:12:07,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2081106.0, ans=0.125 2023-06-25 22:12:28,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2081166.0, ans=0.0 2023-06-25 22:12:30,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2081166.0, ans=0.95 2023-06-25 22:13:06,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2081226.0, ans=0.125 2023-06-25 22:13:11,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2081286.0, ans=0.09899494936611666 2023-06-25 22:13:15,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2081286.0, ans=0.125 2023-06-25 22:13:20,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2081286.0, ans=0.125 2023-06-25 22:13:26,728 INFO [train.py:996] (0/4) Epoch 12, batch 11450, loss[loss=0.2067, simple_loss=0.2756, pruned_loss=0.06887, over 21207.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3023, pruned_loss=0.07605, over 4282093.34 frames. ], batch size: 608, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:13:34,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2081346.0, ans=0.0 2023-06-25 22:13:46,245 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.691e+02 8.654e+02 1.291e+03 1.959e+03 4.523e+03, threshold=2.583e+03, percent-clipped=10.0 2023-06-25 22:14:11,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2081466.0, ans=0.2 2023-06-25 22:15:02,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2081586.0, ans=0.125 2023-06-25 22:15:15,781 INFO [train.py:996] (0/4) Epoch 12, batch 11500, loss[loss=0.2496, simple_loss=0.3274, pruned_loss=0.08591, over 21788.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.306, pruned_loss=0.07707, over 4280269.01 frames. 
], batch size: 441, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:15:34,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-25 22:15:40,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2081706.0, ans=0.09899494936611666 2023-06-25 22:15:54,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2081706.0, ans=0.1 2023-06-25 22:15:54,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2081706.0, ans=0.05 2023-06-25 22:15:59,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2081766.0, ans=0.125 2023-06-25 22:16:16,282 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-25 22:17:10,040 INFO [train.py:996] (0/4) Epoch 12, batch 11550, loss[loss=0.2575, simple_loss=0.3556, pruned_loss=0.07968, over 21768.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3127, pruned_loss=0.07764, over 4279788.50 frames. ], batch size: 332, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:17:30,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.895e+02 1.016e+03 1.404e+03 2.173e+03 5.477e+03, threshold=2.808e+03, percent-clipped=17.0 2023-06-25 22:18:28,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2082126.0, ans=0.125 2023-06-25 22:19:09,012 INFO [train.py:996] (0/4) Epoch 12, batch 11600, loss[loss=0.2531, simple_loss=0.3375, pruned_loss=0.08434, over 21281.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3273, pruned_loss=0.07955, over 4274418.55 frames. ], batch size: 143, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:19:34,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2082306.0, ans=0.125 2023-06-25 22:19:36,666 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=15.0 2023-06-25 22:19:59,286 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:20:03,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2082366.0, ans=0.2 2023-06-25 22:20:06,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2082366.0, ans=0.125 2023-06-25 22:20:08,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2082366.0, ans=0.1 2023-06-25 22:20:58,725 INFO [train.py:996] (0/4) Epoch 12, batch 11650, loss[loss=0.2507, simple_loss=0.3271, pruned_loss=0.08711, over 21235.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3348, pruned_loss=0.08023, over 4273544.21 frames. 
], batch size: 159, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:21:17,688 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.996e+02 1.003e+03 1.550e+03 2.140e+03 4.406e+03, threshold=3.101e+03, percent-clipped=13.0 2023-06-25 22:22:16,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2082726.0, ans=0.125 2023-06-25 22:22:18,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2082726.0, ans=0.1 2023-06-25 22:22:39,539 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:22:47,747 INFO [train.py:996] (0/4) Epoch 12, batch 11700, loss[loss=0.2085, simple_loss=0.2788, pruned_loss=0.06912, over 21844.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3281, pruned_loss=0.08001, over 4273006.14 frames. ], batch size: 102, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:23:46,677 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-06-25 22:23:51,523 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:24:15,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2083086.0, ans=0.0 2023-06-25 22:24:38,041 INFO [train.py:996] (0/4) Epoch 12, batch 11750, loss[loss=0.2405, simple_loss=0.307, pruned_loss=0.08702, over 21506.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3179, pruned_loss=0.07944, over 4271510.85 frames. ], batch size: 132, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:24:57,454 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.375e+02 1.055e+03 1.810e+03 2.598e+03 5.314e+03, threshold=3.620e+03, percent-clipped=16.0 2023-06-25 22:25:40,950 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-25 22:25:43,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2083326.0, ans=0.0 2023-06-25 22:25:57,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2083326.0, ans=0.2 2023-06-25 22:26:26,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2083446.0, ans=0.2 2023-06-25 22:26:32,984 INFO [train.py:996] (0/4) Epoch 12, batch 11800, loss[loss=0.259, simple_loss=0.3727, pruned_loss=0.07266, over 19721.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3191, pruned_loss=0.08119, over 4274954.05 frames. ], batch size: 703, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:27:26,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2083566.0, ans=0.125 2023-06-25 22:27:31,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2083626.0, ans=0.1 2023-06-25 22:28:11,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2083686.0, ans=0.0 2023-06-25 22:28:23,938 INFO [train.py:996] (0/4) Epoch 12, batch 11850, loss[loss=0.2376, simple_loss=0.3313, pruned_loss=0.07191, over 21666.00 frames. 
], tot_loss[loss=0.2401, simple_loss=0.3191, pruned_loss=0.0806, over 4276498.09 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:28:24,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2083746.0, ans=0.0 2023-06-25 22:28:37,196 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.904e+02 9.109e+02 1.402e+03 2.284e+03 4.807e+03, threshold=2.803e+03, percent-clipped=4.0 2023-06-25 22:29:34,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2083926.0, ans=0.04949747468305833 2023-06-25 22:29:37,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2083926.0, ans=0.1 2023-06-25 22:29:37,459 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:30:12,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2084046.0, ans=0.125 2023-06-25 22:30:13,549 INFO [train.py:996] (0/4) Epoch 12, batch 11900, loss[loss=0.2053, simple_loss=0.2935, pruned_loss=0.05853, over 21745.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3179, pruned_loss=0.07806, over 4279797.14 frames. ], batch size: 316, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:31:20,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2084226.0, ans=0.0 2023-06-25 22:31:38,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2084226.0, ans=0.1 2023-06-25 22:31:47,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2084286.0, ans=0.2 2023-06-25 22:31:58,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2084286.0, ans=0.125 2023-06-25 22:32:05,347 INFO [train.py:996] (0/4) Epoch 12, batch 11950, loss[loss=0.2044, simple_loss=0.3325, pruned_loss=0.03818, over 20780.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3172, pruned_loss=0.07447, over 4269764.08 frames. ], batch size: 607, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:32:24,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.025e+02 8.288e+02 1.263e+03 1.955e+03 4.768e+03, threshold=2.526e+03, percent-clipped=7.0 2023-06-25 22:32:28,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2084406.0, ans=0.125 2023-06-25 22:32:40,195 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-06-25 22:32:44,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2084466.0, ans=0.125 2023-06-25 22:33:02,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2084466.0, ans=0.2 2023-06-25 22:33:54,632 INFO [train.py:996] (0/4) Epoch 12, batch 12000, loss[loss=0.1961, simple_loss=0.2628, pruned_loss=0.06468, over 21636.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3098, pruned_loss=0.07214, over 4270385.69 frames. 
], batch size: 298, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 22:33:54,634 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 22:34:09,131 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7525, 3.2403, 3.1127, 4.2850], device='cuda:0') 2023-06-25 22:34:13,347 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.3007, 1.3350, 1.6727, 1.3023, 1.1158, 1.6254, 1.7136, 1.1977], device='cuda:0') 2023-06-25 22:34:15,494 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.0190, 3.2937, 2.4358, 3.6210, 2.1637, 3.4964, 3.0009, 2.6401], device='cuda:0') 2023-06-25 22:34:17,924 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2583, simple_loss=0.3504, pruned_loss=0.08306, over 1796401.00 frames. 2023-06-25 22:34:17,925 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-25 22:34:24,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2084646.0, ans=0.0 2023-06-25 22:34:35,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2084706.0, ans=0.07 2023-06-25 22:36:03,703 INFO [train.py:996] (0/4) Epoch 12, batch 12050, loss[loss=0.25, simple_loss=0.3732, pruned_loss=0.06338, over 19830.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3064, pruned_loss=0.0738, over 4268038.79 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:36:06,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2084946.0, ans=0.0 2023-06-25 22:36:18,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2084946.0, ans=0.125 2023-06-25 22:36:18,977 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.769e+02 8.043e+02 1.290e+03 1.841e+03 5.321e+03, threshold=2.580e+03, percent-clipped=14.0 2023-06-25 22:36:19,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2085006.0, ans=0.125 2023-06-25 22:37:03,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2085066.0, ans=0.0 2023-06-25 22:37:07,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2085126.0, ans=0.125 2023-06-25 22:37:39,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5 2023-06-25 22:37:44,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-06-25 22:37:47,018 INFO [train.py:996] (0/4) Epoch 12, batch 12100, loss[loss=0.219, simple_loss=0.2984, pruned_loss=0.06978, over 21630.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3116, pruned_loss=0.07798, over 4272235.31 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:37:49,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. 
limit=15.0 2023-06-25 22:37:54,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2085246.0, ans=0.2 2023-06-25 22:38:41,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2085366.0, ans=0.035 2023-06-25 22:38:54,998 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.14 vs. limit=15.0 2023-06-25 22:39:10,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2085426.0, ans=0.125 2023-06-25 22:39:10,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2085426.0, ans=0.125 2023-06-25 22:39:33,607 INFO [train.py:996] (0/4) Epoch 12, batch 12150, loss[loss=0.2272, simple_loss=0.3234, pruned_loss=0.06545, over 20714.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3166, pruned_loss=0.07832, over 4267851.00 frames. ], batch size: 607, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:40:00,402 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.100e+02 9.375e+02 1.398e+03 2.052e+03 3.851e+03, threshold=2.797e+03, percent-clipped=12.0 2023-06-25 22:40:23,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2085666.0, ans=0.125 2023-06-25 22:40:32,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2085666.0, ans=0.0 2023-06-25 22:40:46,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2023-06-25 22:40:47,603 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-25 22:41:04,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2085786.0, ans=0.0 2023-06-25 22:41:26,878 INFO [train.py:996] (0/4) Epoch 12, batch 12200, loss[loss=0.1939, simple_loss=0.2565, pruned_loss=0.0657, over 21174.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3141, pruned_loss=0.07765, over 4272999.61 frames. 
], batch size: 548, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:41:30,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2085846.0, ans=0.125 2023-06-25 22:42:10,820 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:42:33,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2086026.0, ans=0.2 2023-06-25 22:42:53,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2086086.0, ans=0.2 2023-06-25 22:42:55,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2086086.0, ans=0.125 2023-06-25 22:42:59,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2086086.0, ans=0.125 2023-06-25 22:43:08,626 INFO [train.py:996] (0/4) Epoch 12, batch 12250, loss[loss=0.1862, simple_loss=0.2683, pruned_loss=0.05204, over 21725.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3047, pruned_loss=0.07414, over 4258579.96 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:43:24,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.778e+02 9.164e+02 1.305e+03 1.876e+03 4.319e+03, threshold=2.609e+03, percent-clipped=7.0 2023-06-25 22:44:06,277 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:44:22,302 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.06 vs. limit=22.5 2023-06-25 22:44:36,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2086386.0, ans=10.0 2023-06-25 22:44:56,875 INFO [train.py:996] (0/4) Epoch 12, batch 12300, loss[loss=0.2763, simple_loss=0.3601, pruned_loss=0.09628, over 21534.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2986, pruned_loss=0.06955, over 4253695.67 frames. ], batch size: 471, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:45:26,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2086506.0, ans=0.125 2023-06-25 22:46:04,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2086626.0, ans=0.0 2023-06-25 22:46:22,016 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:46:23,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2086686.0, ans=0.125 2023-06-25 22:46:39,964 INFO [train.py:996] (0/4) Epoch 12, batch 12350, loss[loss=0.2704, simple_loss=0.3482, pruned_loss=0.09629, over 21476.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3019, pruned_loss=0.06974, over 4255267.68 frames. 
], batch size: 548, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:47:02,271 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.943e+02 9.460e+02 1.568e+03 2.170e+03 4.986e+03, threshold=3.136e+03, percent-clipped=17.0 2023-06-25 22:47:12,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2086806.0, ans=0.025 2023-06-25 22:47:41,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-06-25 22:48:33,555 INFO [train.py:996] (0/4) Epoch 12, batch 12400, loss[loss=0.2323, simple_loss=0.3002, pruned_loss=0.08226, over 21854.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3051, pruned_loss=0.07248, over 4264206.39 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:48:54,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2087046.0, ans=0.125 2023-06-25 22:49:47,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2087226.0, ans=0.125 2023-06-25 22:50:21,719 INFO [train.py:996] (0/4) Epoch 12, batch 12450, loss[loss=0.286, simple_loss=0.3573, pruned_loss=0.1074, over 21579.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3098, pruned_loss=0.07619, over 4269494.66 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:50:44,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2087406.0, ans=0.0 2023-06-25 22:50:45,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.188e+02 9.401e+02 1.561e+03 2.264e+03 4.297e+03, threshold=3.121e+03, percent-clipped=10.0 2023-06-25 22:51:03,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2087406.0, ans=0.2 2023-06-25 22:52:09,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2087586.0, ans=0.125 2023-06-25 22:52:16,016 INFO [train.py:996] (0/4) Epoch 12, batch 12500, loss[loss=0.2882, simple_loss=0.3876, pruned_loss=0.09443, over 21890.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3192, pruned_loss=0.07901, over 4269897.92 frames. ], batch size: 372, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:52:41,341 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.78 vs. limit=15.0 2023-06-25 22:53:12,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-25 22:53:53,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=15.0 2023-06-25 22:53:56,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2087886.0, ans=0.2 2023-06-25 22:54:06,700 INFO [train.py:996] (0/4) Epoch 12, batch 12550, loss[loss=0.2686, simple_loss=0.3399, pruned_loss=0.09862, over 21433.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3249, pruned_loss=0.08139, over 4268512.07 frames. 
], batch size: 471, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:54:07,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2087946.0, ans=0.0 2023-06-25 22:54:27,538 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-348000.pt 2023-06-25 22:54:32,466 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.421e+02 9.265e+02 1.193e+03 1.894e+03 3.118e+03, threshold=2.386e+03, percent-clipped=0.0 2023-06-25 22:54:47,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2088006.0, ans=0.125 2023-06-25 22:56:05,542 INFO [train.py:996] (0/4) Epoch 12, batch 12600, loss[loss=0.2163, simple_loss=0.2994, pruned_loss=0.06661, over 21393.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3239, pruned_loss=0.07935, over 4264020.95 frames. ], batch size: 211, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:56:22,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2088306.0, ans=0.1 2023-06-25 22:56:42,798 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.50 vs. limit=10.0 2023-06-25 22:56:55,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2088366.0, ans=0.09899494936611666 2023-06-25 22:56:59,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2088366.0, ans=0.2 2023-06-25 22:57:06,461 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-25 22:57:38,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2088486.0, ans=0.0 2023-06-25 22:57:51,804 INFO [train.py:996] (0/4) Epoch 12, batch 12650, loss[loss=0.2218, simple_loss=0.2912, pruned_loss=0.07615, over 21473.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3155, pruned_loss=0.07571, over 4260083.34 frames. ], batch size: 194, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:58:08,916 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.676e+02 8.577e+02 1.170e+03 1.835e+03 4.573e+03, threshold=2.341e+03, percent-clipped=16.0 2023-06-25 22:58:10,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2088606.0, ans=0.04949747468305833 2023-06-25 22:58:25,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-25 22:58:50,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.47 vs. limit=15.0 2023-06-25 22:59:29,317 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:59:40,871 INFO [train.py:996] (0/4) Epoch 12, batch 12700, loss[loss=0.2859, simple_loss=0.3561, pruned_loss=0.1079, over 21896.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3133, pruned_loss=0.07751, over 4265278.96 frames. 
], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:59:41,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2088846.0, ans=0.125 2023-06-25 22:59:45,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2088846.0, ans=0.125 2023-06-25 22:59:48,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2088846.0, ans=0.0 2023-06-25 22:59:50,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2088846.0, ans=0.125 2023-06-25 23:00:12,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2088906.0, ans=0.2 2023-06-25 23:00:27,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2088966.0, ans=0.125 2023-06-25 23:01:30,789 INFO [train.py:996] (0/4) Epoch 12, batch 12750, loss[loss=0.1993, simple_loss=0.2907, pruned_loss=0.05388, over 21763.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3148, pruned_loss=0.07778, over 4267446.31 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:01:51,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.914e+02 9.146e+02 1.295e+03 2.167e+03 3.972e+03, threshold=2.590e+03, percent-clipped=20.0 2023-06-25 23:01:55,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2089206.0, ans=0.0 2023-06-25 23:02:47,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2089326.0, ans=0.07 2023-06-25 23:02:58,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2089326.0, ans=0.1 2023-06-25 23:03:18,695 INFO [train.py:996] (0/4) Epoch 12, batch 12800, loss[loss=0.2806, simple_loss=0.3441, pruned_loss=0.1086, over 21799.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3136, pruned_loss=0.07834, over 4273316.72 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 23:03:32,550 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.49 vs. limit=10.0 2023-06-25 23:03:34,074 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.65 vs. limit=15.0 2023-06-25 23:03:41,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2089506.0, ans=0.0 2023-06-25 23:04:00,185 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.22 vs. 
limit=22.5 2023-06-25 23:04:28,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2089626.0, ans=0.0 2023-06-25 23:04:44,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2089626.0, ans=0.125 2023-06-25 23:04:48,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2089626.0, ans=0.125 2023-06-25 23:04:53,862 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:04:56,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-25 23:05:13,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2089686.0, ans=0.1 2023-06-25 23:05:15,809 INFO [train.py:996] (0/4) Epoch 12, batch 12850, loss[loss=0.265, simple_loss=0.357, pruned_loss=0.08652, over 21502.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3151, pruned_loss=0.08005, over 4275061.25 frames. ], batch size: 508, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:05:41,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.120e+02 9.253e+02 1.228e+03 1.682e+03 4.174e+03, threshold=2.456e+03, percent-clipped=9.0 2023-06-25 23:06:33,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5 2023-06-25 23:06:36,707 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-25 23:06:51,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2089986.0, ans=0.125 2023-06-25 23:07:02,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2089986.0, ans=0.1 2023-06-25 23:07:15,434 INFO [train.py:996] (0/4) Epoch 12, batch 12900, loss[loss=0.2808, simple_loss=0.3619, pruned_loss=0.09984, over 21573.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3115, pruned_loss=0.07618, over 4276412.18 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:07:50,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-06-25 23:08:57,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2090286.0, ans=0.0 2023-06-25 23:09:05,328 INFO [train.py:996] (0/4) Epoch 12, batch 12950, loss[loss=0.2032, simple_loss=0.2807, pruned_loss=0.06279, over 20186.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3086, pruned_loss=0.07393, over 4272354.23 frames. 
], batch size: 703, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:09:09,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2090346.0, ans=0.125 2023-06-25 23:09:30,341 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.722e+02 9.455e+02 1.379e+03 2.078e+03 4.999e+03, threshold=2.758e+03, percent-clipped=14.0 2023-06-25 23:10:17,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2090526.0, ans=0.2 2023-06-25 23:11:00,074 INFO [train.py:996] (0/4) Epoch 12, batch 13000, loss[loss=0.254, simple_loss=0.3315, pruned_loss=0.08823, over 20628.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3096, pruned_loss=0.07484, over 4270823.72 frames. ], batch size: 607, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:11:00,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2090646.0, ans=0.125 2023-06-25 23:12:03,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2090826.0, ans=0.125 2023-06-25 23:12:41,972 INFO [train.py:996] (0/4) Epoch 12, batch 13050, loss[loss=0.2692, simple_loss=0.3295, pruned_loss=0.1045, over 21803.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3052, pruned_loss=0.07274, over 4265969.10 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:12:43,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=22.5 2023-06-25 23:13:05,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2091006.0, ans=0.125 2023-06-25 23:13:06,376 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.723e+02 9.569e+02 1.242e+03 1.863e+03 3.264e+03, threshold=2.484e+03, percent-clipped=2.0 2023-06-25 23:13:13,254 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.86 vs. limit=22.5 2023-06-25 23:13:43,796 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-06-25 23:13:50,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2091126.0, ans=0.125 2023-06-25 23:13:52,562 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-25 23:14:37,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2091246.0, ans=0.125 2023-06-25 23:14:38,560 INFO [train.py:996] (0/4) Epoch 12, batch 13100, loss[loss=0.2317, simple_loss=0.3113, pruned_loss=0.07607, over 21657.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3067, pruned_loss=0.07309, over 4274641.35 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:15:19,161 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.07 vs. 
limit=15.0 2023-06-25 23:15:58,465 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-06-25 23:16:29,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2091546.0, ans=0.125 2023-06-25 23:16:31,087 INFO [train.py:996] (0/4) Epoch 12, batch 13150, loss[loss=0.2557, simple_loss=0.3263, pruned_loss=0.09257, over 21786.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3102, pruned_loss=0.07635, over 4273114.10 frames. ], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:16:40,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2091546.0, ans=0.0 2023-06-25 23:16:45,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2091546.0, ans=0.0 2023-06-25 23:16:48,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2091546.0, ans=0.2 2023-06-25 23:16:55,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.201e+02 8.745e+02 1.410e+03 2.124e+03 5.219e+03, threshold=2.820e+03, percent-clipped=16.0 2023-06-25 23:17:21,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2091666.0, ans=0.0 2023-06-25 23:17:36,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2091666.0, ans=0.125 2023-06-25 23:18:09,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2091786.0, ans=0.0 2023-06-25 23:18:28,241 INFO [train.py:996] (0/4) Epoch 12, batch 13200, loss[loss=0.2548, simple_loss=0.3167, pruned_loss=0.0964, over 21276.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3086, pruned_loss=0.07597, over 4274690.47 frames. ], batch size: 176, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 23:18:39,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2091846.0, ans=0.125 2023-06-25 23:19:56,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2092086.0, ans=0.125 2023-06-25 23:20:16,900 INFO [train.py:996] (0/4) Epoch 12, batch 13250, loss[loss=0.2183, simple_loss=0.2903, pruned_loss=0.0732, over 21769.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3103, pruned_loss=0.07887, over 4281771.43 frames. ], batch size: 282, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:20:30,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2092146.0, ans=10.0 2023-06-25 23:20:39,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.74 vs. 
limit=15.0 2023-06-25 23:20:45,171 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.712e+02 9.106e+02 1.656e+03 2.561e+03 5.361e+03, threshold=3.313e+03, percent-clipped=20.0 2023-06-25 23:20:49,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2092206.0, ans=0.125 2023-06-25 23:21:28,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2092326.0, ans=0.125 2023-06-25 23:22:12,988 INFO [train.py:996] (0/4) Epoch 12, batch 13300, loss[loss=0.2338, simple_loss=0.315, pruned_loss=0.07628, over 21493.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3136, pruned_loss=0.07846, over 4286691.82 frames. ], batch size: 211, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:23:00,735 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=12.0 2023-06-25 23:23:34,364 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.96 vs. limit=15.0 2023-06-25 23:24:01,794 INFO [train.py:996] (0/4) Epoch 12, batch 13350, loss[loss=0.2403, simple_loss=0.326, pruned_loss=0.07728, over 20671.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.318, pruned_loss=0.08081, over 4288314.09 frames. ], batch size: 607, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:24:06,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2092746.0, ans=0.2 2023-06-25 23:24:24,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=2092746.0, ans=0.05 2023-06-25 23:24:37,316 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.381e+02 8.931e+02 1.432e+03 2.029e+03 4.000e+03, threshold=2.864e+03, percent-clipped=9.0 2023-06-25 23:24:41,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2092806.0, ans=0.125 2023-06-25 23:25:43,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2092986.0, ans=0.0 2023-06-25 23:25:51,917 INFO [train.py:996] (0/4) Epoch 12, batch 13400, loss[loss=0.2845, simple_loss=0.343, pruned_loss=0.113, over 21522.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3193, pruned_loss=0.08279, over 4284133.20 frames. ], batch size: 507, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:27:02,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2093226.0, ans=0.125 2023-06-25 23:27:28,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2093286.0, ans=0.125 2023-06-25 23:27:50,039 INFO [train.py:996] (0/4) Epoch 12, batch 13450, loss[loss=0.2284, simple_loss=0.287, pruned_loss=0.08491, over 17220.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3204, pruned_loss=0.0851, over 4279573.15 frames. 
], batch size: 60, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:27:50,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2093346.0, ans=0.125 2023-06-25 23:28:18,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.861e+02 9.284e+02 1.216e+03 1.797e+03 3.595e+03, threshold=2.431e+03, percent-clipped=4.0 2023-06-25 23:29:25,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2093586.0, ans=0.1 2023-06-25 23:29:26,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-25 23:29:47,953 INFO [train.py:996] (0/4) Epoch 12, batch 13500, loss[loss=0.145, simple_loss=0.1886, pruned_loss=0.05068, over 15779.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3105, pruned_loss=0.08115, over 4276499.66 frames. ], batch size: 61, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:29:52,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2093646.0, ans=0.1 2023-06-25 23:30:38,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0 2023-06-25 23:30:38,891 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.92 vs. limit=22.5 2023-06-25 23:30:47,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2093766.0, ans=0.125 2023-06-25 23:30:47,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2093766.0, ans=0.125 2023-06-25 23:30:51,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-25 23:31:21,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2093886.0, ans=0.1 2023-06-25 23:31:38,024 INFO [train.py:996] (0/4) Epoch 12, batch 13550, loss[loss=0.242, simple_loss=0.3416, pruned_loss=0.07121, over 21617.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3144, pruned_loss=0.07984, over 4279248.58 frames. 
], batch size: 230, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:32:07,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.160e+02 9.861e+02 1.411e+03 2.332e+03 4.219e+03, threshold=2.822e+03, percent-clipped=19.0 2023-06-25 23:32:12,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2094006.0, ans=0.0 2023-06-25 23:32:23,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2094066.0, ans=0.2 2023-06-25 23:32:34,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2094066.0, ans=0.2 2023-06-25 23:33:23,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2094186.0, ans=0.5 2023-06-25 23:33:27,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2094186.0, ans=0.0 2023-06-25 23:33:29,584 INFO [train.py:996] (0/4) Epoch 12, batch 13600, loss[loss=0.2638, simple_loss=0.3327, pruned_loss=0.09741, over 21759.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3171, pruned_loss=0.08074, over 4285955.25 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:33:56,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2094306.0, ans=0.04949747468305833 2023-06-25 23:34:32,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2094426.0, ans=0.025 2023-06-25 23:35:20,572 INFO [train.py:996] (0/4) Epoch 12, batch 13650, loss[loss=0.2009, simple_loss=0.2675, pruned_loss=0.06713, over 21687.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3108, pruned_loss=0.077, over 4270515.45 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:35:28,363 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:35:43,299 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.98 vs. limit=22.5 2023-06-25 23:35:50,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.803e+02 6.699e+02 9.977e+02 1.659e+03 4.040e+03, threshold=1.995e+03, percent-clipped=8.0 2023-06-25 23:36:37,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.65 vs. limit=22.5 2023-06-25 23:36:40,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2094726.0, ans=0.0 2023-06-25 23:37:04,054 INFO [train.py:996] (0/4) Epoch 12, batch 13700, loss[loss=0.2176, simple_loss=0.3013, pruned_loss=0.06698, over 21646.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3038, pruned_loss=0.07658, over 4258947.65 frames. 
], batch size: 414, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:37:33,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2094906.0, ans=0.1 2023-06-25 23:37:52,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2094966.0, ans=0.0 2023-06-25 23:38:27,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2095026.0, ans=0.125 2023-06-25 23:39:04,381 INFO [train.py:996] (0/4) Epoch 12, batch 13750, loss[loss=0.2117, simple_loss=0.2838, pruned_loss=0.06986, over 21560.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.304, pruned_loss=0.07692, over 4261911.02 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:39:15,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2095146.0, ans=0.0 2023-06-25 23:39:33,878 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.606e+02 1.009e+03 1.585e+03 2.865e+03 5.412e+03, threshold=3.169e+03, percent-clipped=34.0 2023-06-25 23:39:38,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2095206.0, ans=0.125 2023-06-25 23:40:04,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2095266.0, ans=0.0 2023-06-25 23:40:08,365 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. limit=10.0 2023-06-25 23:40:19,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-25 23:40:31,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2095386.0, ans=0.125 2023-06-25 23:40:39,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.44 vs. limit=12.0 2023-06-25 23:40:40,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2095386.0, ans=0.0 2023-06-25 23:40:54,926 INFO [train.py:996] (0/4) Epoch 12, batch 13800, loss[loss=0.2098, simple_loss=0.2858, pruned_loss=0.0669, over 21077.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3084, pruned_loss=0.07484, over 4262759.46 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:41:07,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.82 vs. limit=15.0 2023-06-25 23:41:16,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2095506.0, ans=0.125 2023-06-25 23:41:39,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2095506.0, ans=0.125 2023-06-25 23:41:49,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2095566.0, ans=0.2 2023-06-25 23:42:52,745 INFO [train.py:996] (0/4) Epoch 12, batch 13850, loss[loss=0.2652, simple_loss=0.3395, pruned_loss=0.09543, over 21584.00 frames. 
], tot_loss[loss=0.2315, simple_loss=0.3132, pruned_loss=0.07492, over 4268310.74 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:42:58,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2095746.0, ans=0.0 2023-06-25 23:43:28,959 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.539e+02 1.083e+03 1.524e+03 2.080e+03 5.261e+03, threshold=3.047e+03, percent-clipped=9.0 2023-06-25 23:43:29,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2095806.0, ans=0.0 2023-06-25 23:43:34,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2095806.0, ans=0.125 2023-06-25 23:43:48,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2095866.0, ans=0.1 2023-06-25 23:44:27,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2095986.0, ans=0.125 2023-06-25 23:44:34,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2095986.0, ans=0.0 2023-06-25 23:44:39,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-25 23:44:49,846 INFO [train.py:996] (0/4) Epoch 12, batch 13900, loss[loss=0.2748, simple_loss=0.3371, pruned_loss=0.1063, over 21550.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3187, pruned_loss=0.07935, over 4273636.10 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:45:55,950 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-25 23:46:38,603 INFO [train.py:996] (0/4) Epoch 12, batch 13950, loss[loss=0.2261, simple_loss=0.3006, pruned_loss=0.07585, over 21781.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3185, pruned_loss=0.08181, over 4283521.91 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:47:08,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.165e+02 9.219e+02 1.176e+03 2.067e+03 4.872e+03, threshold=2.352e+03, percent-clipped=8.0 2023-06-25 23:47:08,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2096406.0, ans=0.0 2023-06-25 23:48:25,191 INFO [train.py:996] (0/4) Epoch 12, batch 14000, loss[loss=0.1818, simple_loss=0.2675, pruned_loss=0.04803, over 21784.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3145, pruned_loss=0.07981, over 4280702.00 frames. 
], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:48:29,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2096646.0, ans=0.1 2023-06-25 23:48:37,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2096646.0, ans=0.0 2023-06-25 23:48:50,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2096706.0, ans=0.0 2023-06-25 23:48:53,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2096706.0, ans=0.0 2023-06-25 23:49:00,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2096706.0, ans=0.0 2023-06-25 23:49:03,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2096766.0, ans=0.0 2023-06-25 23:49:07,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.02 vs. limit=15.0 2023-06-25 23:49:16,161 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-25 23:50:04,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-06-25 23:50:07,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2096886.0, ans=0.025 2023-06-25 23:50:12,359 INFO [train.py:996] (0/4) Epoch 12, batch 14050, loss[loss=0.1803, simple_loss=0.2584, pruned_loss=0.05115, over 21702.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3088, pruned_loss=0.07586, over 4276250.79 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:50:27,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2096946.0, ans=0.125 2023-06-25 23:50:40,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2097006.0, ans=0.0 2023-06-25 23:50:41,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.215e+02 8.595e+02 1.187e+03 1.606e+03 3.647e+03, threshold=2.374e+03, percent-clipped=9.0 2023-06-25 23:51:40,346 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:52:05,849 INFO [train.py:996] (0/4) Epoch 12, batch 14100, loss[loss=0.219, simple_loss=0.2756, pruned_loss=0.08119, over 21831.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3028, pruned_loss=0.07613, over 4274476.20 frames. 
], batch size: 98, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:52:55,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=2097426.0, ans=10.0 2023-06-25 23:52:57,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2097426.0, ans=0.125 2023-06-25 23:53:14,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2097426.0, ans=0.125 2023-06-25 23:53:26,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2097486.0, ans=0.2 2023-06-25 23:53:40,236 INFO [train.py:996] (0/4) Epoch 12, batch 14150, loss[loss=0.2304, simple_loss=0.3137, pruned_loss=0.07353, over 21359.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3061, pruned_loss=0.0764, over 4258160.55 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:53:50,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-25 23:53:51,969 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-06-25 23:54:15,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2097606.0, ans=0.125 2023-06-25 23:54:17,220 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.654e+02 9.098e+02 1.301e+03 1.898e+03 3.994e+03, threshold=2.602e+03, percent-clipped=15.0 2023-06-25 23:54:25,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2097666.0, ans=0.2 2023-06-25 23:55:19,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2097786.0, ans=0.0 2023-06-25 23:55:25,939 INFO [train.py:996] (0/4) Epoch 12, batch 14200, loss[loss=0.2216, simple_loss=0.2844, pruned_loss=0.07938, over 21634.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3066, pruned_loss=0.07538, over 4257108.33 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:56:10,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2097906.0, ans=0.0 2023-06-25 23:56:37,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2098026.0, ans=0.125 2023-06-25 23:56:46,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2098086.0, ans=0.2 2023-06-25 23:57:06,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2098086.0, ans=0.025 2023-06-25 23:57:06,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2098086.0, ans=0.125 2023-06-25 23:57:10,460 INFO [train.py:996] (0/4) Epoch 12, batch 14250, loss[loss=0.1956, simple_loss=0.2553, pruned_loss=0.06793, over 21480.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3014, pruned_loss=0.07579, over 4266449.80 frames. 
], batch size: 195, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:57:53,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.950e+02 7.589e+02 1.025e+03 1.608e+03 3.154e+03, threshold=2.050e+03, percent-clipped=3.0 2023-06-25 23:58:19,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2098326.0, ans=0.035 2023-06-25 23:59:06,533 INFO [train.py:996] (0/4) Epoch 12, batch 14300, loss[loss=0.2264, simple_loss=0.3456, pruned_loss=0.0536, over 19644.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3039, pruned_loss=0.07452, over 4263195.84 frames. ], batch size: 702, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:59:40,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2098506.0, ans=0.125 2023-06-25 23:59:48,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2098506.0, ans=0.0 2023-06-26 00:00:23,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2098626.0, ans=0.1 2023-06-26 00:00:37,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2098686.0, ans=0.125 2023-06-26 00:00:41,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2098686.0, ans=0.125 2023-06-26 00:00:43,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2098686.0, ans=0.0 2023-06-26 00:00:57,838 INFO [train.py:996] (0/4) Epoch 12, batch 14350, loss[loss=0.2325, simple_loss=0.321, pruned_loss=0.07201, over 21740.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3108, pruned_loss=0.0749, over 4254433.64 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:01:21,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-26 00:01:31,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-26 00:01:35,778 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.907e+02 9.524e+02 1.577e+03 2.595e+03 6.111e+03, threshold=3.154e+03, percent-clipped=35.0 2023-06-26 00:01:56,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-26 00:02:22,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2098926.0, ans=0.1 2023-06-26 00:02:23,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2098986.0, ans=0.125 2023-06-26 00:02:52,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2099046.0, ans=0.125 2023-06-26 00:02:53,118 INFO [train.py:996] (0/4) Epoch 12, batch 14400, loss[loss=0.2388, simple_loss=0.3002, pruned_loss=0.08872, over 21480.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3084, pruned_loss=0.07603, over 4263496.16 frames. 
], batch size: 195, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:02:57,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-06-26 00:03:07,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2099046.0, ans=0.5 2023-06-26 00:03:07,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2099046.0, ans=0.125 2023-06-26 00:03:20,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2099106.0, ans=0.1 2023-06-26 00:03:22,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2099106.0, ans=0.0 2023-06-26 00:03:42,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2099166.0, ans=0.2 2023-06-26 00:04:38,677 INFO [train.py:996] (0/4) Epoch 12, batch 14450, loss[loss=0.2134, simple_loss=0.279, pruned_loss=0.07394, over 21289.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3012, pruned_loss=0.07581, over 4263221.49 frames. ], batch size: 177, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:04:42,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2099346.0, ans=0.125 2023-06-26 00:04:49,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2099346.0, ans=0.125 2023-06-26 00:05:10,382 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.479e+02 8.096e+02 1.121e+03 1.806e+03 4.464e+03, threshold=2.243e+03, percent-clipped=9.0 2023-06-26 00:05:14,974 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-26 00:05:15,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-26 00:05:26,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2099466.0, ans=10.0 2023-06-26 00:06:00,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2099586.0, ans=0.125 2023-06-26 00:06:17,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2099586.0, ans=0.125 2023-06-26 00:06:30,332 INFO [train.py:996] (0/4) Epoch 12, batch 14500, loss[loss=0.2211, simple_loss=0.3045, pruned_loss=0.06881, over 21583.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2977, pruned_loss=0.07575, over 4267207.41 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:06:42,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2099646.0, ans=0.125 2023-06-26 00:07:11,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2099766.0, ans=0.2 2023-06-26 00:07:17,309 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. 
limit=22.5 2023-06-26 00:08:09,976 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=12.0 2023-06-26 00:08:26,789 INFO [train.py:996] (0/4) Epoch 12, batch 14550, loss[loss=0.2609, simple_loss=0.3381, pruned_loss=0.09182, over 21689.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.301, pruned_loss=0.07612, over 4265933.29 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:08:33,312 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-26 00:08:52,857 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.442e+02 9.003e+02 1.476e+03 2.430e+03 4.476e+03, threshold=2.953e+03, percent-clipped=26.0 2023-06-26 00:10:10,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2100186.0, ans=0.2 2023-06-26 00:10:16,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2100246.0, ans=0.125 2023-06-26 00:10:16,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2100246.0, ans=0.125 2023-06-26 00:10:17,533 INFO [train.py:996] (0/4) Epoch 12, batch 14600, loss[loss=0.2762, simple_loss=0.3494, pruned_loss=0.1015, over 21424.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3091, pruned_loss=0.0798, over 4264093.82 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:10:21,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2100246.0, ans=0.07 2023-06-26 00:10:37,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2100306.0, ans=0.035 2023-06-26 00:10:46,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2100306.0, ans=0.125 2023-06-26 00:10:58,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2100366.0, ans=0.07 2023-06-26 00:10:58,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2100366.0, ans=0.125 2023-06-26 00:11:30,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2100426.0, ans=0.125 2023-06-26 00:11:35,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2100486.0, ans=0.2 2023-06-26 00:12:06,926 INFO [train.py:996] (0/4) Epoch 12, batch 14650, loss[loss=0.1835, simple_loss=0.2724, pruned_loss=0.04731, over 21613.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3126, pruned_loss=0.07948, over 4261844.03 frames. 
], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:12:26,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2100606.0, ans=0.125 2023-06-26 00:12:32,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.972e+02 9.038e+02 1.259e+03 1.802e+03 4.365e+03, threshold=2.519e+03, percent-clipped=6.0 2023-06-26 00:12:52,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2100666.0, ans=0.2 2023-06-26 00:13:20,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2100726.0, ans=0.1 2023-06-26 00:13:57,532 INFO [train.py:996] (0/4) Epoch 12, batch 14700, loss[loss=0.204, simple_loss=0.2961, pruned_loss=0.05591, over 21582.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3076, pruned_loss=0.07414, over 4269497.42 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:14:26,880 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-26 00:14:31,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2100966.0, ans=0.125 2023-06-26 00:15:17,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2101026.0, ans=0.0 2023-06-26 00:15:41,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=2101086.0, ans=10.0 2023-06-26 00:15:49,412 INFO [train.py:996] (0/4) Epoch 12, batch 14750, loss[loss=0.3449, simple_loss=0.4181, pruned_loss=0.1358, over 21721.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3146, pruned_loss=0.07751, over 4266078.45 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:16:04,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2101146.0, ans=0.125 2023-06-26 00:16:21,752 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.018e+02 8.627e+02 1.543e+03 2.178e+03 4.695e+03, threshold=3.085e+03, percent-clipped=15.0 2023-06-26 00:16:36,919 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-26 00:16:39,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2101266.0, ans=0.1 2023-06-26 00:16:51,578 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-26 00:17:16,375 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.67 vs. limit=15.0 2023-06-26 00:17:38,198 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.91 vs. limit=15.0 2023-06-26 00:17:40,014 INFO [train.py:996] (0/4) Epoch 12, batch 14800, loss[loss=0.2556, simple_loss=0.3339, pruned_loss=0.08866, over 21256.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3264, pruned_loss=0.08277, over 4270553.49 frames. 
], batch size: 549, lr: 2.42e-03, grad_scale: 32.0 2023-06-26 00:17:40,380 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:17:52,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2101446.0, ans=0.125 2023-06-26 00:17:53,271 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-26 00:18:03,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=2101506.0, ans=0.2 2023-06-26 00:18:35,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=12.0 2023-06-26 00:18:59,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2101626.0, ans=0.0 2023-06-26 00:19:24,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2101686.0, ans=0.2 2023-06-26 00:19:28,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.87 vs. limit=15.0 2023-06-26 00:19:31,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2101746.0, ans=0.0 2023-06-26 00:19:32,063 INFO [train.py:996] (0/4) Epoch 12, batch 14850, loss[loss=0.2111, simple_loss=0.2757, pruned_loss=0.07323, over 21868.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3182, pruned_loss=0.08188, over 4268195.87 frames. ], batch size: 107, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:20:14,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-26 00:20:15,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2101806.0, ans=0.125 2023-06-26 00:20:16,295 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.961e+02 9.941e+02 1.371e+03 2.279e+03 6.206e+03, threshold=2.743e+03, percent-clipped=9.0 2023-06-26 00:20:38,864 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:20:47,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2101926.0, ans=0.125 2023-06-26 00:21:32,277 INFO [train.py:996] (0/4) Epoch 12, batch 14900, loss[loss=0.2545, simple_loss=0.3232, pruned_loss=0.09292, over 21428.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3184, pruned_loss=0.08253, over 4271079.01 frames. 
], batch size: 211, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:22:25,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2102166.0, ans=0.1 2023-06-26 00:22:27,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2102166.0, ans=0.125 2023-06-26 00:23:28,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2102346.0, ans=0.125 2023-06-26 00:23:29,190 INFO [train.py:996] (0/4) Epoch 12, batch 14950, loss[loss=0.217, simple_loss=0.2986, pruned_loss=0.0677, over 21248.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3198, pruned_loss=0.08284, over 4268089.76 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:24:02,319 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.432e+02 8.420e+02 1.194e+03 1.609e+03 3.792e+03, threshold=2.388e+03, percent-clipped=5.0 2023-06-26 00:24:17,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2102466.0, ans=0.125 2023-06-26 00:25:18,801 INFO [train.py:996] (0/4) Epoch 12, batch 15000, loss[loss=0.2351, simple_loss=0.307, pruned_loss=0.08159, over 21490.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3206, pruned_loss=0.08338, over 4272746.41 frames. ], batch size: 548, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:25:18,814 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 00:25:33,332 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.2703, 4.2943, 3.9540, 3.9567], device='cuda:0') 2023-06-26 00:25:42,269 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2582, simple_loss=0.348, pruned_loss=0.08425, over 1796401.00 frames. 2023-06-26 00:25:42,270 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-26 00:25:49,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2102646.0, ans=0.2 2023-06-26 00:26:22,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-26 00:27:11,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2102886.0, ans=0.2 2023-06-26 00:27:29,938 INFO [train.py:996] (0/4) Epoch 12, batch 15050, loss[loss=0.232, simple_loss=0.3126, pruned_loss=0.07571, over 21295.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3227, pruned_loss=0.08495, over 4273363.18 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:27:56,949 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.042e+02 9.488e+02 1.374e+03 1.981e+03 5.080e+03, threshold=2.749e+03, percent-clipped=16.0 2023-06-26 00:28:20,375 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. 
limit=15.0 2023-06-26 00:29:01,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2103186.0, ans=0.125 2023-06-26 00:29:05,229 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:29:16,835 INFO [train.py:996] (0/4) Epoch 12, batch 15100, loss[loss=0.2659, simple_loss=0.3636, pruned_loss=0.08413, over 20688.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3244, pruned_loss=0.08428, over 4272776.50 frames. ], batch size: 608, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:30:06,519 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-26 00:30:13,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2103366.0, ans=0.025 2023-06-26 00:31:08,002 INFO [train.py:996] (0/4) Epoch 12, batch 15150, loss[loss=0.2294, simple_loss=0.2919, pruned_loss=0.08344, over 21740.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3205, pruned_loss=0.08452, over 4275667.12 frames. ], batch size: 102, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:31:12,852 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2023-06-26 00:31:44,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.420e+02 7.677e+02 1.025e+03 1.452e+03 2.770e+03, threshold=2.050e+03, percent-clipped=1.0 2023-06-26 00:31:52,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.12 vs. limit=10.0 2023-06-26 00:31:55,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2103666.0, ans=0.0 2023-06-26 00:32:06,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2103666.0, ans=0.0 2023-06-26 00:32:06,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2103666.0, ans=0.1 2023-06-26 00:32:26,471 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-06-26 00:32:57,464 INFO [train.py:996] (0/4) Epoch 12, batch 15200, loss[loss=0.1712, simple_loss=0.2825, pruned_loss=0.02998, over 20872.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3102, pruned_loss=0.08077, over 4273188.49 frames. ], batch size: 609, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:33:04,486 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:33:15,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=12.0 2023-06-26 00:33:15,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.34 vs. 
limit=22.5 2023-06-26 00:33:25,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2103906.0, ans=0.0 2023-06-26 00:33:51,746 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:34:26,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2104026.0, ans=0.125 2023-06-26 00:34:44,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2104146.0, ans=0.2 2023-06-26 00:34:45,763 INFO [train.py:996] (0/4) Epoch 12, batch 15250, loss[loss=0.2452, simple_loss=0.3106, pruned_loss=0.08994, over 21244.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3043, pruned_loss=0.07933, over 4262651.94 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:35:33,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.968e+02 8.853e+02 1.364e+03 1.967e+03 5.293e+03, threshold=2.727e+03, percent-clipped=20.0 2023-06-26 00:36:25,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2104386.0, ans=0.125 2023-06-26 00:36:25,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2104386.0, ans=0.2 2023-06-26 00:36:30,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2104386.0, ans=0.1 2023-06-26 00:36:35,468 INFO [train.py:996] (0/4) Epoch 12, batch 15300, loss[loss=0.2647, simple_loss=0.3502, pruned_loss=0.08963, over 17005.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3079, pruned_loss=0.08276, over 4263772.57 frames. ], batch size: 60, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:36:36,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2104446.0, ans=0.1 2023-06-26 00:37:07,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2104506.0, ans=0.0 2023-06-26 00:37:17,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2104506.0, ans=0.0 2023-06-26 00:37:44,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2104566.0, ans=0.125 2023-06-26 00:37:55,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2104626.0, ans=0.0 2023-06-26 00:38:07,955 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-26 00:38:20,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2104686.0, ans=0.0 2023-06-26 00:38:22,873 INFO [train.py:996] (0/4) Epoch 12, batch 15350, loss[loss=0.2306, simple_loss=0.3282, pruned_loss=0.06649, over 21844.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3135, pruned_loss=0.08448, over 4267462.95 frames. 
], batch size: 316, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:38:52,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2104806.0, ans=0.125 2023-06-26 00:39:10,398 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.965e+02 8.644e+02 1.167e+03 1.665e+03 4.882e+03, threshold=2.334e+03, percent-clipped=9.0 2023-06-26 00:39:14,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2104866.0, ans=0.125 2023-06-26 00:39:35,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2104926.0, ans=0.125 2023-06-26 00:40:01,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2104986.0, ans=0.125 2023-06-26 00:40:09,360 INFO [train.py:996] (0/4) Epoch 12, batch 15400, loss[loss=0.212, simple_loss=0.2977, pruned_loss=0.06313, over 21806.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3149, pruned_loss=0.0828, over 4262498.62 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:40:15,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2105046.0, ans=0.125 2023-06-26 00:40:59,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2105166.0, ans=0.5 2023-06-26 00:41:58,349 INFO [train.py:996] (0/4) Epoch 12, batch 15450, loss[loss=0.2464, simple_loss=0.3234, pruned_loss=0.08469, over 21770.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3122, pruned_loss=0.08145, over 4249915.58 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:42:40,790 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.072e+02 7.775e+02 1.085e+03 1.677e+03 3.243e+03, threshold=2.170e+03, percent-clipped=5.0 2023-06-26 00:43:47,776 INFO [train.py:996] (0/4) Epoch 12, batch 15500, loss[loss=0.2763, simple_loss=0.3498, pruned_loss=0.1014, over 21769.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3153, pruned_loss=0.08135, over 4259662.32 frames. ], batch size: 124, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:43:53,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2105646.0, ans=0.0 2023-06-26 00:44:17,005 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.20 vs. limit=15.0 2023-06-26 00:45:09,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2105826.0, ans=0.0 2023-06-26 00:45:32,679 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-26 00:45:43,193 INFO [train.py:996] (0/4) Epoch 12, batch 15550, loss[loss=0.1967, simple_loss=0.2793, pruned_loss=0.05708, over 21728.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3146, pruned_loss=0.07928, over 4260733.51 frames. ], batch size: 124, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:45:53,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.17 vs. 
limit=15.0 2023-06-26 00:46:06,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2106006.0, ans=0.0 2023-06-26 00:46:34,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.226e+02 1.052e+03 1.309e+03 1.870e+03 4.327e+03, threshold=2.618e+03, percent-clipped=17.0 2023-06-26 00:46:55,876 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:47:05,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2106126.0, ans=0.125 2023-06-26 00:47:07,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2106126.0, ans=0.1 2023-06-26 00:47:41,701 INFO [train.py:996] (0/4) Epoch 12, batch 15600, loss[loss=0.2058, simple_loss=0.2681, pruned_loss=0.0717, over 21666.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.308, pruned_loss=0.07707, over 4254391.88 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:47:47,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2106246.0, ans=0.125 2023-06-26 00:48:39,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2106366.0, ans=0.125 2023-06-26 00:48:43,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2106426.0, ans=0.1 2023-06-26 00:48:47,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2023-06-26 00:49:00,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2106486.0, ans=0.0 2023-06-26 00:49:12,931 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. limit=10.0 2023-06-26 00:49:20,609 INFO [train.py:996] (0/4) Epoch 12, batch 15650, loss[loss=0.2235, simple_loss=0.2956, pruned_loss=0.07568, over 21624.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.307, pruned_loss=0.07605, over 4262321.20 frames. ], batch size: 298, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:49:31,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2106546.0, ans=0.125 2023-06-26 00:49:32,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. 
limit=15.0 2023-06-26 00:49:37,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2106546.0, ans=0.125 2023-06-26 00:50:07,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.319e+02 9.009e+02 1.257e+03 1.963e+03 4.655e+03, threshold=2.515e+03, percent-clipped=11.0 2023-06-26 00:50:18,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2106666.0, ans=0.125 2023-06-26 00:50:40,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2106726.0, ans=0.125 2023-06-26 00:50:48,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2106786.0, ans=0.125 2023-06-26 00:50:49,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.38 vs. limit=10.0 2023-06-26 00:51:12,665 INFO [train.py:996] (0/4) Epoch 12, batch 15700, loss[loss=0.2271, simple_loss=0.2986, pruned_loss=0.07784, over 21462.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3036, pruned_loss=0.0746, over 4254762.80 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:51:16,964 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:51:23,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2106846.0, ans=0.1 2023-06-26 00:51:23,873 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=15.0 2023-06-26 00:52:00,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2106966.0, ans=0.0 2023-06-26 00:52:39,487 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-26 00:52:54,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2107146.0, ans=0.05 2023-06-26 00:52:55,636 INFO [train.py:996] (0/4) Epoch 12, batch 15750, loss[loss=0.2145, simple_loss=0.2819, pruned_loss=0.07359, over 21397.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3013, pruned_loss=0.07504, over 4258477.27 frames. 
], batch size: 194, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:52:59,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2107146.0, ans=0.125 2023-06-26 00:53:06,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2107146.0, ans=0.125 2023-06-26 00:53:37,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2107206.0, ans=0.2 2023-06-26 00:53:38,552 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.389e+02 8.722e+02 1.383e+03 2.030e+03 4.451e+03, threshold=2.767e+03, percent-clipped=16.0 2023-06-26 00:54:40,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2107446.0, ans=0.125 2023-06-26 00:54:41,597 INFO [train.py:996] (0/4) Epoch 12, batch 15800, loss[loss=0.2261, simple_loss=0.2891, pruned_loss=0.08148, over 21704.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.295, pruned_loss=0.0746, over 4255831.86 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:54:42,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2107446.0, ans=0.125 2023-06-26 00:55:01,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2107446.0, ans=0.0 2023-06-26 00:55:18,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2107506.0, ans=0.125 2023-06-26 00:55:26,346 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-26 00:55:43,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2107626.0, ans=10.0 2023-06-26 00:56:30,884 INFO [train.py:996] (0/4) Epoch 12, batch 15850, loss[loss=0.2816, simple_loss=0.3441, pruned_loss=0.1096, over 21425.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2967, pruned_loss=0.07692, over 4257560.05 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:56:55,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2107806.0, ans=0.125 2023-06-26 00:56:55,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2107806.0, ans=0.0 2023-06-26 00:56:56,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=12.0 2023-06-26 00:57:10,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.897e+02 9.764e+02 1.470e+03 2.269e+03 4.632e+03, threshold=2.939e+03, percent-clipped=9.0 2023-06-26 00:57:16,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2107866.0, ans=0.125 2023-06-26 00:57:16,599 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.00 vs. 
limit=22.5 2023-06-26 00:57:19,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2107866.0, ans=0.125 2023-06-26 00:57:29,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=2107926.0, ans=0.025 2023-06-26 00:57:38,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2107926.0, ans=0.125 2023-06-26 00:57:51,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2107986.0, ans=0.125 2023-06-26 00:58:11,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2107986.0, ans=0.125 2023-06-26 00:58:14,240 INFO [train.py:996] (0/4) Epoch 12, batch 15900, loss[loss=0.2035, simple_loss=0.2742, pruned_loss=0.06642, over 21196.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2956, pruned_loss=0.07803, over 4256680.42 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:58:16,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2108046.0, ans=0.04949747468305833 2023-06-26 00:58:56,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-26 00:59:17,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2108226.0, ans=0.125 2023-06-26 00:59:44,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-06-26 00:59:51,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2108286.0, ans=0.125 2023-06-26 00:59:56,221 INFO [train.py:996] (0/4) Epoch 12, batch 15950, loss[loss=0.237, simple_loss=0.3248, pruned_loss=0.07455, over 21611.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2986, pruned_loss=0.07662, over 4258635.10 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 01:00:11,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2108346.0, ans=0.2 2023-06-26 01:00:36,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2108406.0, ans=0.0 2023-06-26 01:00:41,311 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.627e+02 7.587e+02 1.024e+03 1.341e+03 2.731e+03, threshold=2.049e+03, percent-clipped=0.0 2023-06-26 01:01:25,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2108586.0, ans=0.2 2023-06-26 01:01:28,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-26 01:01:32,813 INFO [train.py:996] (0/4) Epoch 12, batch 16000, loss[loss=0.2253, simple_loss=0.3232, pruned_loss=0.06369, over 21806.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2982, pruned_loss=0.07344, over 4264014.54 frames. 
], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:01:33,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2108646.0, ans=0.125 2023-06-26 01:02:16,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2108706.0, ans=0.1 2023-06-26 01:03:09,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2108886.0, ans=0.2 2023-06-26 01:03:12,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2108886.0, ans=0.125 2023-06-26 01:03:26,084 INFO [train.py:996] (0/4) Epoch 12, batch 16050, loss[loss=0.1658, simple_loss=0.2467, pruned_loss=0.04244, over 21818.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2997, pruned_loss=0.07145, over 4263361.03 frames. ], batch size: 102, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:03:42,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2108946.0, ans=0.125 2023-06-26 01:03:43,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-06-26 01:03:45,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2108946.0, ans=0.95 2023-06-26 01:04:04,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2109006.0, ans=0.1 2023-06-26 01:04:18,277 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 8.237e+02 1.456e+03 2.962e+03 6.704e+03, threshold=2.913e+03, percent-clipped=34.0 2023-06-26 01:04:27,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2109066.0, ans=0.125 2023-06-26 01:05:16,012 INFO [train.py:996] (0/4) Epoch 12, batch 16100, loss[loss=0.2242, simple_loss=0.2963, pruned_loss=0.07608, over 21851.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3049, pruned_loss=0.07249, over 4271800.96 frames. ], batch size: 298, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:05:16,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2109246.0, ans=0.125 2023-06-26 01:06:09,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2109366.0, ans=0.125 2023-06-26 01:06:40,443 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:07:03,537 INFO [train.py:996] (0/4) Epoch 12, batch 16150, loss[loss=0.2667, simple_loss=0.3443, pruned_loss=0.0946, over 17499.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3037, pruned_loss=0.07431, over 4272516.73 frames. 
], batch size: 60, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:07:22,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2109546.0, ans=0.2 2023-06-26 01:07:58,380 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.216e+02 8.957e+02 1.137e+03 1.849e+03 5.347e+03, threshold=2.275e+03, percent-clipped=8.0 2023-06-26 01:08:00,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2109666.0, ans=0.1 2023-06-26 01:08:04,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2109666.0, ans=0.125 2023-06-26 01:08:12,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2109666.0, ans=0.0 2023-06-26 01:08:16,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2109726.0, ans=0.2 2023-06-26 01:08:56,299 INFO [train.py:996] (0/4) Epoch 12, batch 16200, loss[loss=0.2528, simple_loss=0.337, pruned_loss=0.08429, over 21773.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3109, pruned_loss=0.0767, over 4276803.92 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:09:37,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2109906.0, ans=0.2 2023-06-26 01:09:43,135 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.53 vs. limit=8.0 2023-06-26 01:09:47,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2109966.0, ans=0.125 2023-06-26 01:09:59,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2109966.0, ans=0.0 2023-06-26 01:10:29,493 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.39 vs. limit=22.5 2023-06-26 01:10:32,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2110086.0, ans=0.1 2023-06-26 01:10:52,258 INFO [train.py:996] (0/4) Epoch 12, batch 16250, loss[loss=0.174, simple_loss=0.2562, pruned_loss=0.04585, over 21525.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3113, pruned_loss=0.07678, over 4269089.80 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:11:01,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2110146.0, ans=0.125 2023-06-26 01:11:15,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.37 vs. limit=15.0 2023-06-26 01:11:31,328 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.426e+02 1.023e+03 1.432e+03 2.136e+03 5.202e+03, threshold=2.864e+03, percent-clipped=19.0 2023-06-26 01:11:58,996 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:12:34,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.16 vs. 
limit=15.0 2023-06-26 01:12:34,513 INFO [train.py:996] (0/4) Epoch 12, batch 16300, loss[loss=0.1766, simple_loss=0.2694, pruned_loss=0.04188, over 21772.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3046, pruned_loss=0.07327, over 4260518.85 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 01:13:37,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2110626.0, ans=0.0 2023-06-26 01:14:25,788 INFO [train.py:996] (0/4) Epoch 12, batch 16350, loss[loss=0.2604, simple_loss=0.3272, pruned_loss=0.09678, over 21415.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3035, pruned_loss=0.07402, over 4261550.99 frames. ], batch size: 211, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:15:07,550 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.769e+02 8.768e+02 1.143e+03 1.640e+03 3.455e+03, threshold=2.286e+03, percent-clipped=4.0 2023-06-26 01:15:33,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2110926.0, ans=0.0 2023-06-26 01:15:44,450 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=12.0 2023-06-26 01:16:21,200 INFO [train.py:996] (0/4) Epoch 12, batch 16400, loss[loss=0.2424, simple_loss=0.3046, pruned_loss=0.09003, over 21458.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3058, pruned_loss=0.07447, over 4263000.19 frames. ], batch size: 177, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:16:34,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2111046.0, ans=0.125 2023-06-26 01:16:50,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2111106.0, ans=0.0 2023-06-26 01:17:47,779 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0 2023-06-26 01:18:03,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2111346.0, ans=0.0 2023-06-26 01:18:04,830 INFO [train.py:996] (0/4) Epoch 12, batch 16450, loss[loss=0.2265, simple_loss=0.301, pruned_loss=0.076, over 21845.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3053, pruned_loss=0.07542, over 4263691.72 frames. ], batch size: 124, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:18:14,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2111346.0, ans=0.035 2023-06-26 01:18:14,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2111346.0, ans=0.0 2023-06-26 01:18:16,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2111346.0, ans=0.125 2023-06-26 01:18:39,421 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.567e+02 7.458e+02 1.043e+03 1.518e+03 3.613e+03, threshold=2.086e+03, percent-clipped=5.0 2023-06-26 01:19:16,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.48 vs. 
limit=15.0 2023-06-26 01:19:52,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2111586.0, ans=0.2 2023-06-26 01:19:54,989 INFO [train.py:996] (0/4) Epoch 12, batch 16500, loss[loss=0.216, simple_loss=0.3284, pruned_loss=0.05178, over 20862.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3043, pruned_loss=0.07601, over 4257206.77 frames. ], batch size: 608, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:19:59,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2111646.0, ans=0.125 2023-06-26 01:21:46,610 INFO [train.py:996] (0/4) Epoch 12, batch 16550, loss[loss=0.3071, simple_loss=0.3745, pruned_loss=0.1199, over 21422.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3037, pruned_loss=0.07342, over 4261114.71 frames. ], batch size: 507, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:22:00,985 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-352000.pt 2023-06-26 01:22:24,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2112006.0, ans=0.125 2023-06-26 01:22:28,603 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.316e+02 1.128e+03 1.805e+03 2.872e+03 7.168e+03, threshold=3.610e+03, percent-clipped=40.0 2023-06-26 01:23:00,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2112126.0, ans=0.125 2023-06-26 01:23:15,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2112126.0, ans=0.125 2023-06-26 01:23:39,438 INFO [train.py:996] (0/4) Epoch 12, batch 16600, loss[loss=0.2705, simple_loss=0.3773, pruned_loss=0.08181, over 21660.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3116, pruned_loss=0.07651, over 4263180.78 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:23:43,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2112246.0, ans=0.125 2023-06-26 01:24:04,835 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=22.5 2023-06-26 01:25:23,853 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.48 vs. limit=15.0 2023-06-26 01:25:29,298 INFO [train.py:996] (0/4) Epoch 12, batch 16650, loss[loss=0.227, simple_loss=0.3054, pruned_loss=0.07432, over 20615.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3231, pruned_loss=0.08015, over 4265081.92 frames. 
], batch size: 607, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:26:02,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2112606.0, ans=0.125 2023-06-26 01:26:29,257 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.918e+02 9.386e+02 1.441e+03 2.110e+03 3.541e+03, threshold=2.881e+03, percent-clipped=0.0 2023-06-26 01:26:40,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2112666.0, ans=0.125 2023-06-26 01:26:51,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2112726.0, ans=0.2 2023-06-26 01:27:27,189 INFO [train.py:996] (0/4) Epoch 12, batch 16700, loss[loss=0.2244, simple_loss=0.3031, pruned_loss=0.07288, over 21770.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3247, pruned_loss=0.08172, over 4265357.28 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:27:27,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2112846.0, ans=0.0 2023-06-26 01:27:41,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2112846.0, ans=0.125 2023-06-26 01:27:41,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2112846.0, ans=0.0 2023-06-26 01:28:17,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-26 01:28:21,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2112966.0, ans=0.125 2023-06-26 01:29:20,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-26 01:29:38,428 INFO [train.py:996] (0/4) Epoch 12, batch 16750, loss[loss=0.2701, simple_loss=0.3368, pruned_loss=0.1017, over 21576.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3264, pruned_loss=0.08432, over 4268968.36 frames. ], batch size: 230, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:30:05,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. 
limit=15.0 2023-06-26 01:30:21,366 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.824e+02 1.154e+03 1.795e+03 2.444e+03 4.443e+03, threshold=3.590e+03, percent-clipped=18.0 2023-06-26 01:30:28,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2113266.0, ans=0.125 2023-06-26 01:30:30,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2113266.0, ans=0.0 2023-06-26 01:30:48,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2113326.0, ans=0.125 2023-06-26 01:30:51,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2113326.0, ans=0.125 2023-06-26 01:31:24,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2113386.0, ans=0.125 2023-06-26 01:31:30,351 INFO [train.py:996] (0/4) Epoch 12, batch 16800, loss[loss=0.2127, simple_loss=0.2997, pruned_loss=0.06285, over 21741.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3307, pruned_loss=0.0845, over 4263111.26 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:31:38,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2113446.0, ans=0.0 2023-06-26 01:31:43,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2113446.0, ans=0.125 2023-06-26 01:31:46,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2113506.0, ans=0.125 2023-06-26 01:32:14,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2113566.0, ans=0.1 2023-06-26 01:32:30,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2113626.0, ans=0.125 2023-06-26 01:33:02,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2113686.0, ans=0.015 2023-06-26 01:33:10,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2113686.0, ans=0.0 2023-06-26 01:33:18,864 INFO [train.py:996] (0/4) Epoch 12, batch 16850, loss[loss=0.2233, simple_loss=0.2922, pruned_loss=0.07724, over 21678.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3272, pruned_loss=0.08465, over 4267730.83 frames. ], batch size: 230, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:33:19,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2113746.0, ans=0.125 2023-06-26 01:33:36,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2113806.0, ans=0.0 2023-06-26 01:33:47,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.40 vs. 
limit=15.0 2023-06-26 01:33:54,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2113806.0, ans=0.0 2023-06-26 01:33:56,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. limit=10.0 2023-06-26 01:33:56,647 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-26 01:34:00,648 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.998e+02 1.125e+03 1.896e+03 2.663e+03 4.313e+03, threshold=3.792e+03, percent-clipped=10.0 2023-06-26 01:35:02,991 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-26 01:35:04,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=2113986.0, ans=0.95 2023-06-26 01:35:07,208 INFO [train.py:996] (0/4) Epoch 12, batch 16900, loss[loss=0.2386, simple_loss=0.3194, pruned_loss=0.07886, over 21765.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3214, pruned_loss=0.08269, over 4277062.80 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:35:46,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2114166.0, ans=0.0 2023-06-26 01:35:55,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.69 vs. limit=22.5 2023-06-26 01:36:39,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-26 01:36:51,875 INFO [train.py:996] (0/4) Epoch 12, batch 16950, loss[loss=0.2184, simple_loss=0.2877, pruned_loss=0.07455, over 21945.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3135, pruned_loss=0.08063, over 4281379.09 frames. ], batch size: 316, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:37:33,325 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.589e+02 8.373e+02 1.013e+03 1.291e+03 3.071e+03, threshold=2.026e+03, percent-clipped=0.0 2023-06-26 01:37:34,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=12.0 2023-06-26 01:38:41,256 INFO [train.py:996] (0/4) Epoch 12, batch 17000, loss[loss=0.2426, simple_loss=0.3087, pruned_loss=0.0882, over 21963.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.31, pruned_loss=0.08094, over 4284018.60 frames. ], batch size: 333, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:38:45,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2114646.0, ans=0.125 2023-06-26 01:39:11,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2114706.0, ans=0.0 2023-06-26 01:39:13,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2114706.0, ans=0.125 2023-06-26 01:39:13,663 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. 
limit=15.0 2023-06-26 01:39:41,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2114766.0, ans=0.2 2023-06-26 01:40:00,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2114826.0, ans=0.0 2023-06-26 01:40:13,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-26 01:40:17,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-06-26 01:40:21,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2114886.0, ans=0.125 2023-06-26 01:40:25,775 INFO [train.py:996] (0/4) Epoch 12, batch 17050, loss[loss=0.2523, simple_loss=0.3309, pruned_loss=0.08688, over 21362.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3172, pruned_loss=0.08334, over 4283164.59 frames. ], batch size: 194, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:40:47,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2115006.0, ans=0.5 2023-06-26 01:40:58,365 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:41:05,974 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.789e+02 8.713e+02 1.354e+03 1.958e+03 3.911e+03, threshold=2.708e+03, percent-clipped=23.0 2023-06-26 01:42:13,586 INFO [train.py:996] (0/4) Epoch 12, batch 17100, loss[loss=0.2209, simple_loss=0.2853, pruned_loss=0.07827, over 21388.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3161, pruned_loss=0.08393, over 4287951.10 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:42:32,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2115246.0, ans=0.0 2023-06-26 01:42:48,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=22.5 2023-06-26 01:43:26,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2115426.0, ans=0.1 2023-06-26 01:44:02,148 INFO [train.py:996] (0/4) Epoch 12, batch 17150, loss[loss=0.1997, simple_loss=0.2732, pruned_loss=0.06316, over 21472.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.312, pruned_loss=0.08309, over 4288201.88 frames. 
], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:44:14,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2115546.0, ans=0.125 2023-06-26 01:44:19,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2115546.0, ans=0.0 2023-06-26 01:44:43,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.448e+02 7.663e+02 1.094e+03 1.326e+03 2.492e+03, threshold=2.188e+03, percent-clipped=0.0 2023-06-26 01:44:54,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2115666.0, ans=0.125 2023-06-26 01:45:17,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2115726.0, ans=0.0 2023-06-26 01:45:33,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2115786.0, ans=0.0 2023-06-26 01:45:52,539 INFO [train.py:996] (0/4) Epoch 12, batch 17200, loss[loss=0.2701, simple_loss=0.3392, pruned_loss=0.1005, over 21396.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3119, pruned_loss=0.08289, over 4292529.35 frames. ], batch size: 548, lr: 2.41e-03, grad_scale: 32.0 2023-06-26 01:46:09,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=2115906.0, ans=0.02 2023-06-26 01:46:24,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2115906.0, ans=0.0 2023-06-26 01:47:14,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2116026.0, ans=0.125 2023-06-26 01:47:45,191 INFO [train.py:996] (0/4) Epoch 12, batch 17250, loss[loss=0.2903, simple_loss=0.3545, pruned_loss=0.113, over 21791.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3159, pruned_loss=0.08521, over 4288474.99 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:47:52,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2116146.0, ans=0.2 2023-06-26 01:48:42,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.045e+02 8.687e+02 1.216e+03 1.753e+03 5.268e+03, threshold=2.433e+03, percent-clipped=15.0 2023-06-26 01:48:54,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2116326.0, ans=0.125 2023-06-26 01:49:06,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2116326.0, ans=0.0 2023-06-26 01:49:24,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.71 vs. limit=15.0 2023-06-26 01:49:35,557 INFO [train.py:996] (0/4) Epoch 12, batch 17300, loss[loss=0.2721, simple_loss=0.337, pruned_loss=0.1037, over 21835.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3237, pruned_loss=0.08817, over 4290325.22 frames. 
], batch size: 118, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:50:52,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2116626.0, ans=0.1 2023-06-26 01:51:29,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.98 vs. limit=15.0 2023-06-26 01:51:43,240 INFO [train.py:996] (0/4) Epoch 12, batch 17350, loss[loss=0.2744, simple_loss=0.3656, pruned_loss=0.09163, over 21475.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.325, pruned_loss=0.0882, over 4281127.41 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:51:43,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2116746.0, ans=0.125 2023-06-26 01:51:53,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2116746.0, ans=0.2 2023-06-26 01:52:25,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2116866.0, ans=0.0 2023-06-26 01:52:33,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.805e+02 1.057e+03 1.442e+03 1.846e+03 4.357e+03, threshold=2.883e+03, percent-clipped=11.0 2023-06-26 01:53:12,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-26 01:53:38,850 INFO [train.py:996] (0/4) Epoch 12, batch 17400, loss[loss=0.2357, simple_loss=0.3246, pruned_loss=0.07345, over 21641.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3208, pruned_loss=0.08388, over 4280942.96 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:53:58,708 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-26 01:54:08,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2117106.0, ans=0.1 2023-06-26 01:54:16,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2117166.0, ans=0.125 2023-06-26 01:54:21,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2117166.0, ans=0.0 2023-06-26 01:54:32,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2117166.0, ans=0.125 2023-06-26 01:55:26,427 INFO [train.py:996] (0/4) Epoch 12, batch 17450, loss[loss=0.2187, simple_loss=0.3156, pruned_loss=0.06091, over 21229.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3149, pruned_loss=0.0812, over 4274876.54 frames. 
], batch size: 548, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:55:28,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2117346.0, ans=0.04949747468305833 2023-06-26 01:56:07,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2117466.0, ans=0.125 2023-06-26 01:56:07,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2117466.0, ans=0.125 2023-06-26 01:56:11,459 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.427e+02 9.633e+02 1.716e+03 2.627e+03 5.192e+03, threshold=3.432e+03, percent-clipped=16.0 2023-06-26 01:56:51,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2117586.0, ans=0.125 2023-06-26 01:56:52,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2117586.0, ans=0.125 2023-06-26 01:57:03,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=22.5 2023-06-26 01:57:16,268 INFO [train.py:996] (0/4) Epoch 12, batch 17500, loss[loss=0.2335, simple_loss=0.2999, pruned_loss=0.08357, over 21826.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3098, pruned_loss=0.07805, over 4272794.57 frames. ], batch size: 247, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:57:42,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2117706.0, ans=0.125 2023-06-26 01:57:56,383 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=22.5 2023-06-26 01:58:18,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2117826.0, ans=0.1 2023-06-26 01:58:19,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-26 01:58:35,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=22.5 2023-06-26 01:58:36,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2117826.0, ans=0.125 2023-06-26 01:58:50,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5 2023-06-26 01:59:02,076 INFO [train.py:996] (0/4) Epoch 12, batch 17550, loss[loss=0.2023, simple_loss=0.2965, pruned_loss=0.05406, over 21417.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3096, pruned_loss=0.07627, over 4271721.00 frames. 
], batch size: 131, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:59:41,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2118066.0, ans=0.125 2023-06-26 01:59:44,817 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.370e+02 8.396e+02 1.240e+03 1.763e+03 3.484e+03, threshold=2.480e+03, percent-clipped=3.0 2023-06-26 01:59:45,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2118066.0, ans=0.125 2023-06-26 02:00:47,366 INFO [train.py:996] (0/4) Epoch 12, batch 17600, loss[loss=0.2262, simple_loss=0.3058, pruned_loss=0.07332, over 20739.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3129, pruned_loss=0.07623, over 4268114.20 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:01:01,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2118246.0, ans=0.1 2023-06-26 02:01:45,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0 2023-06-26 02:02:00,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2118426.0, ans=0.0 2023-06-26 02:02:05,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2118426.0, ans=0.125 2023-06-26 02:02:32,680 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-26 02:02:36,281 INFO [train.py:996] (0/4) Epoch 12, batch 17650, loss[loss=0.2476, simple_loss=0.3377, pruned_loss=0.07877, over 20015.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3137, pruned_loss=0.07782, over 4266145.76 frames. ], batch size: 702, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:03:23,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2118666.0, ans=0.0 2023-06-26 02:03:27,855 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.781e+02 8.671e+02 1.388e+03 2.220e+03 4.878e+03, threshold=2.775e+03, percent-clipped=22.0 2023-06-26 02:04:14,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2118786.0, ans=0.2 2023-06-26 02:04:31,209 INFO [train.py:996] (0/4) Epoch 12, batch 17700, loss[loss=0.1773, simple_loss=0.2523, pruned_loss=0.05113, over 21487.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3062, pruned_loss=0.07467, over 4268306.04 frames. ], batch size: 212, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:04:50,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2118906.0, ans=0.125 2023-06-26 02:05:05,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2118906.0, ans=0.1 2023-06-26 02:05:35,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2118966.0, ans=0.0 2023-06-26 02:05:45,809 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. 
limit=15.0 2023-06-26 02:05:55,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2119026.0, ans=0.04949747468305833 2023-06-26 02:06:14,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2119086.0, ans=0.125 2023-06-26 02:06:20,351 INFO [train.py:996] (0/4) Epoch 12, batch 17750, loss[loss=0.2532, simple_loss=0.3327, pruned_loss=0.08682, over 21721.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3137, pruned_loss=0.07793, over 4266418.09 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:06:23,298 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-26 02:07:10,885 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-26 02:07:18,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.321e+02 9.910e+02 1.512e+03 2.052e+03 5.083e+03, threshold=3.025e+03, percent-clipped=13.0 2023-06-26 02:07:32,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2119326.0, ans=0.5 2023-06-26 02:08:08,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2119386.0, ans=0.0 2023-06-26 02:08:11,446 INFO [train.py:996] (0/4) Epoch 12, batch 17800, loss[loss=0.2139, simple_loss=0.2817, pruned_loss=0.07301, over 21570.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3122, pruned_loss=0.07686, over 4269042.78 frames. ], batch size: 112, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:09:09,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2119566.0, ans=0.0 2023-06-26 02:09:13,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2119566.0, ans=0.125 2023-06-26 02:09:37,462 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.88 vs. limit=8.0 2023-06-26 02:10:01,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.27 vs. limit=6.0 2023-06-26 02:10:07,320 INFO [train.py:996] (0/4) Epoch 12, batch 17850, loss[loss=0.249, simple_loss=0.3194, pruned_loss=0.08925, over 21410.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3137, pruned_loss=0.07777, over 4268494.71 frames. 
], batch size: 194, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:10:33,627 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:10:55,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2119866.0, ans=0.125 2023-06-26 02:10:58,033 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.213e+02 1.072e+03 1.677e+03 2.667e+03 5.853e+03, threshold=3.353e+03, percent-clipped=21.0 2023-06-26 02:10:58,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2119866.0, ans=0.0 2023-06-26 02:11:20,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=2119926.0, ans=10.0 2023-06-26 02:12:03,565 INFO [train.py:996] (0/4) Epoch 12, batch 17900, loss[loss=0.2444, simple_loss=0.3471, pruned_loss=0.0708, over 21831.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.319, pruned_loss=0.08027, over 4271890.90 frames. ], batch size: 371, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:12:04,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2120046.0, ans=0.2 2023-06-26 02:12:04,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2120046.0, ans=0.125 2023-06-26 02:12:22,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2120046.0, ans=0.125 2023-06-26 02:12:23,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2120046.0, ans=0.125 2023-06-26 02:12:55,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-26 02:13:59,319 INFO [train.py:996] (0/4) Epoch 12, batch 17950, loss[loss=0.2009, simple_loss=0.3006, pruned_loss=0.05056, over 21800.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3188, pruned_loss=0.07703, over 4273301.13 frames. ], batch size: 371, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:14:31,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-26 02:14:44,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.194e+02 1.009e+03 1.518e+03 2.027e+03 4.283e+03, threshold=3.036e+03, percent-clipped=1.0 2023-06-26 02:14:45,365 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. 
limit=6.0 2023-06-26 02:14:57,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2120466.0, ans=0.1 2023-06-26 02:15:21,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2120526.0, ans=0.2 2023-06-26 02:15:35,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2120586.0, ans=0.0 2023-06-26 02:15:40,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2120586.0, ans=0.125 2023-06-26 02:15:49,544 INFO [train.py:996] (0/4) Epoch 12, batch 18000, loss[loss=0.1769, simple_loss=0.2508, pruned_loss=0.05146, over 21627.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3119, pruned_loss=0.0756, over 4272249.43 frames. ], batch size: 332, lr: 2.41e-03, grad_scale: 32.0 2023-06-26 02:15:49,546 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 02:16:04,053 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.5785, 5.5930, 5.2134, 5.2044], device='cuda:0') 2023-06-26 02:16:07,682 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.258, simple_loss=0.3529, pruned_loss=0.08158, over 1796401.00 frames. 2023-06-26 02:16:07,683 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-26 02:16:09,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2120646.0, ans=0.2 2023-06-26 02:16:14,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-26 02:16:16,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2120646.0, ans=0.125 2023-06-26 02:16:59,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2120766.0, ans=0.125 2023-06-26 02:16:59,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2120766.0, ans=0.125 2023-06-26 02:17:02,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2120766.0, ans=0.125 2023-06-26 02:17:15,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2120826.0, ans=0.125 2023-06-26 02:17:55,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-26 02:17:55,809 INFO [train.py:996] (0/4) Epoch 12, batch 18050, loss[loss=0.2434, simple_loss=0.3183, pruned_loss=0.08425, over 21406.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3055, pruned_loss=0.07458, over 4271519.81 frames. 
], batch size: 131, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:18:13,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2120946.0, ans=0.125 2023-06-26 02:18:28,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2121006.0, ans=0.125 2023-06-26 02:18:43,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2121066.0, ans=0.125 2023-06-26 02:18:55,873 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.179e+02 8.208e+02 1.156e+03 1.748e+03 3.501e+03, threshold=2.312e+03, percent-clipped=3.0 2023-06-26 02:18:56,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2121066.0, ans=0.0 2023-06-26 02:19:50,065 INFO [train.py:996] (0/4) Epoch 12, batch 18100, loss[loss=0.2605, simple_loss=0.3549, pruned_loss=0.083, over 21634.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3094, pruned_loss=0.07671, over 4261394.81 frames. ], batch size: 414, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:20:17,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2121306.0, ans=0.125 2023-06-26 02:21:05,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2121426.0, ans=0.125 2023-06-26 02:21:25,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2121486.0, ans=0.125 2023-06-26 02:21:26,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2121486.0, ans=0.035 2023-06-26 02:21:29,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-26 02:21:36,988 INFO [train.py:996] (0/4) Epoch 12, batch 18150, loss[loss=0.2046, simple_loss=0.2806, pruned_loss=0.06426, over 21460.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3114, pruned_loss=0.07691, over 4265185.98 frames. ], batch size: 195, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:22:36,065 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.496e+02 9.455e+02 1.549e+03 2.118e+03 3.915e+03, threshold=3.098e+03, percent-clipped=17.0 2023-06-26 02:22:40,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2121666.0, ans=0.0 2023-06-26 02:23:04,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2121786.0, ans=0.125 2023-06-26 02:23:24,021 INFO [train.py:996] (0/4) Epoch 12, batch 18200, loss[loss=0.2204, simple_loss=0.2811, pruned_loss=0.07984, over 21392.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3057, pruned_loss=0.07677, over 4274004.27 frames. ], batch size: 144, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:23:29,347 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:23:51,478 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.77 vs. 
limit=15.0 2023-06-26 02:23:56,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2121906.0, ans=0.125 2023-06-26 02:24:07,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2121966.0, ans=0.2 2023-06-26 02:24:32,459 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-26 02:25:01,800 INFO [train.py:996] (0/4) Epoch 12, batch 18250, loss[loss=0.2036, simple_loss=0.2705, pruned_loss=0.06838, over 21730.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2979, pruned_loss=0.07396, over 4270824.19 frames. ], batch size: 247, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:25:54,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.441e+02 8.526e+02 1.132e+03 1.532e+03 3.016e+03, threshold=2.265e+03, percent-clipped=0.0 2023-06-26 02:26:44,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2122386.0, ans=0.0 2023-06-26 02:26:44,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-26 02:26:48,438 INFO [train.py:996] (0/4) Epoch 12, batch 18300, loss[loss=0.2542, simple_loss=0.3187, pruned_loss=0.0949, over 21758.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2987, pruned_loss=0.07394, over 4259149.26 frames. ], batch size: 112, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:27:24,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-26 02:27:32,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2122566.0, ans=0.125 2023-06-26 02:28:10,538 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-26 02:28:16,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2122686.0, ans=0.125 2023-06-26 02:28:18,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2122686.0, ans=0.2 2023-06-26 02:28:27,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-26 02:28:34,404 INFO [train.py:996] (0/4) Epoch 12, batch 18350, loss[loss=0.2016, simple_loss=0.2648, pruned_loss=0.0692, over 21164.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3055, pruned_loss=0.07458, over 4258138.21 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:28:35,793 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.54 vs. 
limit=15.0 2023-06-26 02:29:31,210 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.251e+02 1.279e+03 1.909e+03 2.961e+03 4.815e+03, threshold=3.819e+03, percent-clipped=39.0 2023-06-26 02:29:53,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2122926.0, ans=0.0 2023-06-26 02:30:26,203 INFO [train.py:996] (0/4) Epoch 12, batch 18400, loss[loss=0.2435, simple_loss=0.3051, pruned_loss=0.09097, over 21298.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3001, pruned_loss=0.0725, over 4257956.61 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:31:03,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2123106.0, ans=0.125 2023-06-26 02:31:20,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2123166.0, ans=0.0 2023-06-26 02:31:29,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2123166.0, ans=0.125 2023-06-26 02:32:02,394 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-06-26 02:32:10,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2123286.0, ans=0.125 2023-06-26 02:32:10,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2123286.0, ans=0.1 2023-06-26 02:32:13,318 INFO [train.py:996] (0/4) Epoch 12, batch 18450, loss[loss=0.201, simple_loss=0.2783, pruned_loss=0.06187, over 21689.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2962, pruned_loss=0.06908, over 4250010.44 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:32:33,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2123346.0, ans=0.125 2023-06-26 02:33:07,644 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.751e+02 8.419e+02 1.219e+03 1.812e+03 4.554e+03, threshold=2.437e+03, percent-clipped=1.0 2023-06-26 02:33:30,699 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2023-06-26 02:34:00,018 INFO [train.py:996] (0/4) Epoch 12, batch 18500, loss[loss=0.186, simple_loss=0.2562, pruned_loss=0.05795, over 21825.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2922, pruned_loss=0.06782, over 4244550.10 frames. ], batch size: 352, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:34:02,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2123646.0, ans=0.0 2023-06-26 02:35:18,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2123826.0, ans=0.2 2023-06-26 02:35:50,213 INFO [train.py:996] (0/4) Epoch 12, batch 18550, loss[loss=0.2064, simple_loss=0.2801, pruned_loss=0.06631, over 21782.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2891, pruned_loss=0.067, over 4237424.34 frames. 
], batch size: 351, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:36:08,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2123946.0, ans=0.0 2023-06-26 02:36:28,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2124006.0, ans=0.0 2023-06-26 02:36:39,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2124066.0, ans=0.0 2023-06-26 02:36:50,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2124066.0, ans=0.1 2023-06-26 02:36:59,193 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.320e+02 1.196e+03 2.047e+03 2.769e+03 5.158e+03, threshold=4.094e+03, percent-clipped=37.0 2023-06-26 02:37:25,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2124186.0, ans=0.0 2023-06-26 02:37:47,742 INFO [train.py:996] (0/4) Epoch 12, batch 18600, loss[loss=0.2558, simple_loss=0.3417, pruned_loss=0.08494, over 21699.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2895, pruned_loss=0.06948, over 4220226.83 frames. ], batch size: 415, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:37:55,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2124246.0, ans=0.0 2023-06-26 02:39:08,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.58 vs. limit=10.0 2023-06-26 02:39:31,011 INFO [train.py:996] (0/4) Epoch 12, batch 18650, loss[loss=0.2085, simple_loss=0.2744, pruned_loss=0.07126, over 21679.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2881, pruned_loss=0.06914, over 4230587.18 frames. ], batch size: 333, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:40:13,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2124606.0, ans=0.125 2023-06-26 02:40:30,402 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.144e+02 8.452e+02 1.273e+03 1.829e+03 4.021e+03, threshold=2.546e+03, percent-clipped=0.0 2023-06-26 02:40:50,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2124726.0, ans=0.1 2023-06-26 02:41:20,345 INFO [train.py:996] (0/4) Epoch 12, batch 18700, loss[loss=0.194, simple_loss=0.281, pruned_loss=0.05348, over 21604.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.286, pruned_loss=0.07075, over 4219715.70 frames. 
], batch size: 230, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:42:30,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2124966.0, ans=0.1 2023-06-26 02:42:41,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2125026.0, ans=0.0 2023-06-26 02:42:42,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2125026.0, ans=0.125 2023-06-26 02:42:44,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2125026.0, ans=0.1 2023-06-26 02:42:56,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2125086.0, ans=0.125 2023-06-26 02:43:09,795 INFO [train.py:996] (0/4) Epoch 12, batch 18750, loss[loss=0.2094, simple_loss=0.271, pruned_loss=0.07391, over 21263.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2896, pruned_loss=0.07377, over 4242595.17 frames. ], batch size: 144, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:43:36,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2125206.0, ans=0.1 2023-06-26 02:44:08,697 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.434e+02 9.819e+02 1.428e+03 2.632e+03 5.661e+03, threshold=2.856e+03, percent-clipped=25.0 2023-06-26 02:44:12,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2125266.0, ans=0.125 2023-06-26 02:44:16,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2125326.0, ans=0.125 2023-06-26 02:44:37,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2125326.0, ans=0.0 2023-06-26 02:44:47,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2125386.0, ans=0.95 2023-06-26 02:44:56,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2125446.0, ans=0.1 2023-06-26 02:44:57,212 INFO [train.py:996] (0/4) Epoch 12, batch 18800, loss[loss=0.217, simple_loss=0.3262, pruned_loss=0.05393, over 20781.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2964, pruned_loss=0.07528, over 4240505.18 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 32.0 2023-06-26 02:45:34,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2125506.0, ans=0.125 2023-06-26 02:46:44,217 INFO [train.py:996] (0/4) Epoch 12, batch 18850, loss[loss=0.1723, simple_loss=0.26, pruned_loss=0.04229, over 21505.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.293, pruned_loss=0.07093, over 4241837.08 frames. ], batch size: 194, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:47:13,871 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-26 02:47:15,607 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.57 vs. 
limit=15.0 2023-06-26 02:47:26,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-26 02:47:45,906 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.700e+02 8.171e+02 1.119e+03 1.966e+03 5.674e+03, threshold=2.238e+03, percent-clipped=7.0 2023-06-26 02:48:31,744 INFO [train.py:996] (0/4) Epoch 12, batch 18900, loss[loss=0.2844, simple_loss=0.3218, pruned_loss=0.1235, over 21439.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2886, pruned_loss=0.07025, over 4250168.61 frames. ], batch size: 508, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:48:35,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2126046.0, ans=0.07 2023-06-26 02:49:37,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.60 vs. limit=10.0 2023-06-26 02:50:19,406 INFO [train.py:996] (0/4) Epoch 12, batch 18950, loss[loss=0.2523, simple_loss=0.3245, pruned_loss=0.09009, over 21765.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2876, pruned_loss=0.07178, over 4258348.43 frames. ], batch size: 247, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:51:21,184 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.969e+02 7.601e+02 1.025e+03 1.566e+03 4.291e+03, threshold=2.050e+03, percent-clipped=8.0 2023-06-26 02:51:50,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2126586.0, ans=0.1 2023-06-26 02:51:52,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2126586.0, ans=0.1 2023-06-26 02:52:07,275 INFO [train.py:996] (0/4) Epoch 12, batch 19000, loss[loss=0.2493, simple_loss=0.3416, pruned_loss=0.07852, over 21755.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2998, pruned_loss=0.07411, over 4270978.01 frames. ], batch size: 124, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:52:11,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2126646.0, ans=0.5 2023-06-26 02:52:47,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2126706.0, ans=0.1 2023-06-26 02:52:50,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2126706.0, ans=0.125 2023-06-26 02:53:25,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2126826.0, ans=0.125 2023-06-26 02:53:48,893 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-06-26 02:53:51,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=2126886.0, ans=0.5 2023-06-26 02:53:55,658 INFO [train.py:996] (0/4) Epoch 12, batch 19050, loss[loss=0.234, simple_loss=0.2995, pruned_loss=0.08429, over 21816.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.305, pruned_loss=0.07698, over 4275551.35 frames. 
], batch size: 282, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:55:00,805 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.866e+02 8.291e+02 1.142e+03 1.622e+03 3.399e+03, threshold=2.283e+03, percent-clipped=17.0 2023-06-26 02:55:46,494 INFO [train.py:996] (0/4) Epoch 12, batch 19100, loss[loss=0.2247, simple_loss=0.2924, pruned_loss=0.07845, over 21757.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3035, pruned_loss=0.07829, over 4277729.48 frames. ], batch size: 112, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:55:55,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2127246.0, ans=0.125 2023-06-26 02:56:59,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.58 vs. limit=12.0 2023-06-26 02:57:49,121 INFO [train.py:996] (0/4) Epoch 12, batch 19150, loss[loss=0.2514, simple_loss=0.3345, pruned_loss=0.08409, over 21429.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3055, pruned_loss=0.07902, over 4274727.29 frames. ], batch size: 211, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:58:07,566 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-26 02:58:38,025 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-26 02:58:50,838 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.298e+02 9.303e+02 1.285e+03 2.071e+03 6.086e+03, threshold=2.570e+03, percent-clipped=20.0 2023-06-26 02:59:48,665 INFO [train.py:996] (0/4) Epoch 12, batch 19200, loss[loss=0.2401, simple_loss=0.3426, pruned_loss=0.06879, over 21711.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3152, pruned_loss=0.07946, over 4270816.74 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:59:53,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.03 vs. limit=5.0 2023-06-26 03:00:34,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2127966.0, ans=0.125 2023-06-26 03:00:35,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2127966.0, ans=0.2 2023-06-26 03:00:52,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2128026.0, ans=0.125 2023-06-26 03:01:36,767 INFO [train.py:996] (0/4) Epoch 12, batch 19250, loss[loss=0.1807, simple_loss=0.2765, pruned_loss=0.04244, over 21784.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3159, pruned_loss=0.07509, over 4276162.90 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 03:01:55,561 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.25 vs. 
limit=15.0 2023-06-26 03:02:08,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2128206.0, ans=0.0 2023-06-26 03:02:28,885 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.722e+02 8.401e+02 1.172e+03 1.980e+03 3.719e+03, threshold=2.345e+03, percent-clipped=11.0 2023-06-26 03:03:18,860 INFO [train.py:996] (0/4) Epoch 12, batch 19300, loss[loss=0.1957, simple_loss=0.2838, pruned_loss=0.05385, over 21767.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3126, pruned_loss=0.07435, over 4281754.85 frames. ], batch size: 282, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:03:55,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2128506.0, ans=0.2 2023-06-26 03:04:43,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2128686.0, ans=0.1 2023-06-26 03:05:09,209 INFO [train.py:996] (0/4) Epoch 12, batch 19350, loss[loss=0.1972, simple_loss=0.2887, pruned_loss=0.05281, over 21750.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3072, pruned_loss=0.07112, over 4286601.26 frames. ], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:05:23,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2128746.0, ans=0.125 2023-06-26 03:06:02,466 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.957e+02 9.080e+02 1.347e+03 2.318e+03 4.849e+03, threshold=2.694e+03, percent-clipped=24.0 2023-06-26 03:06:11,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2128926.0, ans=0.0 2023-06-26 03:06:12,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-26 03:06:18,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2128926.0, ans=0.0 2023-06-26 03:06:45,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2128986.0, ans=0.1 2023-06-26 03:06:57,158 INFO [train.py:996] (0/4) Epoch 12, batch 19400, loss[loss=0.231, simple_loss=0.3101, pruned_loss=0.07598, over 21823.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3042, pruned_loss=0.07035, over 4288433.95 frames. ], batch size: 391, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:07:41,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.09 vs. limit=15.0 2023-06-26 03:08:35,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2129286.0, ans=0.2 2023-06-26 03:08:45,437 INFO [train.py:996] (0/4) Epoch 12, batch 19450, loss[loss=0.2424, simple_loss=0.2869, pruned_loss=0.09894, over 21554.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3015, pruned_loss=0.07204, over 4289465.66 frames. 
], batch size: 511, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:08:48,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2129346.0, ans=0.2 2023-06-26 03:09:38,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.528e+02 8.932e+02 1.244e+03 1.603e+03 3.427e+03, threshold=2.488e+03, percent-clipped=5.0 2023-06-26 03:09:52,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2129526.0, ans=0.125 2023-06-26 03:10:08,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2129586.0, ans=0.125 2023-06-26 03:10:32,571 INFO [train.py:996] (0/4) Epoch 12, batch 19500, loss[loss=0.1893, simple_loss=0.244, pruned_loss=0.06726, over 21932.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2956, pruned_loss=0.07265, over 4282768.61 frames. ], batch size: 103, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:10:54,083 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:11:11,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2129766.0, ans=0.2 2023-06-26 03:11:18,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2129766.0, ans=0.125 2023-06-26 03:12:13,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-26 03:12:14,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2129886.0, ans=0.125 2023-06-26 03:12:21,303 INFO [train.py:996] (0/4) Epoch 12, batch 19550, loss[loss=0.2306, simple_loss=0.3249, pruned_loss=0.06812, over 21746.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2923, pruned_loss=0.07114, over 4278191.42 frames. ], batch size: 414, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:12:37,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2130006.0, ans=0.125 2023-06-26 03:12:45,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2130006.0, ans=0.0 2023-06-26 03:12:59,494 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-26 03:13:00,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2130066.0, ans=0.0 2023-06-26 03:13:06,165 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.47 vs. limit=12.0 2023-06-26 03:13:14,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. 
limit=15.0 2023-06-26 03:13:15,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.254e+02 9.637e+02 1.286e+03 1.805e+03 3.756e+03, threshold=2.572e+03, percent-clipped=14.0 2023-06-26 03:13:17,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=2130126.0, ans=10.0 2023-06-26 03:13:19,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2130126.0, ans=0.125 2023-06-26 03:13:42,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2130126.0, ans=0.125 2023-06-26 03:14:09,966 INFO [train.py:996] (0/4) Epoch 12, batch 19600, loss[loss=0.215, simple_loss=0.297, pruned_loss=0.06646, over 21659.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2946, pruned_loss=0.07193, over 4281392.67 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:14:49,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.97 vs. limit=15.0 2023-06-26 03:15:00,535 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:15:50,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2130486.0, ans=0.125 2023-06-26 03:15:52,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2130486.0, ans=0.125 2023-06-26 03:16:00,218 INFO [train.py:996] (0/4) Epoch 12, batch 19650, loss[loss=0.2295, simple_loss=0.3162, pruned_loss=0.07141, over 20090.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2994, pruned_loss=0.07566, over 4283556.62 frames. ], batch size: 704, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:16:15,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2130546.0, ans=0.0 2023-06-26 03:16:30,080 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-06-26 03:17:06,809 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.323e+02 8.252e+02 1.350e+03 1.732e+03 4.354e+03, threshold=2.700e+03, percent-clipped=5.0 2023-06-26 03:17:17,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2130726.0, ans=0.1 2023-06-26 03:17:28,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2130726.0, ans=0.125 2023-06-26 03:17:51,487 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=22.5 2023-06-26 03:18:00,750 INFO [train.py:996] (0/4) Epoch 12, batch 19700, loss[loss=0.243, simple_loss=0.3363, pruned_loss=0.07487, over 21516.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3028, pruned_loss=0.07631, over 4289767.34 frames. 
], batch size: 471, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:18:03,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2130846.0, ans=0.125 2023-06-26 03:18:05,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2130846.0, ans=0.0 2023-06-26 03:18:09,546 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-06-26 03:19:19,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2131026.0, ans=0.0 2023-06-26 03:19:42,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2131086.0, ans=0.0 2023-06-26 03:19:51,704 INFO [train.py:996] (0/4) Epoch 12, batch 19750, loss[loss=0.2651, simple_loss=0.3513, pruned_loss=0.0895, over 21735.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3119, pruned_loss=0.07793, over 4281160.10 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:20:09,275 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:20:17,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2131206.0, ans=0.1 2023-06-26 03:20:59,650 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.834e+02 9.206e+02 1.478e+03 2.437e+03 4.883e+03, threshold=2.956e+03, percent-clipped=21.0 2023-06-26 03:21:28,974 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=12.0 2023-06-26 03:21:41,952 INFO [train.py:996] (0/4) Epoch 12, batch 19800, loss[loss=0.1792, simple_loss=0.2592, pruned_loss=0.04964, over 21733.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3133, pruned_loss=0.07928, over 4285767.86 frames. ], batch size: 247, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:21:51,890 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-26 03:22:30,990 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.22 vs. limit=10.0 2023-06-26 03:22:44,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2131566.0, ans=0.0 2023-06-26 03:22:53,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2131626.0, ans=0.125 2023-06-26 03:23:15,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2131686.0, ans=0.0 2023-06-26 03:23:33,450 INFO [train.py:996] (0/4) Epoch 12, batch 19850, loss[loss=0.1738, simple_loss=0.2444, pruned_loss=0.05163, over 21382.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3059, pruned_loss=0.07482, over 4284278.96 frames. 
], batch size: 131, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:23:59,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2131746.0, ans=0.125 2023-06-26 03:24:41,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.737e+02 9.499e+02 1.416e+03 2.010e+03 4.711e+03, threshold=2.833e+03, percent-clipped=4.0 2023-06-26 03:24:53,017 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-26 03:25:00,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2131926.0, ans=0.2 2023-06-26 03:25:29,035 INFO [train.py:996] (0/4) Epoch 12, batch 19900, loss[loss=0.216, simple_loss=0.3202, pruned_loss=0.05589, over 21303.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.306, pruned_loss=0.07184, over 4272144.79 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:25:29,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2132046.0, ans=0.125 2023-06-26 03:25:30,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.24 vs. limit=10.0 2023-06-26 03:26:35,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2132166.0, ans=0.125 2023-06-26 03:27:24,350 INFO [train.py:996] (0/4) Epoch 12, batch 19950, loss[loss=0.1982, simple_loss=0.2605, pruned_loss=0.06797, over 21192.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3001, pruned_loss=0.07092, over 4260526.44 frames. ], batch size: 143, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:27:26,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-26 03:27:29,793 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-26 03:27:43,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2132346.0, ans=0.0 2023-06-26 03:28:29,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.346e+02 8.702e+02 1.204e+03 1.763e+03 4.092e+03, threshold=2.408e+03, percent-clipped=5.0 2023-06-26 03:29:08,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2132586.0, ans=0.1 2023-06-26 03:29:17,280 INFO [train.py:996] (0/4) Epoch 12, batch 20000, loss[loss=0.2328, simple_loss=0.3243, pruned_loss=0.07063, over 21713.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3022, pruned_loss=0.07251, over 4268405.58 frames. 
], batch size: 351, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:30:13,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2132766.0, ans=0.0 2023-06-26 03:30:18,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2132826.0, ans=0.0 2023-06-26 03:30:18,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2132826.0, ans=0.125 2023-06-26 03:30:23,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2132826.0, ans=0.1 2023-06-26 03:31:06,177 INFO [train.py:996] (0/4) Epoch 12, batch 20050, loss[loss=0.2208, simple_loss=0.2899, pruned_loss=0.07585, over 21544.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3039, pruned_loss=0.07428, over 4266415.35 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:31:06,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2132946.0, ans=0.125 2023-06-26 03:31:24,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2133006.0, ans=0.0 2023-06-26 03:32:07,813 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.733e+02 9.408e+02 1.346e+03 1.718e+03 3.911e+03, threshold=2.692e+03, percent-clipped=11.0 2023-06-26 03:32:55,549 INFO [train.py:996] (0/4) Epoch 12, batch 20100, loss[loss=0.2258, simple_loss=0.2971, pruned_loss=0.07723, over 21451.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3066, pruned_loss=0.07639, over 4273883.99 frames. ], batch size: 211, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:34:15,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2133426.0, ans=0.125 2023-06-26 03:34:46,180 INFO [train.py:996] (0/4) Epoch 12, batch 20150, loss[loss=0.2559, simple_loss=0.33, pruned_loss=0.09093, over 21763.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3156, pruned_loss=0.08, over 4277616.37 frames. ], batch size: 298, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:36:04,687 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.317e+02 8.895e+02 1.198e+03 1.726e+03 5.010e+03, threshold=2.397e+03, percent-clipped=8.0 2023-06-26 03:36:14,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2133726.0, ans=0.125 2023-06-26 03:36:16,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=2133726.0, ans=15.0 2023-06-26 03:36:36,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2133786.0, ans=0.125 2023-06-26 03:36:53,596 INFO [train.py:996] (0/4) Epoch 12, batch 20200, loss[loss=0.2271, simple_loss=0.3114, pruned_loss=0.07142, over 21682.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3201, pruned_loss=0.08186, over 4272239.79 frames. 
], batch size: 298, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:36:54,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2133846.0, ans=0.125 2023-06-26 03:37:11,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2133906.0, ans=0.125 2023-06-26 03:37:30,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2133906.0, ans=0.05 2023-06-26 03:38:03,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.82 vs. limit=15.0 2023-06-26 03:38:05,577 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:38:15,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2134086.0, ans=0.2 2023-06-26 03:38:43,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2134086.0, ans=0.2 2023-06-26 03:38:46,329 INFO [train.py:996] (0/4) Epoch 12, batch 20250, loss[loss=0.2227, simple_loss=0.3088, pruned_loss=0.06834, over 21788.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3205, pruned_loss=0.08004, over 4274354.72 frames. ], batch size: 282, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:39:49,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.636e+02 8.204e+02 1.224e+03 2.058e+03 5.091e+03, threshold=2.449e+03, percent-clipped=18.0 2023-06-26 03:40:11,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2134326.0, ans=0.0 2023-06-26 03:40:38,154 INFO [train.py:996] (0/4) Epoch 12, batch 20300, loss[loss=0.1929, simple_loss=0.2617, pruned_loss=0.06204, over 21786.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.319, pruned_loss=0.07773, over 4268616.58 frames. ], batch size: 124, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:40:43,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2134446.0, ans=0.125 2023-06-26 03:41:24,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2134566.0, ans=0.125 2023-06-26 03:41:39,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2134626.0, ans=0.125 2023-06-26 03:42:28,315 INFO [train.py:996] (0/4) Epoch 12, batch 20350, loss[loss=0.2267, simple_loss=0.3034, pruned_loss=0.07501, over 21797.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3188, pruned_loss=0.07763, over 4262762.21 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:42:47,116 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.35 vs. 
limit=10.0 2023-06-26 03:42:57,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2134806.0, ans=0.2 2023-06-26 03:43:31,517 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.507e+02 1.038e+03 1.419e+03 2.103e+03 3.160e+03, threshold=2.839e+03, percent-clipped=11.0 2023-06-26 03:43:38,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2134926.0, ans=0.125 2023-06-26 03:43:54,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2134986.0, ans=0.1 2023-06-26 03:44:17,356 INFO [train.py:996] (0/4) Epoch 12, batch 20400, loss[loss=0.2808, simple_loss=0.3523, pruned_loss=0.1047, over 21913.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3214, pruned_loss=0.08035, over 4266112.26 frames. ], batch size: 107, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:44:27,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=15.0 2023-06-26 03:44:31,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2135046.0, ans=0.125 2023-06-26 03:45:07,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2135166.0, ans=0.2 2023-06-26 03:45:09,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135166.0, ans=0.1 2023-06-26 03:45:49,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2135286.0, ans=0.125 2023-06-26 03:46:02,289 INFO [train.py:996] (0/4) Epoch 12, batch 20450, loss[loss=0.2725, simple_loss=0.3316, pruned_loss=0.1067, over 21826.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3218, pruned_loss=0.08304, over 4263211.05 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:46:24,903 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-26 03:46:34,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2135406.0, ans=0.0 2023-06-26 03:47:05,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.613e+02 8.799e+02 1.280e+03 2.012e+03 4.043e+03, threshold=2.560e+03, percent-clipped=9.0 2023-06-26 03:47:48,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2135586.0, ans=0.1 2023-06-26 03:47:52,251 INFO [train.py:996] (0/4) Epoch 12, batch 20500, loss[loss=0.228, simple_loss=0.2941, pruned_loss=0.08096, over 21790.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3184, pruned_loss=0.08345, over 4245410.05 frames. ], batch size: 333, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:49:41,375 INFO [train.py:996] (0/4) Epoch 12, batch 20550, loss[loss=0.2065, simple_loss=0.2601, pruned_loss=0.07641, over 20883.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3107, pruned_loss=0.08227, over 4246723.70 frames. 
], batch size: 608, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:50:01,475 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-356000.pt 2023-06-26 03:50:36,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2136066.0, ans=0.1 2023-06-26 03:50:48,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.549e+02 9.305e+02 1.211e+03 1.898e+03 4.893e+03, threshold=2.421e+03, percent-clipped=7.0 2023-06-26 03:51:32,372 INFO [train.py:996] (0/4) Epoch 12, batch 20600, loss[loss=0.2744, simple_loss=0.3336, pruned_loss=0.1076, over 21329.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3132, pruned_loss=0.08102, over 4250048.80 frames. ], batch size: 143, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:52:21,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2136366.0, ans=0.125 2023-06-26 03:52:37,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2136426.0, ans=0.0 2023-06-26 03:53:02,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2136486.0, ans=0.1 2023-06-26 03:53:18,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2136546.0, ans=0.09899494936611666 2023-06-26 03:53:19,379 INFO [train.py:996] (0/4) Epoch 12, batch 20650, loss[loss=0.1763, simple_loss=0.2477, pruned_loss=0.05244, over 21639.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3094, pruned_loss=0.0808, over 4249598.82 frames. ], batch size: 247, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:53:30,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2136546.0, ans=0.05 2023-06-26 03:54:03,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2136666.0, ans=0.125 2023-06-26 03:54:08,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2136666.0, ans=0.125 2023-06-26 03:54:10,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2136666.0, ans=0.2 2023-06-26 03:54:23,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.638e+02 7.911e+02 1.093e+03 1.392e+03 2.795e+03, threshold=2.187e+03, percent-clipped=3.0 2023-06-26 03:55:06,918 INFO [train.py:996] (0/4) Epoch 12, batch 20700, loss[loss=0.222, simple_loss=0.3318, pruned_loss=0.05606, over 20742.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.302, pruned_loss=0.07771, over 4252607.68 frames. ], batch size: 608, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:55:15,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2136846.0, ans=0.04949747468305833 2023-06-26 03:55:43,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2136906.0, ans=0.0 2023-06-26 03:56:50,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2137086.0, ans=0.125 2023-06-26 03:56:57,169 INFO [train.py:996] (0/4) Epoch 12, batch 20750, loss[loss=0.3302, simple_loss=0.4184, pruned_loss=0.121, over 21575.00 frames. 
], tot_loss[loss=0.2312, simple_loss=0.3072, pruned_loss=0.07759, over 4251787.21 frames. ], batch size: 471, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:57:10,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2137146.0, ans=0.0 2023-06-26 03:57:27,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2137206.0, ans=0.1 2023-06-26 03:57:28,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2137206.0, ans=0.0 2023-06-26 03:57:39,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2137206.0, ans=0.07 2023-06-26 03:58:14,462 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.764e+02 1.037e+03 1.665e+03 2.328e+03 7.151e+03, threshold=3.329e+03, percent-clipped=27.0 2023-06-26 03:58:34,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2137386.0, ans=0.0 2023-06-26 03:58:47,667 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-26 03:58:51,610 INFO [train.py:996] (0/4) Epoch 12, batch 20800, loss[loss=0.2212, simple_loss=0.2905, pruned_loss=0.07599, over 20691.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.31, pruned_loss=0.0782, over 4252025.62 frames. ], batch size: 607, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:59:07,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2137446.0, ans=0.0 2023-06-26 03:59:12,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2137506.0, ans=0.125 2023-06-26 03:59:18,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2137506.0, ans=0.1 2023-06-26 03:59:37,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2137566.0, ans=0.0 2023-06-26 03:59:41,031 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-26 03:59:57,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2137626.0, ans=0.0 2023-06-26 04:00:09,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2137626.0, ans=0.0 2023-06-26 04:00:39,500 INFO [train.py:996] (0/4) Epoch 12, batch 20850, loss[loss=0.2279, simple_loss=0.2944, pruned_loss=0.08069, over 21823.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3015, pruned_loss=0.0758, over 4257183.89 frames. 
], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:00:55,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2137746.0, ans=0.0 2023-06-26 04:01:49,783 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.795e+02 7.660e+02 1.067e+03 1.526e+03 3.659e+03, threshold=2.133e+03, percent-clipped=1.0 2023-06-26 04:02:10,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-26 04:02:27,181 INFO [train.py:996] (0/4) Epoch 12, batch 20900, loss[loss=0.2287, simple_loss=0.3009, pruned_loss=0.07823, over 21253.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3017, pruned_loss=0.07662, over 4267147.39 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:02:36,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2138046.0, ans=0.125 2023-06-26 04:02:41,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2138046.0, ans=0.0 2023-06-26 04:02:53,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2138106.0, ans=0.0 2023-06-26 04:03:06,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-26 04:03:17,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2138166.0, ans=0.125 2023-06-26 04:03:29,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2138166.0, ans=0.1 2023-06-26 04:03:46,359 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-26 04:03:52,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-26 04:04:06,629 INFO [train.py:996] (0/4) Epoch 12, batch 20950, loss[loss=0.2227, simple_loss=0.2929, pruned_loss=0.07621, over 21796.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2987, pruned_loss=0.07372, over 4256863.57 frames. ], batch size: 112, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:04:07,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-26 04:04:14,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2138346.0, ans=0.1 2023-06-26 04:04:39,008 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.98 vs. 
limit=12.0 2023-06-26 04:05:18,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.308e+02 8.469e+02 1.535e+03 2.193e+03 7.053e+03, threshold=3.069e+03, percent-clipped=28.0 2023-06-26 04:05:32,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2138526.0, ans=10.0 2023-06-26 04:05:34,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2138526.0, ans=0.125 2023-06-26 04:05:37,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2138586.0, ans=0.125 2023-06-26 04:05:37,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-26 04:05:47,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2138586.0, ans=0.0 2023-06-26 04:05:53,927 INFO [train.py:996] (0/4) Epoch 12, batch 21000, loss[loss=0.224, simple_loss=0.2884, pruned_loss=0.07978, over 21567.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2977, pruned_loss=0.07457, over 4265340.28 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 04:05:53,928 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 04:06:05,767 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.4423, 3.7872, 3.9979, 4.1571], device='cuda:0') 2023-06-26 04:06:16,506 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2617, simple_loss=0.359, pruned_loss=0.08218, over 1796401.00 frames. 2023-06-26 04:06:16,507 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-26 04:06:19,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2138646.0, ans=0.2 2023-06-26 04:07:18,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2138826.0, ans=0.0 2023-06-26 04:07:52,153 INFO [train.py:996] (0/4) Epoch 12, batch 21050, loss[loss=0.1887, simple_loss=0.2557, pruned_loss=0.06092, over 21757.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2949, pruned_loss=0.07394, over 4270575.09 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 04:07:52,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2138946.0, ans=0.0 2023-06-26 04:07:59,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2138946.0, ans=0.1 2023-06-26 04:08:01,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2138946.0, ans=0.0 2023-06-26 04:08:16,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2139006.0, ans=0.0 2023-06-26 04:08:49,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.75 vs. 
limit=15.0 2023-06-26 04:08:56,407 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.279e+02 7.328e+02 1.060e+03 1.380e+03 3.297e+03, threshold=2.119e+03, percent-clipped=1.0 2023-06-26 04:09:12,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-26 04:09:26,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2023-06-26 04:09:37,028 INFO [train.py:996] (0/4) Epoch 12, batch 21100, loss[loss=0.192, simple_loss=0.2582, pruned_loss=0.06287, over 21321.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2919, pruned_loss=0.07382, over 4272736.18 frames. ], batch size: 177, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 04:10:10,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-26 04:11:08,397 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-26 04:11:19,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-26 04:11:22,950 INFO [train.py:996] (0/4) Epoch 12, batch 21150, loss[loss=0.1979, simple_loss=0.2498, pruned_loss=0.07301, over 20801.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.287, pruned_loss=0.07405, over 4264819.31 frames. ], batch size: 608, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 04:12:15,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2139666.0, ans=0.0 2023-06-26 04:12:25,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2139726.0, ans=0.2 2023-06-26 04:12:27,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2139726.0, ans=0.125 2023-06-26 04:12:28,506 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.534e+02 8.733e+02 1.101e+03 1.484e+03 2.918e+03, threshold=2.203e+03, percent-clipped=8.0 2023-06-26 04:13:08,703 INFO [train.py:996] (0/4) Epoch 12, batch 21200, loss[loss=0.2082, simple_loss=0.2699, pruned_loss=0.07323, over 21305.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2832, pruned_loss=0.07278, over 4263869.81 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:13:12,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2139846.0, ans=0.125 2023-06-26 04:13:43,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2139906.0, ans=0.2 2023-06-26 04:14:05,568 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.73 vs. limit=10.0 2023-06-26 04:14:51,533 INFO [train.py:996] (0/4) Epoch 12, batch 21250, loss[loss=0.2048, simple_loss=0.2611, pruned_loss=0.07424, over 21333.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2812, pruned_loss=0.07227, over 4249464.43 frames. 
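The [optim.py:471] messages in this stretch of the log report Clipping_scale=2.0 together with grad-norm quartiles, a threshold and a percent-clipped figure, and in each case the threshold is roughly 2.0 times the logged median (for example threshold=2.119e+03 vs. 2 x 1.060e+03 just above). That is consistent with a clipping rule of the form threshold = clipping_scale * median(recent gradient norms). The sketch below illustrates such a rule; it is not the project's optimizer code, and the window size, the cumulative percent-clipped bookkeeping and the QuartileGradClipper name are assumptions made for illustration.

```python
import torch
from collections import deque

class QuartileGradClipper:
    """Clip gradients to clipping_scale * median of recently observed gradient norms."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)   # recent total gradient norms
        self.num_clipped = 0
        self.num_steps = 0

    def __call__(self, parameters):
        params = [p for p in parameters if p.grad is not None]
        if not params:
            return
        total_norm = torch.norm(
            torch.stack([p.grad.detach().norm(2) for p in params]), 2
        ).item()
        self.norms.append(total_norm)
        self.num_steps += 1

        norms = torch.tensor(list(self.norms))
        quartiles = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * quartiles[2].item()   # 2.0 x the median

        if total_norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.detach().mul_(threshold / (total_norm + 1.0e-6))

        # Cumulative percent-clipped; the real log reports it per logging interval.
        print(
            f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
            + " ".join(f"{q:.3e}" for q in quartiles.tolist())
            + f", threshold={threshold:.3e}, "
              f"percent-clipped={100.0 * self.num_clipped / self.num_steps:.1f}"
        )
```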
], batch size: 211, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:15:25,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. limit=10.0 2023-06-26 04:16:07,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.686e+02 9.766e+02 1.414e+03 1.888e+03 3.901e+03, threshold=2.827e+03, percent-clipped=19.0 2023-06-26 04:16:24,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2140386.0, ans=0.5 2023-06-26 04:16:26,968 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0 2023-06-26 04:16:39,725 INFO [train.py:996] (0/4) Epoch 12, batch 21300, loss[loss=0.2601, simple_loss=0.3263, pruned_loss=0.09692, over 21909.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2897, pruned_loss=0.07547, over 4255065.57 frames. ], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:17:41,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2140566.0, ans=0.0 2023-06-26 04:18:32,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2140686.0, ans=0.125 2023-06-26 04:18:36,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2140746.0, ans=0.125 2023-06-26 04:18:37,772 INFO [train.py:996] (0/4) Epoch 12, batch 21350, loss[loss=0.2094, simple_loss=0.3041, pruned_loss=0.05739, over 21835.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2942, pruned_loss=0.07607, over 4267476.44 frames. ], batch size: 333, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:18:50,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2140746.0, ans=0.0 2023-06-26 04:19:16,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2140806.0, ans=0.125 2023-06-26 04:19:42,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2140866.0, ans=0.125 2023-06-26 04:19:43,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2140866.0, ans=0.0 2023-06-26 04:19:52,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.340e+02 8.083e+02 1.206e+03 1.998e+03 5.884e+03, threshold=2.412e+03, percent-clipped=11.0 2023-06-26 04:20:03,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2140926.0, ans=0.125 2023-06-26 04:20:15,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2140986.0, ans=0.05 2023-06-26 04:20:34,830 INFO [train.py:996] (0/4) Epoch 12, batch 21400, loss[loss=0.222, simple_loss=0.3011, pruned_loss=0.07141, over 21788.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2984, pruned_loss=0.07603, over 4274016.73 frames. 
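The [scaling.py:182] messages track named ScheduledFloat quantities (dropout probabilities, skip rates, bypass scale minima and similar) whose `ans` value depends on `batch_count`. The log alone does not show how the schedule is defined; the sketch below assumes a piecewise-linear schedule over the batch count, which is one simple mechanism that would produce this kind of output. The class name and the breakpoints in the example are hypothetical.

```python
class PiecewiseLinearSchedule:
    """A float whose value is piecewise-linearly interpolated over the batch count."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs; kept sorted by batch_count.
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        if batch_count <= self.points[0][0]:
            return self.points[0][1]
        if batch_count >= self.points[-1][0]:
            return self.points[-1][1]
        for (x0, y0), (x1, y1) in zip(self.points, self.points[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

# Hypothetical example: a dropout probability that decays from 0.3 to 0.1
# over the first 20k batches and then stays at 0.1.
dropout_p = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1))
print(dropout_p.value(0.0), dropout_p.value(10000.0), dropout_p.value(2138046.0))
# -> 0.3 0.2 0.1
```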
], batch size: 282, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:21:08,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2141106.0, ans=0.125 2023-06-26 04:22:02,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2141286.0, ans=0.09899494936611666 2023-06-26 04:22:23,080 INFO [train.py:996] (0/4) Epoch 12, batch 21450, loss[loss=0.2096, simple_loss=0.2868, pruned_loss=0.06614, over 21553.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3015, pruned_loss=0.07665, over 4282501.09 frames. ], batch size: 131, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:22:49,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2141406.0, ans=0.09899494936611666 2023-06-26 04:22:52,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2141406.0, ans=0.0 2023-06-26 04:22:54,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2141406.0, ans=0.0 2023-06-26 04:22:55,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2141406.0, ans=0.125 2023-06-26 04:22:57,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2141406.0, ans=0.125 2023-06-26 04:23:28,974 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.921e+02 8.614e+02 1.088e+03 1.616e+03 2.799e+03, threshold=2.175e+03, percent-clipped=3.0 2023-06-26 04:23:33,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2141526.0, ans=0.125 2023-06-26 04:24:11,514 INFO [train.py:996] (0/4) Epoch 12, batch 21500, loss[loss=0.2378, simple_loss=0.2928, pruned_loss=0.09137, over 21735.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2997, pruned_loss=0.07796, over 4286562.32 frames. ], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:24:18,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2141646.0, ans=0.2 2023-06-26 04:24:44,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2141706.0, ans=0.125 2023-06-26 04:25:17,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2141826.0, ans=0.125 2023-06-26 04:25:58,246 INFO [train.py:996] (0/4) Epoch 12, batch 21550, loss[loss=0.1759, simple_loss=0.2504, pruned_loss=0.05069, over 21644.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2921, pruned_loss=0.07546, over 4278890.13 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:26:07,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2141946.0, ans=0.0 2023-06-26 04:26:56,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2142066.0, ans=0.05 2023-06-26 04:27:09,921 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.662e+02 8.216e+02 1.112e+03 1.423e+03 3.148e+03, threshold=2.223e+03, percent-clipped=7.0 2023-06-26 04:27:50,661 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. 
limit=10.0 2023-06-26 04:27:50,970 INFO [train.py:996] (0/4) Epoch 12, batch 21600, loss[loss=0.2072, simple_loss=0.2723, pruned_loss=0.07101, over 21632.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2876, pruned_loss=0.07356, over 4278003.30 frames. ], batch size: 264, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:28:37,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2142306.0, ans=0.125 2023-06-26 04:29:33,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2142486.0, ans=0.125 2023-06-26 04:29:43,353 INFO [train.py:996] (0/4) Epoch 12, batch 21650, loss[loss=0.1956, simple_loss=0.2886, pruned_loss=0.05133, over 21263.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2945, pruned_loss=0.07183, over 4274531.75 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:30:12,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2142606.0, ans=0.0 2023-06-26 04:30:28,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2142666.0, ans=0.04949747468305833 2023-06-26 04:30:56,283 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 7.832e+02 1.347e+03 2.277e+03 5.515e+03, threshold=2.694e+03, percent-clipped=27.0 2023-06-26 04:31:30,525 INFO [train.py:996] (0/4) Epoch 12, batch 21700, loss[loss=0.1977, simple_loss=0.2905, pruned_loss=0.0524, over 21593.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2948, pruned_loss=0.07019, over 4266578.30 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:31:32,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2142846.0, ans=0.125 2023-06-26 04:33:20,630 INFO [train.py:996] (0/4) Epoch 12, batch 21750, loss[loss=0.1805, simple_loss=0.2361, pruned_loss=0.06243, over 20782.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2892, pruned_loss=0.07007, over 4275051.36 frames. ], batch size: 608, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:33:27,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-26 04:33:29,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-26 04:33:43,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2143206.0, ans=0.125 2023-06-26 04:34:30,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.472e+02 7.650e+02 1.064e+03 1.573e+03 4.038e+03, threshold=2.129e+03, percent-clipped=2.0 2023-06-26 04:35:08,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2143386.0, ans=0.035 2023-06-26 04:35:12,911 INFO [train.py:996] (0/4) Epoch 12, batch 21800, loss[loss=0.267, simple_loss=0.3464, pruned_loss=0.09381, over 21450.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.286, pruned_loss=0.07107, over 4273091.21 frames. ], batch size: 473, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:35:41,753 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.72 vs. 
limit=12.0 2023-06-26 04:36:02,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2143566.0, ans=0.125 2023-06-26 04:36:53,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2143686.0, ans=0.09899494936611666 2023-06-26 04:37:09,714 INFO [train.py:996] (0/4) Epoch 12, batch 21850, loss[loss=0.2092, simple_loss=0.2859, pruned_loss=0.06625, over 21804.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2933, pruned_loss=0.07177, over 4279742.13 frames. ], batch size: 282, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:37:27,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2143806.0, ans=0.125 2023-06-26 04:38:17,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.530e+02 9.705e+02 1.349e+03 2.023e+03 4.101e+03, threshold=2.697e+03, percent-clipped=20.0 2023-06-26 04:38:59,623 INFO [train.py:996] (0/4) Epoch 12, batch 21900, loss[loss=0.2092, simple_loss=0.2853, pruned_loss=0.06657, over 21552.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2934, pruned_loss=0.07269, over 4262673.89 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:39:10,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2144046.0, ans=0.0 2023-06-26 04:39:10,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2144046.0, ans=0.125 2023-06-26 04:39:21,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2144106.0, ans=10.0 2023-06-26 04:40:09,950 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=22.5 2023-06-26 04:40:41,124 INFO [train.py:996] (0/4) Epoch 12, batch 21950, loss[loss=0.1734, simple_loss=0.2644, pruned_loss=0.04124, over 21198.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2875, pruned_loss=0.07143, over 4252978.69 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:41:30,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2144466.0, ans=0.07 2023-06-26 04:41:57,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.471e+02 7.504e+02 1.022e+03 1.599e+03 3.109e+03, threshold=2.043e+03, percent-clipped=2.0 2023-06-26 04:42:33,047 INFO [train.py:996] (0/4) Epoch 12, batch 22000, loss[loss=0.2278, simple_loss=0.2928, pruned_loss=0.08143, over 21723.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2821, pruned_loss=0.06872, over 4257424.92 frames. ], batch size: 333, lr: 2.40e-03, grad_scale: 32.0 2023-06-26 04:44:30,597 INFO [train.py:996] (0/4) Epoch 12, batch 22050, loss[loss=0.251, simple_loss=0.3308, pruned_loss=0.08561, over 21622.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2881, pruned_loss=0.07104, over 4258389.32 frames. 
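The [scaling.py:962] Whitening messages compare a per-module metric against a limit (metric=3.72 vs. limit=12.0 above). A plausible reading, though not confirmed by the log itself, is that the metric measures how far the feature covariance is from a scaled identity: with eigenvalues lambda of the covariance it would be E[lambda^2] / (E[lambda])^2, which equals 1.0 for perfectly whitened features and grows as the variance concentrates in a few directions. A small sketch of that statistic, under that assumption:

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels). Returns E[lambda^2] / (E[lambda])^2 for the
    eigenvalues of the per-group feature covariance; 1.0 means fully whitened."""
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    group_size = num_channels // num_groups
    x = x - x.mean(dim=0, keepdim=True)                       # centre the features
    x = x.reshape(num_frames, num_groups, group_size).transpose(0, 1)
    cov = torch.matmul(x.transpose(1, 2), x) / num_frames      # (groups, c, c)
    trace = cov.diagonal(dim1=1, dim2=2).sum(dim=1)            # sum of eigenvalues
    trace_sq = (cov * cov).sum(dim=(1, 2))                     # sum of squared eigenvalues
    return (group_size * trace_sq / (trace * trace)).mean().item()

x = torch.randn(10000, 256)
print(whitening_metric(x))                           # close to 1.0 for white features
print(whitening_metric(x @ torch.randn(256, 256)))   # noticeably larger after random mixing
```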
], batch size: 230, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:45:34,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2145126.0, ans=0.2 2023-06-26 04:45:35,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2145126.0, ans=0.125 2023-06-26 04:45:47,961 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.424e+02 9.779e+02 1.604e+03 2.153e+03 5.995e+03, threshold=3.207e+03, percent-clipped=28.0 2023-06-26 04:46:20,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2145246.0, ans=0.0 2023-06-26 04:46:21,816 INFO [train.py:996] (0/4) Epoch 12, batch 22100, loss[loss=0.2641, simple_loss=0.3337, pruned_loss=0.09725, over 21652.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2991, pruned_loss=0.07599, over 4258232.30 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:46:51,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2145306.0, ans=0.0 2023-06-26 04:46:53,556 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0 2023-06-26 04:47:40,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2145426.0, ans=0.0 2023-06-26 04:48:09,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2145486.0, ans=0.2 2023-06-26 04:48:11,845 INFO [train.py:996] (0/4) Epoch 12, batch 22150, loss[loss=0.249, simple_loss=0.3141, pruned_loss=0.09199, over 21791.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3007, pruned_loss=0.07709, over 4269182.21 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:48:31,856 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.89 vs. limit=10.0 2023-06-26 04:49:05,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2145666.0, ans=0.125 2023-06-26 04:49:27,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.601e+02 7.424e+02 1.002e+03 1.530e+03 2.924e+03, threshold=2.004e+03, percent-clipped=0.0 2023-06-26 04:49:29,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2145726.0, ans=0.0 2023-06-26 04:49:31,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2145726.0, ans=0.0 2023-06-26 04:49:36,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2145786.0, ans=0.1 2023-06-26 04:50:01,396 INFO [train.py:996] (0/4) Epoch 12, batch 22200, loss[loss=0.2593, simple_loss=0.3178, pruned_loss=0.1004, over 22030.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3021, pruned_loss=0.07756, over 4277417.66 frames. 
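In the per-batch records above, the loss tuples are consistent with a fixed weighting of the two transducer terms, loss = 0.5 * simple_loss + pruned_loss; for instance the batch 22200 record gives 0.5 * 0.3178 + 0.1004 = 0.2593, matching the logged value. The 0.5 weight is inferred from the numbers rather than read from the training code. A quick check against a few triples copied from the log:

```python
# (loss, simple_loss, pruned_loss) triples copied from the records above.
records = [
    (0.249,  0.3141, 0.09199),   # Epoch 12, batch 22150, single-cut loss
    (0.2593, 0.3178, 0.1004),    # Epoch 12, batch 22200, single-cut loss
    (0.2286, 0.3021, 0.07756),   # Epoch 12, batch 22200, tot_loss
]

for loss, simple, pruned in records:
    recon = 0.5 * simple + pruned
    print(f"logged={loss:.4f}  0.5*simple+pruned={recon:.4f}  diff={abs(loss - recon):.4f}")
```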
], batch size: 416, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:50:09,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2145846.0, ans=0.0 2023-06-26 04:50:11,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2145846.0, ans=0.07 2023-06-26 04:50:14,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2145846.0, ans=0.125 2023-06-26 04:50:54,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2145966.0, ans=0.5 2023-06-26 04:51:06,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2145966.0, ans=0.125 2023-06-26 04:51:26,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.34 vs. limit=10.0 2023-06-26 04:51:41,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2146086.0, ans=0.95 2023-06-26 04:51:52,896 INFO [train.py:996] (0/4) Epoch 12, batch 22250, loss[loss=0.2728, simple_loss=0.3423, pruned_loss=0.1016, over 21399.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3091, pruned_loss=0.07902, over 4281410.28 frames. ], batch size: 159, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:52:12,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2146146.0, ans=0.0 2023-06-26 04:52:28,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2146206.0, ans=0.2 2023-06-26 04:52:29,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2146206.0, ans=0.0 2023-06-26 04:53:01,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2146326.0, ans=0.2 2023-06-26 04:53:04,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2146326.0, ans=0.05 2023-06-26 04:53:10,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.929e+02 1.009e+03 1.467e+03 2.182e+03 5.502e+03, threshold=2.934e+03, percent-clipped=31.0 2023-06-26 04:53:36,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2146386.0, ans=0.05 2023-06-26 04:53:36,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=22.5 2023-06-26 04:53:42,562 INFO [train.py:996] (0/4) Epoch 12, batch 22300, loss[loss=0.2251, simple_loss=0.2927, pruned_loss=0.07878, over 21531.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3105, pruned_loss=0.08111, over 4291739.96 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:54:06,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-26 04:54:25,138 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.95 vs. 
limit=15.0 2023-06-26 04:54:33,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=2146566.0, ans=22.5 2023-06-26 04:54:51,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2146626.0, ans=0.0 2023-06-26 04:54:55,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2146626.0, ans=0.125 2023-06-26 04:54:55,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-26 04:55:17,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=2146686.0, ans=0.02 2023-06-26 04:55:32,917 INFO [train.py:996] (0/4) Epoch 12, batch 22350, loss[loss=0.2377, simple_loss=0.3101, pruned_loss=0.08261, over 21864.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3079, pruned_loss=0.08172, over 4302309.46 frames. ], batch size: 351, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:56:30,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2146866.0, ans=0.125 2023-06-26 04:56:33,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2146866.0, ans=0.125 2023-06-26 04:56:35,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2146866.0, ans=0.04949747468305833 2023-06-26 04:56:36,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2146866.0, ans=0.05 2023-06-26 04:56:57,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.431e+02 9.155e+02 1.132e+03 1.676e+03 3.302e+03, threshold=2.265e+03, percent-clipped=3.0 2023-06-26 04:57:23,263 INFO [train.py:996] (0/4) Epoch 12, batch 22400, loss[loss=0.2315, simple_loss=0.2967, pruned_loss=0.0831, over 21760.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3047, pruned_loss=0.07844, over 4299989.92 frames. ], batch size: 118, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:57:51,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2147106.0, ans=0.0 2023-06-26 04:58:10,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2147106.0, ans=0.2 2023-06-26 04:59:05,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2147286.0, ans=0.125 2023-06-26 04:59:13,358 INFO [train.py:996] (0/4) Epoch 12, batch 22450, loss[loss=0.1874, simple_loss=0.2368, pruned_loss=0.069, over 20774.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2994, pruned_loss=0.07807, over 4286668.51 frames. 
], batch size: 608, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:59:32,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2147346.0, ans=0.125 2023-06-26 05:00:37,267 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.261e+02 8.888e+02 1.140e+03 1.642e+03 4.602e+03, threshold=2.279e+03, percent-clipped=11.0 2023-06-26 05:00:38,388 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=10.0 2023-06-26 05:01:13,404 INFO [train.py:996] (0/4) Epoch 12, batch 22500, loss[loss=0.2971, simple_loss=0.3745, pruned_loss=0.1099, over 21574.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2961, pruned_loss=0.07791, over 4274460.45 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:01:19,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2147646.0, ans=0.0 2023-06-26 05:01:50,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2147706.0, ans=0.125 2023-06-26 05:02:40,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2147886.0, ans=0.5 2023-06-26 05:03:03,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2147946.0, ans=0.1 2023-06-26 05:03:04,507 INFO [train.py:996] (0/4) Epoch 12, batch 22550, loss[loss=0.2269, simple_loss=0.3042, pruned_loss=0.07476, over 21909.00 frames. ], tot_loss[loss=0.229, simple_loss=0.302, pruned_loss=0.07803, over 4281750.53 frames. ], batch size: 414, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:03:17,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2147946.0, ans=0.0 2023-06-26 05:03:39,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2148006.0, ans=0.125 2023-06-26 05:04:03,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2148066.0, ans=0.1 2023-06-26 05:04:18,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.902e+02 9.532e+02 1.362e+03 1.911e+03 4.517e+03, threshold=2.723e+03, percent-clipped=17.0 2023-06-26 05:05:00,754 INFO [train.py:996] (0/4) Epoch 12, batch 22600, loss[loss=0.2472, simple_loss=0.3368, pruned_loss=0.07878, over 21681.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3048, pruned_loss=0.07863, over 4287745.86 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:05:13,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2148246.0, ans=0.0 2023-06-26 05:05:49,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2148366.0, ans=0.0 2023-06-26 05:06:45,103 INFO [train.py:996] (0/4) Epoch 12, batch 22650, loss[loss=0.2291, simple_loss=0.3462, pruned_loss=0.05603, over 19774.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3012, pruned_loss=0.0783, over 4281502.81 frames. 
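The grad_scale field in these records moves between 8.0, 16.0 and 32.0 rather than staying fixed, which is the signature of dynamic loss scaling for mixed-precision training: the scale is reduced when a step produces inf/nan gradients and is grown back after a run of clean steps. Below is a minimal training-step sketch using PyTorch's stock GradScaler; model, optimizer, loss_fn and the batch tensors are placeholders, and this is not the training loop that produced this log.

```python
import torch

def train_step(model, optimizer, scaler, features, targets, loss_fn):
    """One mixed-precision step; `scaler` is a torch.cuda.amp.GradScaler."""
    optimizer.zero_grad()
    # Forward pass under autocast so matmuls/convolutions run in reduced precision.
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(features), targets)
    # Scale the loss so small fp16 gradients do not underflow in backward().
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # the step is skipped if scaled gradients contain inf/nan
    scaler.update()          # shrink the scale on overflow, grow it after clean steps
    return loss.detach(), scaler.get_scale()

# The scale then moves over powers of two (e.g. 16 -> 32, or back to 8 on overflow).
scaler = torch.cuda.amp.GradScaler(init_scale=16.0)
```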
], batch size: 703, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:06:56,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2148546.0, ans=0.1 2023-06-26 05:07:13,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2148606.0, ans=0.0 2023-06-26 05:07:18,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2148606.0, ans=0.125 2023-06-26 05:07:29,307 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.58 vs. limit=10.0 2023-06-26 05:07:44,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2148726.0, ans=0.125 2023-06-26 05:07:52,290 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.795e+02 9.798e+02 1.388e+03 1.950e+03 5.687e+03, threshold=2.777e+03, percent-clipped=13.0 2023-06-26 05:08:31,896 INFO [train.py:996] (0/4) Epoch 12, batch 22700, loss[loss=0.2549, simple_loss=0.3069, pruned_loss=0.1014, over 21536.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2947, pruned_loss=0.07744, over 4276746.25 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:08:59,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2148906.0, ans=0.025 2023-06-26 05:09:10,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2148906.0, ans=0.0 2023-06-26 05:09:10,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2148906.0, ans=0.125 2023-06-26 05:09:29,418 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-26 05:09:48,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-26 05:09:58,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2149086.0, ans=0.04949747468305833 2023-06-26 05:10:24,741 INFO [train.py:996] (0/4) Epoch 12, batch 22750, loss[loss=0.2401, simple_loss=0.3132, pruned_loss=0.08345, over 21952.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2958, pruned_loss=0.07908, over 4283025.37 frames. 
], batch size: 372, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:10:32,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=2149146.0, ans=15.0 2023-06-26 05:10:37,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2149146.0, ans=0.125 2023-06-26 05:11:16,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2149266.0, ans=0.125 2023-06-26 05:11:20,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2149326.0, ans=0.1 2023-06-26 05:11:44,539 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.150e+02 7.737e+02 1.012e+03 1.483e+03 2.915e+03, threshold=2.025e+03, percent-clipped=0.0 2023-06-26 05:12:14,846 INFO [train.py:996] (0/4) Epoch 12, batch 22800, loss[loss=0.2569, simple_loss=0.323, pruned_loss=0.09537, over 21897.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3011, pruned_loss=0.08141, over 4279784.44 frames. ], batch size: 316, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:12:46,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2149506.0, ans=0.0 2023-06-26 05:12:50,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2149506.0, ans=0.04949747468305833 2023-06-26 05:14:00,074 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-26 05:14:03,888 INFO [train.py:996] (0/4) Epoch 12, batch 22850, loss[loss=0.213, simple_loss=0.2722, pruned_loss=0.07691, over 21546.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.2979, pruned_loss=0.08068, over 4273538.25 frames. ], batch size: 414, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:14:11,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2149746.0, ans=0.0 2023-06-26 05:15:23,330 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.516e+02 9.685e+02 1.570e+03 2.544e+03 4.880e+03, threshold=3.139e+03, percent-clipped=35.0 2023-06-26 05:15:25,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2149926.0, ans=0.125 2023-06-26 05:15:54,415 INFO [train.py:996] (0/4) Epoch 12, batch 22900, loss[loss=0.2182, simple_loss=0.3293, pruned_loss=0.0536, over 21690.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3001, pruned_loss=0.07976, over 4271322.47 frames. ], batch size: 298, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:16:12,924 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-26 05:17:27,891 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-26 05:17:53,798 INFO [train.py:996] (0/4) Epoch 12, batch 22950, loss[loss=0.2319, simple_loss=0.2879, pruned_loss=0.08793, over 20356.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3116, pruned_loss=0.07774, over 4267841.69 frames. 
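Each [train.py:996] record carries two loss tuples: one over the frames of a single cut (a few tens of thousands of frames) and a tot_loss over roughly 4.2-4.3 million frames, i.e. an aggregate across many recent batches weighted by frame count. The exact bookkeeping used by the script is not visible in these lines; the sketch below shows one straightforward way to maintain such frame-weighted averages, using two triples copied from earlier records as example input.

```python
class FrameWeightedAverage:
    """Accumulate loss values weighted by the number of acoustic frames they cover."""

    def __init__(self):
        self.sums = {}      # metric name -> sum of value * frames
        self.frames = 0.0   # total frames seen so far

    def update(self, losses: dict, num_frames: float) -> None:
        for name, value in losses.items():
            self.sums[name] = self.sums.get(name, 0.0) + value * num_frames
        self.frames += num_frames

    def averages(self) -> dict:
        return {name: s / self.frames for name, s in self.sums.items()}

# Example input taken from two earlier records (batches 22350 and 22400).
tracker = FrameWeightedAverage()
tracker.update({"loss": 0.2377, "simple_loss": 0.3101, "pruned_loss": 0.08261}, 21864.0)
tracker.update({"loss": 0.2315, "simple_loss": 0.2967, "pruned_loss": 0.0831}, 21760.0)
print(tracker.averages(), "over", tracker.frames, "frames")
```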
], batch size: 703, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:18:02,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2150346.0, ans=0.125 2023-06-26 05:18:17,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2150406.0, ans=0.125 2023-06-26 05:18:26,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-26 05:18:54,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2150526.0, ans=0.125 2023-06-26 05:18:58,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2150526.0, ans=0.0 2023-06-26 05:19:12,500 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.963e+02 8.459e+02 1.367e+03 2.050e+03 4.078e+03, threshold=2.734e+03, percent-clipped=4.0 2023-06-26 05:19:27,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2150586.0, ans=0.2 2023-06-26 05:19:42,505 INFO [train.py:996] (0/4) Epoch 12, batch 23000, loss[loss=0.2247, simple_loss=0.3047, pruned_loss=0.07236, over 21910.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.31, pruned_loss=0.07539, over 4275888.32 frames. ], batch size: 333, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:19:53,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2150646.0, ans=0.125 2023-06-26 05:19:57,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2150646.0, ans=0.1 2023-06-26 05:20:07,970 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-26 05:20:47,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.38 vs. limit=22.5 2023-06-26 05:21:26,352 INFO [train.py:996] (0/4) Epoch 12, batch 23050, loss[loss=0.2427, simple_loss=0.3152, pruned_loss=0.08514, over 21444.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3114, pruned_loss=0.07744, over 4275550.32 frames. ], batch size: 194, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:21:52,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2151006.0, ans=0.0 2023-06-26 05:22:05,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2151006.0, ans=0.015 2023-06-26 05:22:09,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2151006.0, ans=0.1 2023-06-26 05:22:24,178 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-26 05:22:54,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.118e+02 9.048e+02 1.306e+03 1.787e+03 2.952e+03, threshold=2.611e+03, percent-clipped=5.0 2023-06-26 05:23:19,190 INFO [train.py:996] (0/4) Epoch 12, batch 23100, loss[loss=0.1784, simple_loss=0.2465, pruned_loss=0.05511, over 21787.00 frames. 
], tot_loss[loss=0.2328, simple_loss=0.308, pruned_loss=0.07881, over 4275307.89 frames. ], batch size: 317, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:23:54,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2151306.0, ans=0.125 2023-06-26 05:23:54,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2151306.0, ans=0.0 2023-06-26 05:24:01,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2151306.0, ans=0.125 2023-06-26 05:24:29,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2151426.0, ans=0.1 2023-06-26 05:24:54,049 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-26 05:25:08,274 INFO [train.py:996] (0/4) Epoch 12, batch 23150, loss[loss=0.1907, simple_loss=0.2568, pruned_loss=0.06225, over 21519.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3021, pruned_loss=0.07794, over 4283151.05 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:26:25,821 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.357e+02 7.610e+02 9.726e+02 1.363e+03 3.124e+03, threshold=1.945e+03, percent-clipped=3.0 2023-06-26 05:26:37,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-26 05:26:44,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2151786.0, ans=0.125 2023-06-26 05:26:55,442 INFO [train.py:996] (0/4) Epoch 12, batch 23200, loss[loss=0.2345, simple_loss=0.2954, pruned_loss=0.08685, over 21825.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2998, pruned_loss=0.07774, over 4285925.22 frames. ], batch size: 282, lr: 2.39e-03, grad_scale: 32.0 2023-06-26 05:27:35,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2151906.0, ans=0.125 2023-06-26 05:28:41,943 INFO [train.py:996] (0/4) Epoch 12, batch 23250, loss[loss=0.2873, simple_loss=0.3403, pruned_loss=0.1172, over 21640.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2992, pruned_loss=0.07835, over 4286086.22 frames. ], batch size: 507, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:28:42,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2152146.0, ans=0.125 2023-06-26 05:30:12,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.710e+02 8.107e+02 1.074e+03 1.814e+03 3.794e+03, threshold=2.148e+03, percent-clipped=19.0 2023-06-26 05:30:35,137 INFO [train.py:996] (0/4) Epoch 12, batch 23300, loss[loss=0.2631, simple_loss=0.3688, pruned_loss=0.07872, over 21676.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3068, pruned_loss=0.07989, over 4290973.25 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:30:39,614 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.45 vs. 
limit=15.0 2023-06-26 05:30:39,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2023-06-26 05:30:43,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2152446.0, ans=15.0 2023-06-26 05:32:17,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2152686.0, ans=0.0 2023-06-26 05:32:19,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2152686.0, ans=0.1 2023-06-26 05:32:31,089 INFO [train.py:996] (0/4) Epoch 12, batch 23350, loss[loss=0.1768, simple_loss=0.2577, pruned_loss=0.04796, over 21753.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3098, pruned_loss=0.07837, over 4288778.05 frames. ], batch size: 282, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:32:45,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2152746.0, ans=0.1 2023-06-26 05:33:54,024 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.865e+02 9.116e+02 1.311e+03 1.870e+03 4.347e+03, threshold=2.623e+03, percent-clipped=16.0 2023-06-26 05:34:21,120 INFO [train.py:996] (0/4) Epoch 12, batch 23400, loss[loss=0.1942, simple_loss=0.2973, pruned_loss=0.04557, over 20779.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3044, pruned_loss=0.07532, over 4281316.14 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:34:22,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2153046.0, ans=0.1 2023-06-26 05:34:45,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2153046.0, ans=0.0 2023-06-26 05:34:45,836 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:35:02,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2153106.0, ans=0.125 2023-06-26 05:35:18,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2153166.0, ans=0.0 2023-06-26 05:36:04,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2153286.0, ans=0.125 2023-06-26 05:36:11,444 INFO [train.py:996] (0/4) Epoch 12, batch 23450, loss[loss=0.2495, simple_loss=0.322, pruned_loss=0.08848, over 21894.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3065, pruned_loss=0.07833, over 4280480.72 frames. ], batch size: 371, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:37:08,477 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.09 vs. 
limit=12.0 2023-06-26 05:37:31,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.599e+02 8.250e+02 1.125e+03 1.414e+03 2.942e+03, threshold=2.251e+03, percent-clipped=1.0 2023-06-26 05:37:34,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2153526.0, ans=0.09899494936611666 2023-06-26 05:37:50,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2153586.0, ans=0.125 2023-06-26 05:38:03,790 INFO [train.py:996] (0/4) Epoch 12, batch 23500, loss[loss=0.19, simple_loss=0.2853, pruned_loss=0.04739, over 19926.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3046, pruned_loss=0.07907, over 4275449.04 frames. ], batch size: 702, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:38:33,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2153706.0, ans=0.125 2023-06-26 05:38:54,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2153766.0, ans=0.2 2023-06-26 05:39:30,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-26 05:39:32,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2153886.0, ans=0.125 2023-06-26 05:39:52,423 INFO [train.py:996] (0/4) Epoch 12, batch 23550, loss[loss=0.1883, simple_loss=0.2369, pruned_loss=0.0698, over 20709.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3002, pruned_loss=0.07908, over 4276472.26 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:40:36,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-06-26 05:41:09,216 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.684e+02 8.231e+02 1.204e+03 1.979e+03 6.408e+03, threshold=2.407e+03, percent-clipped=19.0 2023-06-26 05:41:48,764 INFO [train.py:996] (0/4) Epoch 12, batch 23600, loss[loss=0.2672, simple_loss=0.3393, pruned_loss=0.09755, over 21574.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3012, pruned_loss=0.07891, over 4269851.95 frames. ], batch size: 415, lr: 2.39e-03, grad_scale: 32.0 2023-06-26 05:43:34,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2154486.0, ans=0.125 2023-06-26 05:43:38,371 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-26 05:43:38,468 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-26 05:43:40,738 INFO [train.py:996] (0/4) Epoch 12, batch 23650, loss[loss=0.2372, simple_loss=0.3174, pruned_loss=0.07854, over 21758.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3028, pruned_loss=0.07777, over 4267792.44 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:43:42,223 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.79 vs. 
limit=15.0 2023-06-26 05:43:57,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2154546.0, ans=0.125 2023-06-26 05:44:09,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2154606.0, ans=0.125 2023-06-26 05:44:31,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2154666.0, ans=0.1 2023-06-26 05:45:11,514 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.083e+02 1.007e+03 1.364e+03 1.897e+03 4.250e+03, threshold=2.728e+03, percent-clipped=14.0 2023-06-26 05:45:36,470 INFO [train.py:996] (0/4) Epoch 12, batch 23700, loss[loss=0.2067, simple_loss=0.2933, pruned_loss=0.06006, over 20673.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3052, pruned_loss=0.07721, over 4271744.74 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:45:42,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2154846.0, ans=0.95 2023-06-26 05:45:49,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2154846.0, ans=0.125 2023-06-26 05:46:32,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-26 05:47:25,017 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-26 05:47:26,997 INFO [train.py:996] (0/4) Epoch 12, batch 23750, loss[loss=0.2557, simple_loss=0.3373, pruned_loss=0.08707, over 21843.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3096, pruned_loss=0.07874, over 4278462.88 frames. ], batch size: 118, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:48:45,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2155326.0, ans=0.0 2023-06-26 05:48:53,519 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-26 05:48:55,967 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.447e+02 9.334e+02 1.253e+03 1.718e+03 3.362e+03, threshold=2.506e+03, percent-clipped=5.0 2023-06-26 05:49:12,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-26 05:49:21,251 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-26 05:49:21,872 INFO [train.py:996] (0/4) Epoch 12, batch 23800, loss[loss=0.2966, simple_loss=0.394, pruned_loss=0.09958, over 21455.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3084, pruned_loss=0.07603, over 4275098.54 frames. 
], batch size: 471, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:50:18,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2155566.0, ans=0.125 2023-06-26 05:50:57,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2155686.0, ans=0.0 2023-06-26 05:51:01,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2155686.0, ans=0.1 2023-06-26 05:51:01,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2155686.0, ans=0.05 2023-06-26 05:51:05,673 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.11 vs. limit=15.0 2023-06-26 05:51:16,628 INFO [train.py:996] (0/4) Epoch 12, batch 23850, loss[loss=0.246, simple_loss=0.3303, pruned_loss=0.08086, over 21624.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3148, pruned_loss=0.07765, over 4273776.40 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:52:05,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2155806.0, ans=0.1 2023-06-26 05:52:12,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=2155866.0, ans=0.5 2023-06-26 05:52:16,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2155866.0, ans=0.0 2023-06-26 05:52:38,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2155926.0, ans=0.125 2023-06-26 05:52:49,989 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.758e+02 1.136e+03 1.834e+03 2.559e+03 6.160e+03, threshold=3.668e+03, percent-clipped=28.0 2023-06-26 05:52:59,993 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=22.5 2023-06-26 05:53:04,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2155986.0, ans=0.125 2023-06-26 05:53:14,987 INFO [train.py:996] (0/4) Epoch 12, batch 23900, loss[loss=0.2574, simple_loss=0.331, pruned_loss=0.09193, over 21734.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3214, pruned_loss=0.07966, over 4275943.33 frames. ], batch size: 118, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:53:46,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2156106.0, ans=0.0 2023-06-26 05:53:51,001 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.13 vs. 
limit=10.0 2023-06-26 05:53:57,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2156106.0, ans=0.0 2023-06-26 05:54:09,603 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:54:40,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2156286.0, ans=0.125 2023-06-26 05:55:04,329 INFO [train.py:996] (0/4) Epoch 12, batch 23950, loss[loss=0.1934, simple_loss=0.2628, pruned_loss=0.06207, over 21739.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3152, pruned_loss=0.0794, over 4269164.47 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:55:23,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2156346.0, ans=0.125 2023-06-26 05:55:30,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2156346.0, ans=0.125 2023-06-26 05:55:35,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=22.5 2023-06-26 05:55:39,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2156406.0, ans=0.125 2023-06-26 05:56:23,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2156526.0, ans=0.125 2023-06-26 05:56:32,157 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.839e+02 7.798e+02 1.081e+03 1.467e+03 3.120e+03, threshold=2.162e+03, percent-clipped=0.0 2023-06-26 05:56:33,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2156586.0, ans=0.07 2023-06-26 05:56:36,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2156586.0, ans=0.2 2023-06-26 05:57:06,281 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:57:07,636 INFO [train.py:996] (0/4) Epoch 12, batch 24000, loss[loss=0.2874, simple_loss=0.3612, pruned_loss=0.1068, over 21748.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3162, pruned_loss=0.08222, over 4258245.21 frames. ], batch size: 124, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:57:07,637 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 05:57:21,786 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.0247, 4.3296, 2.7337, 2.1234], device='cuda:0') 2023-06-26 05:57:25,647 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2659, simple_loss=0.36, pruned_loss=0.08593, over 1796401.00 frames. 2023-06-26 05:57:25,648 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-26 05:58:02,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2156706.0, ans=0.0 2023-06-26 05:58:05,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2156706.0, ans=0.2 2023-06-26 05:59:15,059 INFO [train.py:996] (0/4) Epoch 12, batch 24050, loss[loss=0.2041, simple_loss=0.2928, pruned_loss=0.05766, over 21688.00 frames. 
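During the validation pass the script also prints an attn_weights_entropy tensor for selected self-attention modules (four values in the example above, presumably one per head). The reduction used in zipformer.py is not shown here; a common definition is the entropy -sum(p * log p) of each attention distribution, averaged over queries, and the sketch below computes exactly that under this assumption.

```python
import torch

def attention_entropy(attn_weights: torch.Tensor) -> torch.Tensor:
    """attn_weights: (num_heads, num_queries, num_keys) with rows summing to 1.
    Returns the mean attention entropy per head, in nats."""
    eps = 1.0e-20
    ent = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)  # (heads, queries)
    return ent.mean(dim=-1)                                          # (heads,)

# Example: 4 heads attending over 50 keys; uniform attention would give log(50) ~ 3.91.
weights = torch.randn(4, 10, 50).softmax(dim=-1)
print(attention_entropy(weights))
```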
], tot_loss[loss=0.2389, simple_loss=0.3153, pruned_loss=0.08124, over 4257878.70 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:59:25,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2156946.0, ans=0.1 2023-06-26 05:59:43,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2157006.0, ans=0.2 2023-06-26 06:00:45,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.641e+02 9.752e+02 1.281e+03 1.752e+03 4.034e+03, threshold=2.563e+03, percent-clipped=15.0 2023-06-26 06:01:05,425 INFO [train.py:996] (0/4) Epoch 12, batch 24100, loss[loss=0.2509, simple_loss=0.3207, pruned_loss=0.09057, over 21419.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3148, pruned_loss=0.0795, over 4255102.33 frames. ], batch size: 131, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:02:00,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=15.0 2023-06-26 06:02:57,854 INFO [train.py:996] (0/4) Epoch 12, batch 24150, loss[loss=0.1912, simple_loss=0.2616, pruned_loss=0.06035, over 21499.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.315, pruned_loss=0.08121, over 4268583.36 frames. ], batch size: 194, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:02:59,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-26 06:03:35,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2157606.0, ans=0.5 2023-06-26 06:03:46,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2157666.0, ans=0.07 2023-06-26 06:04:04,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2157666.0, ans=0.2 2023-06-26 06:04:30,163 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.135e+02 1.050e+03 1.414e+03 1.892e+03 4.392e+03, threshold=2.829e+03, percent-clipped=9.0 2023-06-26 06:04:55,068 INFO [train.py:996] (0/4) Epoch 12, batch 24200, loss[loss=0.277, simple_loss=0.3634, pruned_loss=0.09524, over 21624.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3192, pruned_loss=0.08369, over 4270095.43 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:06:00,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2157966.0, ans=0.025 2023-06-26 06:06:11,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2158026.0, ans=0.0 2023-06-26 06:06:47,671 INFO [train.py:996] (0/4) Epoch 12, batch 24250, loss[loss=0.1767, simple_loss=0.2679, pruned_loss=0.04272, over 21235.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3159, pruned_loss=0.0782, over 4274978.55 frames. 
], batch size: 159, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:07:23,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2158206.0, ans=0.125 2023-06-26 06:07:27,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=2158206.0, ans=0.5 2023-06-26 06:08:18,616 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.845e+02 9.840e+02 1.673e+03 2.724e+03 4.672e+03, threshold=3.346e+03, percent-clipped=24.0 2023-06-26 06:08:23,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.78 vs. limit=12.0 2023-06-26 06:08:37,624 INFO [train.py:996] (0/4) Epoch 12, batch 24300, loss[loss=0.1602, simple_loss=0.2466, pruned_loss=0.03693, over 21710.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3103, pruned_loss=0.07333, over 4272931.11 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:08:38,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2158446.0, ans=0.0 2023-06-26 06:09:26,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2158506.0, ans=0.125 2023-06-26 06:10:28,377 INFO [train.py:996] (0/4) Epoch 12, batch 24350, loss[loss=0.275, simple_loss=0.3342, pruned_loss=0.1079, over 21237.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.307, pruned_loss=0.0731, over 4276573.59 frames. ], batch size: 143, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:11:08,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=2158806.0, ans=0.1 2023-06-26 06:11:51,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2158926.0, ans=0.125 2023-06-26 06:12:01,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 8.199e+02 1.178e+03 1.878e+03 3.422e+03, threshold=2.355e+03, percent-clipped=1.0 2023-06-26 06:12:24,458 INFO [train.py:996] (0/4) Epoch 12, batch 24400, loss[loss=0.2417, simple_loss=0.3179, pruned_loss=0.08279, over 21708.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3109, pruned_loss=0.07613, over 4276065.41 frames. ], batch size: 351, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:13:09,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2159106.0, ans=0.0 2023-06-26 06:13:16,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2159166.0, ans=0.125 2023-06-26 06:14:00,193 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-26 06:14:22,179 INFO [train.py:996] (0/4) Epoch 12, batch 24450, loss[loss=0.2112, simple_loss=0.2911, pruned_loss=0.06564, over 21345.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3124, pruned_loss=0.07786, over 4273753.71 frames. 
], batch size: 194, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:14:42,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2159346.0, ans=0.125 2023-06-26 06:15:06,596 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-06-26 06:15:23,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2159526.0, ans=0.0 2023-06-26 06:15:42,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.744e+02 8.987e+02 1.393e+03 2.049e+03 5.528e+03, threshold=2.786e+03, percent-clipped=20.0 2023-06-26 06:16:05,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=22.5 2023-06-26 06:16:11,957 INFO [train.py:996] (0/4) Epoch 12, batch 24500, loss[loss=0.2309, simple_loss=0.3019, pruned_loss=0.07997, over 21942.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3129, pruned_loss=0.07855, over 4269395.21 frames. ], batch size: 316, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:16:28,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2159646.0, ans=0.125 2023-06-26 06:17:52,464 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:18:10,457 INFO [train.py:996] (0/4) Epoch 12, batch 24550, loss[loss=0.2744, simple_loss=0.3585, pruned_loss=0.0952, over 21817.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3148, pruned_loss=0.08027, over 4274265.17 frames. ], batch size: 118, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:18:14,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2159946.0, ans=0.0 2023-06-26 06:18:19,174 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-26 06:18:24,952 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-360000.pt 2023-06-26 06:18:27,334 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-26 06:18:50,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-26 06:18:55,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2160066.0, ans=0.0 2023-06-26 06:19:24,406 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.77 vs. 
limit=15.0 2023-06-26 06:19:40,394 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.015e+02 1.028e+03 1.361e+03 2.117e+03 3.876e+03, threshold=2.722e+03, percent-clipped=8.0 2023-06-26 06:19:55,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2160186.0, ans=0.125 2023-06-26 06:19:55,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2160186.0, ans=0.125 2023-06-26 06:20:00,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2160186.0, ans=0.125 2023-06-26 06:20:03,147 INFO [train.py:996] (0/4) Epoch 12, batch 24600, loss[loss=0.1894, simple_loss=0.2555, pruned_loss=0.0616, over 21583.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3124, pruned_loss=0.08069, over 4271612.71 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:20:07,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2160246.0, ans=0.1 2023-06-26 06:20:19,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2160306.0, ans=0.0 2023-06-26 06:20:44,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2160366.0, ans=0.125 2023-06-26 06:21:01,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-26 06:21:43,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-06-26 06:21:53,020 INFO [train.py:996] (0/4) Epoch 12, batch 24650, loss[loss=0.224, simple_loss=0.288, pruned_loss=0.08002, over 21800.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3055, pruned_loss=0.07944, over 4273211.91 frames. ], batch size: 118, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:22:24,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-26 06:22:45,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2160666.0, ans=0.125 2023-06-26 06:23:01,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-26 06:23:08,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2160726.0, ans=0.04949747468305833 2023-06-26 06:23:21,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.831e+02 9.882e+02 1.634e+03 2.648e+03 5.082e+03, threshold=3.268e+03, percent-clipped=24.0 2023-06-26 06:23:45,431 INFO [train.py:996] (0/4) Epoch 12, batch 24700, loss[loss=0.2271, simple_loss=0.2869, pruned_loss=0.08368, over 21492.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3027, pruned_loss=0.07757, over 4261508.63 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:25:34,915 INFO [train.py:996] (0/4) Epoch 12, batch 24750, loss[loss=0.2133, simple_loss=0.2778, pruned_loss=0.0744, over 22036.00 frames. 
], tot_loss[loss=0.224, simple_loss=0.2977, pruned_loss=0.07515, over 4256414.76 frames. ], batch size: 103, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:26:38,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2161326.0, ans=10.0 2023-06-26 06:26:55,422 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.337e+02 7.152e+02 1.055e+03 1.533e+03 3.030e+03, threshold=2.111e+03, percent-clipped=0.0 2023-06-26 06:27:24,934 INFO [train.py:996] (0/4) Epoch 12, batch 24800, loss[loss=0.239, simple_loss=0.2991, pruned_loss=0.08941, over 21472.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2922, pruned_loss=0.0749, over 4268892.48 frames. ], batch size: 131, lr: 2.39e-03, grad_scale: 32.0 2023-06-26 06:29:09,399 INFO [train.py:996] (0/4) Epoch 12, batch 24850, loss[loss=0.2029, simple_loss=0.2734, pruned_loss=0.06624, over 21672.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.291, pruned_loss=0.07538, over 4272581.29 frames. ], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:30:00,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=22.5 2023-06-26 06:30:20,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2161926.0, ans=0.125 2023-06-26 06:30:26,981 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-26 06:30:45,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.468e+02 9.714e+02 1.514e+03 2.212e+03 4.137e+03, threshold=3.028e+03, percent-clipped=27.0 2023-06-26 06:30:46,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2161986.0, ans=0.0 2023-06-26 06:31:07,662 INFO [train.py:996] (0/4) Epoch 12, batch 24900, loss[loss=0.2448, simple_loss=0.3194, pruned_loss=0.08513, over 21443.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2952, pruned_loss=0.0764, over 4274215.10 frames. ], batch size: 131, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:31:43,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2162106.0, ans=0.0 2023-06-26 06:31:52,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2162166.0, ans=0.2 2023-06-26 06:31:56,616 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:32:05,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2162166.0, ans=0.2 2023-06-26 06:32:27,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2162226.0, ans=0.125 2023-06-26 06:32:32,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2162226.0, ans=0.125 2023-06-26 06:33:08,362 INFO [train.py:996] (0/4) Epoch 12, batch 24950, loss[loss=0.3069, simple_loss=0.3714, pruned_loss=0.1212, over 21596.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3019, pruned_loss=0.07943, over 4269322.63 frames. 
], batch size: 389, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:33:26,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2162406.0, ans=0.125 2023-06-26 06:34:28,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2162526.0, ans=0.04949747468305833 2023-06-26 06:34:34,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2162526.0, ans=0.04949747468305833 2023-06-26 06:34:43,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.936e+02 9.070e+02 1.289e+03 1.993e+03 4.042e+03, threshold=2.579e+03, percent-clipped=8.0 2023-06-26 06:34:44,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2162586.0, ans=0.0 2023-06-26 06:34:57,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0 2023-06-26 06:34:59,521 INFO [train.py:996] (0/4) Epoch 12, batch 25000, loss[loss=0.2681, simple_loss=0.3505, pruned_loss=0.09282, over 21838.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3094, pruned_loss=0.08176, over 4275362.26 frames. ], batch size: 124, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:35:31,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=10.0 2023-06-26 06:35:36,833 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-26 06:36:13,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2162826.0, ans=0.0 2023-06-26 06:36:14,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2162826.0, ans=0.1 2023-06-26 06:36:49,065 INFO [train.py:996] (0/4) Epoch 12, batch 25050, loss[loss=0.1902, simple_loss=0.2601, pruned_loss=0.0602, over 21579.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3026, pruned_loss=0.08043, over 4270207.40 frames. ], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:36:58,393 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-26 06:37:22,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2163006.0, ans=0.125 2023-06-26 06:37:36,744 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0 2023-06-26 06:37:52,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2163126.0, ans=0.0 2023-06-26 06:38:18,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.15 vs. 
limit=15.0 2023-06-26 06:38:26,469 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.221e+02 8.095e+02 1.236e+03 1.556e+03 3.258e+03, threshold=2.471e+03, percent-clipped=4.0 2023-06-26 06:38:42,005 INFO [train.py:996] (0/4) Epoch 12, batch 25100, loss[loss=0.2246, simple_loss=0.3145, pruned_loss=0.06735, over 21665.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2962, pruned_loss=0.07898, over 4277685.29 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:39:43,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2163426.0, ans=0.125 2023-06-26 06:40:23,743 INFO [train.py:996] (0/4) Epoch 12, batch 25150, loss[loss=0.2282, simple_loss=0.3294, pruned_loss=0.06351, over 21661.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2998, pruned_loss=0.07654, over 4277942.83 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 06:40:29,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2163546.0, ans=0.0 2023-06-26 06:41:06,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2163666.0, ans=0.125 2023-06-26 06:41:46,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2163786.0, ans=0.0 2023-06-26 06:41:48,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.357e+02 7.684e+02 1.120e+03 1.527e+03 4.461e+03, threshold=2.241e+03, percent-clipped=8.0 2023-06-26 06:42:06,954 INFO [train.py:996] (0/4) Epoch 12, batch 25200, loss[loss=0.2228, simple_loss=0.3252, pruned_loss=0.06026, over 21612.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3, pruned_loss=0.07518, over 4267668.01 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:42:08,951 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:42:22,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.31 vs. limit=22.5 2023-06-26 06:42:51,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2163966.0, ans=0.125 2023-06-26 06:43:56,909 INFO [train.py:996] (0/4) Epoch 12, batch 25250, loss[loss=0.2015, simple_loss=0.2681, pruned_loss=0.06741, over 20233.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2973, pruned_loss=0.07343, over 4257638.49 frames. 
], batch size: 703, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:44:04,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2164146.0, ans=0.125 2023-06-26 06:44:07,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2164146.0, ans=0.0 2023-06-26 06:44:55,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2164326.0, ans=0.2 2023-06-26 06:45:00,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2164326.0, ans=0.125 2023-06-26 06:45:25,564 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.169e+02 7.134e+02 9.127e+02 1.634e+03 5.272e+03, threshold=1.825e+03, percent-clipped=13.0 2023-06-26 06:45:26,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2164386.0, ans=0.0 2023-06-26 06:45:37,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2164386.0, ans=0.125 2023-06-26 06:45:45,408 INFO [train.py:996] (0/4) Epoch 12, batch 25300, loss[loss=0.1762, simple_loss=0.2581, pruned_loss=0.04715, over 21639.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2936, pruned_loss=0.07249, over 4248144.16 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:46:15,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2164506.0, ans=0.1 2023-06-26 06:46:29,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2164566.0, ans=0.2 2023-06-26 06:46:31,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2164566.0, ans=0.0 2023-06-26 06:47:22,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.49 vs. limit=10.0 2023-06-26 06:47:23,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2164686.0, ans=0.125 2023-06-26 06:47:35,408 INFO [train.py:996] (0/4) Epoch 12, batch 25350, loss[loss=0.2776, simple_loss=0.3524, pruned_loss=0.1014, over 21450.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2965, pruned_loss=0.07226, over 4237097.23 frames. ], batch size: 507, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:47:44,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2164746.0, ans=0.5 2023-06-26 06:47:54,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=2164746.0, ans=0.025 2023-06-26 06:48:09,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2164806.0, ans=0.1 2023-06-26 06:48:45,919 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.36 vs. 
limit=15.0 2023-06-26 06:49:04,252 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.341e+02 9.859e+02 1.527e+03 2.457e+03 4.731e+03, threshold=3.054e+03, percent-clipped=38.0 2023-06-26 06:49:17,672 INFO [train.py:996] (0/4) Epoch 12, batch 25400, loss[loss=0.1991, simple_loss=0.2685, pruned_loss=0.06482, over 21349.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2913, pruned_loss=0.07114, over 4241653.99 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:50:13,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2165226.0, ans=0.125 2023-06-26 06:50:23,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=2165226.0, ans=15.0 2023-06-26 06:50:30,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2165226.0, ans=0.0 2023-06-26 06:51:02,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2165286.0, ans=0.1 2023-06-26 06:51:05,433 INFO [train.py:996] (0/4) Epoch 12, batch 25450, loss[loss=0.2416, simple_loss=0.3121, pruned_loss=0.08549, over 21470.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2936, pruned_loss=0.07383, over 4251343.56 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:51:10,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-26 06:51:13,715 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-26 06:51:14,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2165346.0, ans=0.125 2023-06-26 06:52:05,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-06-26 06:52:43,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.859e+02 7.438e+02 1.024e+03 1.496e+03 3.770e+03, threshold=2.047e+03, percent-clipped=1.0 2023-06-26 06:52:43,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2165586.0, ans=0.0 2023-06-26 06:52:48,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.49 vs. limit=22.5 2023-06-26 06:52:55,775 INFO [train.py:996] (0/4) Epoch 12, batch 25500, loss[loss=0.2379, simple_loss=0.3221, pruned_loss=0.07683, over 21792.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2944, pruned_loss=0.0708, over 4253475.63 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 06:53:29,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2165706.0, ans=0.0 2023-06-26 06:53:49,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. 
limit=15.0 2023-06-26 06:53:52,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2165826.0, ans=0.125 2023-06-26 06:54:14,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2165826.0, ans=0.125 2023-06-26 06:54:53,051 INFO [train.py:996] (0/4) Epoch 12, batch 25550, loss[loss=0.2119, simple_loss=0.317, pruned_loss=0.05343, over 21737.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3023, pruned_loss=0.07208, over 4265633.50 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 06:55:08,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2165946.0, ans=0.125 2023-06-26 06:55:18,777 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:55:30,186 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.40 vs. limit=10.0 2023-06-26 06:56:11,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2166126.0, ans=0.1 2023-06-26 06:56:13,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2166126.0, ans=15.0 2023-06-26 06:56:15,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2166126.0, ans=0.025 2023-06-26 06:56:31,794 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.571e+02 9.605e+02 1.487e+03 2.203e+03 4.525e+03, threshold=2.973e+03, percent-clipped=31.0 2023-06-26 06:56:41,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=2166186.0, ans=15.0 2023-06-26 06:56:43,675 INFO [train.py:996] (0/4) Epoch 12, batch 25600, loss[loss=0.2862, simple_loss=0.3547, pruned_loss=0.1088, over 21779.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3073, pruned_loss=0.07422, over 4269774.96 frames. ], batch size: 441, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:57:09,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2166306.0, ans=0.125 2023-06-26 06:58:02,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2166426.0, ans=0.125 2023-06-26 06:58:27,621 INFO [train.py:996] (0/4) Epoch 12, batch 25650, loss[loss=0.2107, simple_loss=0.2726, pruned_loss=0.07438, over 21571.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3074, pruned_loss=0.07628, over 4268375.82 frames. ], batch size: 415, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:00:05,373 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.624e+02 9.508e+02 1.344e+03 1.911e+03 3.919e+03, threshold=2.688e+03, percent-clipped=6.0 2023-06-26 07:00:17,988 INFO [train.py:996] (0/4) Epoch 12, batch 25700, loss[loss=0.2312, simple_loss=0.3185, pruned_loss=0.07197, over 21459.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3053, pruned_loss=0.07666, over 4271208.83 frames. 
], batch size: 211, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:01:11,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2166966.0, ans=0.125 2023-06-26 07:01:17,460 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=15.0 2023-06-26 07:01:54,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2167086.0, ans=0.125 2023-06-26 07:02:06,693 INFO [train.py:996] (0/4) Epoch 12, batch 25750, loss[loss=0.1917, simple_loss=0.2532, pruned_loss=0.06508, over 21050.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3085, pruned_loss=0.07886, over 4271814.94 frames. ], batch size: 608, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:02:21,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2167146.0, ans=0.0 2023-06-26 07:02:58,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2167266.0, ans=0.1 2023-06-26 07:03:14,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2167266.0, ans=0.05 2023-06-26 07:03:43,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=15.0 2023-06-26 07:03:47,664 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.983e+02 9.621e+02 1.292e+03 1.996e+03 6.312e+03, threshold=2.583e+03, percent-clipped=12.0 2023-06-26 07:04:04,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2167446.0, ans=0.125 2023-06-26 07:04:10,198 INFO [train.py:996] (0/4) Epoch 12, batch 25800, loss[loss=0.2574, simple_loss=0.3359, pruned_loss=0.08943, over 21677.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3213, pruned_loss=0.08379, over 4268506.49 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:04:11,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-06-26 07:04:24,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2167446.0, ans=0.125 2023-06-26 07:04:31,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=2167506.0, ans=10.0 2023-06-26 07:04:40,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2167506.0, ans=0.1 2023-06-26 07:05:18,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2167626.0, ans=0.2 2023-06-26 07:05:41,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2167686.0, ans=0.0 2023-06-26 07:06:05,652 INFO [train.py:996] (0/4) Epoch 12, batch 25850, loss[loss=0.2238, simple_loss=0.3001, pruned_loss=0.0738, over 21795.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.322, pruned_loss=0.08281, over 4269546.86 frames. 
], batch size: 298, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:06:08,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2167746.0, ans=0.125 2023-06-26 07:06:09,210 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.42 vs. limit=10.0 2023-06-26 07:06:16,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=2167746.0, ans=0.025 2023-06-26 07:06:42,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2167806.0, ans=0.125 2023-06-26 07:07:03,737 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:07:48,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.401e+02 9.971e+02 1.375e+03 1.928e+03 5.111e+03, threshold=2.750e+03, percent-clipped=7.0 2023-06-26 07:07:52,584 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-26 07:08:04,023 INFO [train.py:996] (0/4) Epoch 12, batch 25900, loss[loss=0.2476, simple_loss=0.3392, pruned_loss=0.07801, over 21820.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3248, pruned_loss=0.08354, over 4276499.27 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:08:25,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2168106.0, ans=0.0 2023-06-26 07:08:56,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2168166.0, ans=0.125 2023-06-26 07:09:24,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2168226.0, ans=0.0 2023-06-26 07:09:35,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2168286.0, ans=0.125 2023-06-26 07:09:55,092 INFO [train.py:996] (0/4) Epoch 12, batch 25950, loss[loss=0.2367, simple_loss=0.3164, pruned_loss=0.07852, over 21594.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3281, pruned_loss=0.08493, over 4275784.47 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:09:56,288 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.63 vs. limit=10.0 2023-06-26 07:09:58,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2168346.0, ans=0.1 2023-06-26 07:10:00,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2168346.0, ans=0.0 2023-06-26 07:10:54,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2168466.0, ans=0.125 2023-06-26 07:11:35,004 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.821e+02 8.955e+02 1.246e+03 1.921e+03 4.535e+03, threshold=2.491e+03, percent-clipped=11.0 2023-06-26 07:11:45,170 INFO [train.py:996] (0/4) Epoch 12, batch 26000, loss[loss=0.2754, simple_loss=0.3534, pruned_loss=0.0987, over 21480.00 frames. 
], tot_loss[loss=0.2468, simple_loss=0.3271, pruned_loss=0.0832, over 4274295.68 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:11:47,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2168646.0, ans=0.125 2023-06-26 07:12:15,465 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:12:40,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2168766.0, ans=0.125 2023-06-26 07:12:53,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2168766.0, ans=10.0 2023-06-26 07:13:34,717 INFO [train.py:996] (0/4) Epoch 12, batch 26050, loss[loss=0.2478, simple_loss=0.3177, pruned_loss=0.08891, over 21868.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3278, pruned_loss=0.08418, over 4275548.30 frames. ], batch size: 118, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:14:00,971 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2023-06-26 07:14:48,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-06-26 07:15:10,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2169186.0, ans=0.0 2023-06-26 07:15:11,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.629e+02 1.039e+03 1.480e+03 2.039e+03 5.924e+03, threshold=2.960e+03, percent-clipped=13.0 2023-06-26 07:15:17,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2169186.0, ans=0.0 2023-06-26 07:15:22,454 INFO [train.py:996] (0/4) Epoch 12, batch 26100, loss[loss=0.2187, simple_loss=0.2895, pruned_loss=0.074, over 20963.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3221, pruned_loss=0.08428, over 4279227.80 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:16:25,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2169366.0, ans=0.0 2023-06-26 07:16:31,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2169426.0, ans=0.2 2023-06-26 07:16:57,105 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-26 07:17:11,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.75 vs. limit=15.0 2023-06-26 07:17:13,276 INFO [train.py:996] (0/4) Epoch 12, batch 26150, loss[loss=0.2134, simple_loss=0.2855, pruned_loss=0.07068, over 21708.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3187, pruned_loss=0.08416, over 4289054.00 frames. ], batch size: 230, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:17:17,763 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.87 vs. 
limit=15.0 2023-06-26 07:17:28,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-26 07:18:18,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=2169666.0, ans=6.0 2023-06-26 07:18:34,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2169726.0, ans=0.125 2023-06-26 07:18:52,965 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.447e+02 8.636e+02 1.260e+03 1.715e+03 2.544e+03, threshold=2.520e+03, percent-clipped=0.0 2023-06-26 07:19:03,281 INFO [train.py:996] (0/4) Epoch 12, batch 26200, loss[loss=0.2769, simple_loss=0.374, pruned_loss=0.08986, over 21860.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3196, pruned_loss=0.08236, over 4294936.86 frames. ], batch size: 371, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:19:42,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2169906.0, ans=0.2 2023-06-26 07:19:54,813 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:20:05,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2169966.0, ans=0.125 2023-06-26 07:20:52,551 INFO [train.py:996] (0/4) Epoch 12, batch 26250, loss[loss=0.2135, simple_loss=0.332, pruned_loss=0.04745, over 20722.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3221, pruned_loss=0.08145, over 4293902.76 frames. ], batch size: 608, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:21:25,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2170206.0, ans=0.2 2023-06-26 07:22:31,925 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.498e+02 1.014e+03 1.360e+03 1.996e+03 4.754e+03, threshold=2.720e+03, percent-clipped=15.0 2023-06-26 07:22:42,524 INFO [train.py:996] (0/4) Epoch 12, batch 26300, loss[loss=0.2482, simple_loss=0.3082, pruned_loss=0.09405, over 21556.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3197, pruned_loss=0.08187, over 4298620.37 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:23:20,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2170506.0, ans=0.1 2023-06-26 07:24:05,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2170626.0, ans=0.125 2023-06-26 07:24:40,565 INFO [train.py:996] (0/4) Epoch 12, batch 26350, loss[loss=0.2525, simple_loss=0.3198, pruned_loss=0.09266, over 21451.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.319, pruned_loss=0.08302, over 4292452.22 frames. 
], batch size: 211, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:25:56,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2170926.0, ans=0.125 2023-06-26 07:26:10,257 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.697e+02 8.952e+02 1.093e+03 1.459e+03 3.186e+03, threshold=2.186e+03, percent-clipped=0.0 2023-06-26 07:26:10,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2170986.0, ans=0.125 2023-06-26 07:26:12,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2170986.0, ans=0.1 2023-06-26 07:26:25,978 INFO [train.py:996] (0/4) Epoch 12, batch 26400, loss[loss=0.1933, simple_loss=0.2549, pruned_loss=0.06588, over 21256.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3142, pruned_loss=0.08345, over 4292656.09 frames. ], batch size: 549, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:27:18,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2171166.0, ans=0.125 2023-06-26 07:27:29,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-26 07:27:38,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2171226.0, ans=0.04949747468305833 2023-06-26 07:27:44,091 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-06-26 07:27:53,439 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-26 07:28:30,884 INFO [train.py:996] (0/4) Epoch 12, batch 26450, loss[loss=0.2375, simple_loss=0.3283, pruned_loss=0.07338, over 21755.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3166, pruned_loss=0.0842, over 4289905.51 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:28:37,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2171346.0, ans=0.09899494936611666 2023-06-26 07:29:10,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2171466.0, ans=0.125 2023-06-26 07:29:18,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2171466.0, ans=0.125 2023-06-26 07:29:45,994 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-06-26 07:29:46,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.59 vs. 
limit=15.0 2023-06-26 07:30:11,351 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.179e+02 1.055e+03 2.037e+03 2.907e+03 6.247e+03, threshold=4.074e+03, percent-clipped=46.0 2023-06-26 07:30:13,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2171586.0, ans=0.125 2023-06-26 07:30:13,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2171586.0, ans=0.1 2023-06-26 07:30:19,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2171646.0, ans=0.0 2023-06-26 07:30:19,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2171646.0, ans=0.0 2023-06-26 07:30:20,922 INFO [train.py:996] (0/4) Epoch 12, batch 26500, loss[loss=0.1839, simple_loss=0.2461, pruned_loss=0.06081, over 21325.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3176, pruned_loss=0.08214, over 4288798.27 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:30:55,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=15.0 2023-06-26 07:31:24,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2171766.0, ans=0.0 2023-06-26 07:31:40,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2171826.0, ans=0.125 2023-06-26 07:32:13,345 INFO [train.py:996] (0/4) Epoch 12, batch 26550, loss[loss=0.1955, simple_loss=0.313, pruned_loss=0.03897, over 20824.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3161, pruned_loss=0.07935, over 4276709.84 frames. ], batch size: 608, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:32:27,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2171946.0, ans=0.125 2023-06-26 07:32:31,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2172006.0, ans=0.125 2023-06-26 07:32:55,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2172066.0, ans=0.125 2023-06-26 07:33:12,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2172066.0, ans=0.125 2023-06-26 07:33:13,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2172066.0, ans=0.125 2023-06-26 07:33:48,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.879e+02 9.613e+02 1.344e+03 2.252e+03 5.344e+03, threshold=2.687e+03, percent-clipped=4.0 2023-06-26 07:33:57,132 INFO [train.py:996] (0/4) Epoch 12, batch 26600, loss[loss=0.204, simple_loss=0.2843, pruned_loss=0.06182, over 21603.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3142, pruned_loss=0.07681, over 4270175.30 frames. 
], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:34:06,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2172246.0, ans=0.07 2023-06-26 07:34:12,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2172306.0, ans=0.1 2023-06-26 07:34:16,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2172306.0, ans=0.125 2023-06-26 07:34:37,730 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:35:45,173 INFO [train.py:996] (0/4) Epoch 12, batch 26650, loss[loss=0.1761, simple_loss=0.2672, pruned_loss=0.0425, over 21578.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3071, pruned_loss=0.07492, over 4258547.55 frames. ], batch size: 442, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:35:46,461 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.01 vs. limit=12.0 2023-06-26 07:35:54,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-26 07:35:55,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2172546.0, ans=0.1 2023-06-26 07:36:05,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2172606.0, ans=0.0 2023-06-26 07:36:42,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2172666.0, ans=0.04949747468305833 2023-06-26 07:36:55,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2172726.0, ans=0.125 2023-06-26 07:37:15,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2172786.0, ans=0.0 2023-06-26 07:37:20,212 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.879e+02 7.426e+02 9.644e+02 1.380e+03 2.955e+03, threshold=1.929e+03, percent-clipped=1.0 2023-06-26 07:37:21,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2172786.0, ans=0.125 2023-06-26 07:37:27,389 INFO [train.py:996] (0/4) Epoch 12, batch 26700, loss[loss=0.1768, simple_loss=0.2443, pruned_loss=0.05467, over 21245.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3002, pruned_loss=0.07233, over 4264103.42 frames. ], batch size: 176, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:37:36,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=2172846.0, ans=22.5 2023-06-26 07:37:41,803 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. 
limit=15.0 2023-06-26 07:38:57,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2173026.0, ans=0.125 2023-06-26 07:39:02,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2173086.0, ans=0.125 2023-06-26 07:39:17,615 INFO [train.py:996] (0/4) Epoch 12, batch 26750, loss[loss=0.2027, simple_loss=0.2932, pruned_loss=0.05612, over 21421.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2998, pruned_loss=0.07087, over 4265865.24 frames. ], batch size: 211, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:39:52,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2173206.0, ans=0.125 2023-06-26 07:40:47,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.74 vs. limit=10.0 2023-06-26 07:41:00,874 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.608e+02 7.962e+02 9.923e+02 1.450e+03 3.695e+03, threshold=1.985e+03, percent-clipped=8.0 2023-06-26 07:41:18,498 INFO [train.py:996] (0/4) Epoch 12, batch 26800, loss[loss=0.2381, simple_loss=0.3135, pruned_loss=0.08138, over 20646.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3071, pruned_loss=0.07526, over 4262242.22 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:41:31,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2173446.0, ans=0.0 2023-06-26 07:41:40,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2173506.0, ans=0.125 2023-06-26 07:43:10,041 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.39 vs. limit=15.0 2023-06-26 07:43:10,560 INFO [train.py:996] (0/4) Epoch 12, batch 26850, loss[loss=0.2431, simple_loss=0.2953, pruned_loss=0.09547, over 21124.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3085, pruned_loss=0.07762, over 4263065.58 frames. ], batch size: 159, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:43:21,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.59 vs. limit=15.0 2023-06-26 07:43:47,282 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-26 07:44:40,221 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.975e+02 7.931e+02 1.142e+03 1.522e+03 3.683e+03, threshold=2.283e+03, percent-clipped=9.0 2023-06-26 07:44:53,083 INFO [train.py:996] (0/4) Epoch 12, batch 26900, loss[loss=0.2334, simple_loss=0.2926, pruned_loss=0.08704, over 21341.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2993, pruned_loss=0.0765, over 4264826.87 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:46:41,255 INFO [train.py:996] (0/4) Epoch 12, batch 26950, loss[loss=0.3484, simple_loss=0.4112, pruned_loss=0.1428, over 21477.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2995, pruned_loss=0.07716, over 4265258.99 frames. 
], batch size: 508, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:47:02,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2174406.0, ans=0.04949747468305833 2023-06-26 07:47:14,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2174466.0, ans=0.2 2023-06-26 07:47:46,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2174526.0, ans=0.125 2023-06-26 07:47:50,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2174586.0, ans=0.125 2023-06-26 07:48:17,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.465e+02 7.709e+02 1.449e+03 2.208e+03 6.166e+03, threshold=2.897e+03, percent-clipped=23.0 2023-06-26 07:48:22,384 INFO [train.py:996] (0/4) Epoch 12, batch 27000, loss[loss=0.229, simple_loss=0.3336, pruned_loss=0.06218, over 21177.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3006, pruned_loss=0.07492, over 4270078.28 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:48:22,385 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 07:48:40,348 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2401, simple_loss=0.3367, pruned_loss=0.07176, over 1796401.00 frames. 2023-06-26 07:48:40,349 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-26 07:48:40,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2174646.0, ans=0.125 2023-06-26 07:49:10,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2174706.0, ans=0.2 2023-06-26 07:49:16,869 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-26 07:49:41,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2174826.0, ans=0.2 2023-06-26 07:50:18,476 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. limit=10.0 2023-06-26 07:50:21,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=2174886.0, ans=22.5 2023-06-26 07:50:26,726 INFO [train.py:996] (0/4) Epoch 12, batch 27050, loss[loss=0.2526, simple_loss=0.3257, pruned_loss=0.08972, over 21803.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3033, pruned_loss=0.07266, over 4274297.83 frames. 
], batch size: 298, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:50:32,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2174946.0, ans=0.125 2023-06-26 07:50:36,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2174946.0, ans=0.125 2023-06-26 07:52:05,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2175186.0, ans=0.2 2023-06-26 07:52:06,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.688e+02 1.022e+03 1.429e+03 2.477e+03 5.037e+03, threshold=2.858e+03, percent-clipped=17.0 2023-06-26 07:52:12,166 INFO [train.py:996] (0/4) Epoch 12, batch 27100, loss[loss=0.2408, simple_loss=0.3036, pruned_loss=0.08899, over 21606.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3043, pruned_loss=0.07391, over 4283387.41 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:53:37,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2175426.0, ans=0.125 2023-06-26 07:53:53,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.49 vs. limit=15.0 2023-06-26 07:54:02,041 INFO [train.py:996] (0/4) Epoch 12, batch 27150, loss[loss=0.3648, simple_loss=0.4402, pruned_loss=0.1447, over 21514.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3175, pruned_loss=0.07782, over 4285114.53 frames. ], batch size: 508, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:54:26,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2175606.0, ans=0.2 2023-06-26 07:54:52,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2175666.0, ans=0.125 2023-06-26 07:55:04,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2175666.0, ans=0.125 2023-06-26 07:55:12,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2175726.0, ans=0.0 2023-06-26 07:55:40,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.180e+02 9.529e+02 1.614e+03 2.368e+03 4.440e+03, threshold=3.227e+03, percent-clipped=12.0 2023-06-26 07:55:45,341 INFO [train.py:996] (0/4) Epoch 12, batch 27200, loss[loss=0.2615, simple_loss=0.3525, pruned_loss=0.08523, over 21327.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3266, pruned_loss=0.08027, over 4283575.94 frames. 
], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:55:45,957 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:55:47,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2175846.0, ans=0.125 2023-06-26 07:55:49,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2175846.0, ans=0.125 2023-06-26 07:55:51,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2175846.0, ans=0.125 2023-06-26 07:56:38,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2175906.0, ans=0.125 2023-06-26 07:56:53,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2175966.0, ans=0.125 2023-06-26 07:57:01,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2176026.0, ans=0.125 2023-06-26 07:57:11,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2176026.0, ans=0.125 2023-06-26 07:57:17,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2176086.0, ans=0.1 2023-06-26 07:57:33,895 INFO [train.py:996] (0/4) Epoch 12, batch 27250, loss[loss=0.2377, simple_loss=0.3163, pruned_loss=0.07961, over 21947.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.328, pruned_loss=0.08328, over 4279086.41 frames. ], batch size: 372, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:57:39,029 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-06-26 07:58:08,925 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=22.5 2023-06-26 07:58:23,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2176266.0, ans=0.125 2023-06-26 07:58:27,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2176266.0, ans=0.125 2023-06-26 07:58:40,971 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-26 07:58:52,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2176326.0, ans=0.125 2023-06-26 07:58:52,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2176326.0, ans=0.125 2023-06-26 07:59:02,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2176386.0, ans=0.2 2023-06-26 07:59:24,079 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.268e+02 8.548e+02 1.157e+03 1.722e+03 4.031e+03, threshold=2.315e+03, percent-clipped=3.0 2023-06-26 07:59:32,600 INFO [train.py:996] (0/4) Epoch 12, batch 27300, loss[loss=0.2462, simple_loss=0.3273, pruned_loss=0.08253, over 21233.00 frames. 
], tot_loss[loss=0.2482, simple_loss=0.3286, pruned_loss=0.08395, over 4283252.00 frames. ], batch size: 159, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:59:52,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2176446.0, ans=0.025 2023-06-26 07:59:52,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2176446.0, ans=0.0 2023-06-26 08:00:13,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-26 08:00:18,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2176566.0, ans=0.07 2023-06-26 08:00:25,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2176566.0, ans=0.0 2023-06-26 08:01:10,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2176686.0, ans=0.125 2023-06-26 08:01:21,486 INFO [train.py:996] (0/4) Epoch 12, batch 27350, loss[loss=0.2413, simple_loss=0.3256, pruned_loss=0.07853, over 21814.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3304, pruned_loss=0.08452, over 4288830.37 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:01:32,957 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=12.0 2023-06-26 08:01:40,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2176806.0, ans=0.0 2023-06-26 08:02:05,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2176866.0, ans=10.0 2023-06-26 08:02:53,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2176986.0, ans=0.125 2023-06-26 08:02:59,811 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.844e+02 9.303e+02 1.156e+03 1.696e+03 4.537e+03, threshold=2.312e+03, percent-clipped=11.0 2023-06-26 08:03:05,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2176986.0, ans=0.125 2023-06-26 08:03:08,455 INFO [train.py:996] (0/4) Epoch 12, batch 27400, loss[loss=0.2298, simple_loss=0.2871, pruned_loss=0.08622, over 21258.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3256, pruned_loss=0.08386, over 4281756.34 frames. ], batch size: 159, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:03:16,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2177046.0, ans=0.1 2023-06-26 08:03:30,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2177106.0, ans=0.09899494936611666 2023-06-26 08:03:39,081 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. 
limit=6.0 2023-06-26 08:04:12,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2177226.0, ans=0.125 2023-06-26 08:04:56,377 INFO [train.py:996] (0/4) Epoch 12, batch 27450, loss[loss=0.2248, simple_loss=0.2935, pruned_loss=0.07802, over 21071.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3186, pruned_loss=0.08251, over 4281074.32 frames. ], batch size: 143, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:05:29,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2177466.0, ans=0.125 2023-06-26 08:05:32,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2177466.0, ans=0.0 2023-06-26 08:05:45,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2177526.0, ans=0.2 2023-06-26 08:06:24,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2177586.0, ans=0.0 2023-06-26 08:06:35,009 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.529e+02 8.908e+02 1.196e+03 1.781e+03 4.590e+03, threshold=2.391e+03, percent-clipped=13.0 2023-06-26 08:06:37,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-26 08:06:38,346 INFO [train.py:996] (0/4) Epoch 12, batch 27500, loss[loss=0.2284, simple_loss=0.2999, pruned_loss=0.07839, over 21834.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.318, pruned_loss=0.08349, over 4278849.63 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:06:40,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2177646.0, ans=0.125 2023-06-26 08:06:57,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2177706.0, ans=0.0 2023-06-26 08:06:59,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2177706.0, ans=0.0 2023-06-26 08:07:22,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2177766.0, ans=0.125 2023-06-26 08:08:28,018 INFO [train.py:996] (0/4) Epoch 12, batch 27550, loss[loss=0.1984, simple_loss=0.2765, pruned_loss=0.06009, over 21780.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3143, pruned_loss=0.07993, over 4278897.99 frames. 
], batch size: 351, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:08:32,047 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:09:09,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2178066.0, ans=0.125 2023-06-26 08:09:57,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2178186.0, ans=0.2 2023-06-26 08:10:02,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2178186.0, ans=0.05 2023-06-26 08:10:06,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2178186.0, ans=0.2 2023-06-26 08:10:12,149 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.726e+02 8.167e+02 1.305e+03 2.054e+03 4.575e+03, threshold=2.609e+03, percent-clipped=17.0 2023-06-26 08:10:15,780 INFO [train.py:996] (0/4) Epoch 12, batch 27600, loss[loss=0.2066, simple_loss=0.2761, pruned_loss=0.06855, over 21782.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3091, pruned_loss=0.07953, over 4270656.42 frames. ], batch size: 317, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:10:17,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2178246.0, ans=0.025 2023-06-26 08:10:51,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2178366.0, ans=0.0 2023-06-26 08:11:14,399 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-06-26 08:11:59,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2178546.0, ans=0.2 2023-06-26 08:12:00,891 INFO [train.py:996] (0/4) Epoch 12, batch 27650, loss[loss=0.1964, simple_loss=0.2734, pruned_loss=0.0597, over 21898.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3025, pruned_loss=0.07846, over 4271418.93 frames. 
], batch size: 98, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:12:13,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2178546.0, ans=0.1 2023-06-26 08:12:24,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2178606.0, ans=0.0 2023-06-26 08:12:32,921 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:12:56,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2178726.0, ans=0.1 2023-06-26 08:13:03,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2178726.0, ans=0.125 2023-06-26 08:13:17,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2178726.0, ans=0.125 2023-06-26 08:13:46,688 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.067e+02 8.445e+02 1.173e+03 2.476e+03 5.396e+03, threshold=2.346e+03, percent-clipped=23.0 2023-06-26 08:13:49,053 INFO [train.py:996] (0/4) Epoch 12, batch 27700, loss[loss=0.2244, simple_loss=0.3017, pruned_loss=0.07355, over 21343.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3032, pruned_loss=0.07692, over 4270450.25 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:14:20,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2178906.0, ans=0.015 2023-06-26 08:15:33,974 INFO [train.py:996] (0/4) Epoch 12, batch 27750, loss[loss=0.2876, simple_loss=0.3468, pruned_loss=0.1142, over 21698.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3064, pruned_loss=0.07642, over 4269513.78 frames. ], batch size: 508, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:15:39,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2179146.0, ans=0.0 2023-06-26 08:15:48,707 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-26 08:16:37,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2179326.0, ans=0.1 2023-06-26 08:17:17,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.559e+02 9.501e+02 1.486e+03 2.223e+03 3.680e+03, threshold=2.972e+03, percent-clipped=21.0 2023-06-26 08:17:19,382 INFO [train.py:996] (0/4) Epoch 12, batch 27800, loss[loss=0.2303, simple_loss=0.2981, pruned_loss=0.08125, over 21473.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3034, pruned_loss=0.07643, over 4273473.53 frames. ], batch size: 194, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:17:33,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2179446.0, ans=0.1 2023-06-26 08:17:37,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2179506.0, ans=0.1 2023-06-26 08:18:53,664 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.90 vs. 
limit=12.0 2023-06-26 08:19:07,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2179686.0, ans=0.125 2023-06-26 08:19:10,329 INFO [train.py:996] (0/4) Epoch 12, batch 27850, loss[loss=0.2408, simple_loss=0.3082, pruned_loss=0.08672, over 21781.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3025, pruned_loss=0.07748, over 4284449.41 frames. ], batch size: 112, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:19:19,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2179746.0, ans=0.0 2023-06-26 08:19:59,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.15 vs. limit=10.0 2023-06-26 08:20:20,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2179926.0, ans=0.0 2023-06-26 08:20:42,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2179986.0, ans=0.2 2023-06-26 08:20:53,735 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.551e+02 9.151e+02 1.351e+03 2.039e+03 4.854e+03, threshold=2.703e+03, percent-clipped=13.0 2023-06-26 08:20:55,765 INFO [train.py:996] (0/4) Epoch 12, batch 27900, loss[loss=0.2325, simple_loss=0.3546, pruned_loss=0.05519, over 19809.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3121, pruned_loss=0.07909, over 4283740.33 frames. ], batch size: 703, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:21:35,516 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-06-26 08:22:09,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-26 08:22:46,941 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-26 08:22:47,184 INFO [train.py:996] (0/4) Epoch 12, batch 27950, loss[loss=0.199, simple_loss=0.292, pruned_loss=0.05297, over 21774.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.312, pruned_loss=0.0754, over 4283686.45 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:24:33,133 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.115e+02 7.857e+02 1.156e+03 1.726e+03 4.008e+03, threshold=2.312e+03, percent-clipped=7.0 2023-06-26 08:24:34,679 INFO [train.py:996] (0/4) Epoch 12, batch 28000, loss[loss=0.2066, simple_loss=0.2974, pruned_loss=0.05786, over 21728.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3094, pruned_loss=0.07387, over 4283242.35 frames. ], batch size: 389, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:24:41,046 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. 
limit=15.0 2023-06-26 08:25:06,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=2180706.0, ans=0.1 2023-06-26 08:25:11,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2180706.0, ans=0.1 2023-06-26 08:25:40,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-26 08:26:05,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2180886.0, ans=0.125 2023-06-26 08:26:22,272 INFO [train.py:996] (0/4) Epoch 12, batch 28050, loss[loss=0.2545, simple_loss=0.3387, pruned_loss=0.08519, over 21515.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3066, pruned_loss=0.07453, over 4281320.64 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:26:34,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2180946.0, ans=0.0 2023-06-26 08:26:48,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-26 08:27:04,300 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:27:27,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2181066.0, ans=0.125 2023-06-26 08:27:30,285 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-26 08:27:33,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2181066.0, ans=0.1 2023-06-26 08:27:41,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2181126.0, ans=0.125 2023-06-26 08:28:11,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.128e+02 1.251e+03 1.568e+03 2.445e+03 4.428e+03, threshold=3.136e+03, percent-clipped=24.0 2023-06-26 08:28:13,346 INFO [train.py:996] (0/4) Epoch 12, batch 28100, loss[loss=0.2182, simple_loss=0.2933, pruned_loss=0.07157, over 21551.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3051, pruned_loss=0.07502, over 4276482.47 frames. ], batch size: 441, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:28:32,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2181246.0, ans=0.1 2023-06-26 08:28:45,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2181306.0, ans=0.0 2023-06-26 08:29:10,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2181366.0, ans=0.125 2023-06-26 08:30:00,943 INFO [train.py:996] (0/4) Epoch 12, batch 28150, loss[loss=0.2525, simple_loss=0.3113, pruned_loss=0.09684, over 21827.00 frames. ], tot_loss[loss=0.224, simple_loss=0.299, pruned_loss=0.07448, over 4275911.80 frames. 
], batch size: 98, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:30:36,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2181606.0, ans=0.07 2023-06-26 08:30:45,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2181606.0, ans=0.1 2023-06-26 08:30:50,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2181606.0, ans=0.0 2023-06-26 08:31:21,197 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:31:50,383 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.207e+02 9.538e+02 1.303e+03 2.099e+03 3.954e+03, threshold=2.605e+03, percent-clipped=5.0 2023-06-26 08:31:52,078 INFO [train.py:996] (0/4) Epoch 12, batch 28200, loss[loss=0.2507, simple_loss=0.3314, pruned_loss=0.08498, over 21827.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2957, pruned_loss=0.07573, over 4278053.88 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:32:11,744 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-06-26 08:32:24,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2181846.0, ans=0.125 2023-06-26 08:32:43,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2181966.0, ans=0.125 2023-06-26 08:33:06,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2182026.0, ans=0.2 2023-06-26 08:33:53,906 INFO [train.py:996] (0/4) Epoch 12, batch 28250, loss[loss=0.2371, simple_loss=0.293, pruned_loss=0.09062, over 21229.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3009, pruned_loss=0.08001, over 4282869.46 frames. ], batch size: 176, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:33:58,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2182146.0, ans=0.125 2023-06-26 08:34:00,509 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.74 vs. limit=15.0 2023-06-26 08:35:15,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=2182386.0, ans=0.5 2023-06-26 08:35:42,660 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.099e+02 9.506e+02 1.674e+03 2.975e+03 5.994e+03, threshold=3.348e+03, percent-clipped=30.0 2023-06-26 08:35:44,326 INFO [train.py:996] (0/4) Epoch 12, batch 28300, loss[loss=0.1757, simple_loss=0.2346, pruned_loss=0.05842, over 20655.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2971, pruned_loss=0.07735, over 4264726.53 frames. 
], batch size: 607, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 08:35:52,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2182446.0, ans=0.125 2023-06-26 08:36:40,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2182626.0, ans=0.125 2023-06-26 08:36:56,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2182626.0, ans=0.2 2023-06-26 08:37:22,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2182686.0, ans=0.125 2023-06-26 08:37:41,268 INFO [train.py:996] (0/4) Epoch 12, batch 28350, loss[loss=0.2025, simple_loss=0.2933, pruned_loss=0.05582, over 21777.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2952, pruned_loss=0.07253, over 4266228.60 frames. ], batch size: 282, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 08:38:00,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2182806.0, ans=0.2 2023-06-26 08:38:02,948 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-26 08:38:04,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2182806.0, ans=0.0 2023-06-26 08:38:13,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2182866.0, ans=0.2 2023-06-26 08:38:21,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=2182866.0, ans=0.1 2023-06-26 08:38:55,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2182926.0, ans=0.1 2023-06-26 08:39:26,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2182986.0, ans=0.0 2023-06-26 08:39:27,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.238e+02 9.118e+02 1.283e+03 2.054e+03 3.751e+03, threshold=2.567e+03, percent-clipped=3.0 2023-06-26 08:39:29,429 INFO [train.py:996] (0/4) Epoch 12, batch 28400, loss[loss=0.1853, simple_loss=0.2483, pruned_loss=0.06112, over 21466.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2918, pruned_loss=0.07192, over 4264796.32 frames. ], batch size: 212, lr: 2.37e-03, grad_scale: 32.0 2023-06-26 08:39:34,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2183046.0, ans=0.125 2023-06-26 08:39:40,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2183046.0, ans=0.1 2023-06-26 08:39:42,946 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-26 08:39:49,954 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.66 vs. 
limit=15.0 2023-06-26 08:39:57,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2183106.0, ans=0.2 2023-06-26 08:41:22,185 INFO [train.py:996] (0/4) Epoch 12, batch 28450, loss[loss=0.2788, simple_loss=0.3375, pruned_loss=0.1101, over 21584.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2966, pruned_loss=0.07505, over 4265450.14 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:41:45,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-26 08:41:46,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2183406.0, ans=0.125 2023-06-26 08:42:51,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2183526.0, ans=0.0 2023-06-26 08:43:12,238 INFO [train.py:996] (0/4) Epoch 12, batch 28500, loss[loss=0.2624, simple_loss=0.3322, pruned_loss=0.09631, over 21839.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2995, pruned_loss=0.07759, over 4267938.89 frames. ], batch size: 441, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:43:13,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.783e+02 7.825e+02 1.127e+03 1.628e+03 3.596e+03, threshold=2.254e+03, percent-clipped=4.0 2023-06-26 08:43:31,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2183706.0, ans=0.0 2023-06-26 08:43:49,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2183706.0, ans=15.0 2023-06-26 08:43:57,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2183766.0, ans=0.1 2023-06-26 08:44:50,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=2183886.0, ans=0.2 2023-06-26 08:44:55,218 INFO [train.py:996] (0/4) Epoch 12, batch 28550, loss[loss=0.2296, simple_loss=0.3026, pruned_loss=0.0783, over 21304.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3065, pruned_loss=0.07971, over 4271814.29 frames. ], batch size: 548, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:45:09,504 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/checkpoint-364000.pt 2023-06-26 08:45:41,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. limit=10.0 2023-06-26 08:46:06,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2184066.0, ans=0.0 2023-06-26 08:46:33,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2184186.0, ans=0.125 2023-06-26 08:46:45,258 INFO [train.py:996] (0/4) Epoch 12, batch 28600, loss[loss=0.2578, simple_loss=0.3309, pruned_loss=0.09231, over 21571.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3131, pruned_loss=0.08161, over 4273732.54 frames. 
], batch size: 389, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:46:47,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.084e+02 9.095e+02 1.267e+03 2.017e+03 3.986e+03, threshold=2.534e+03, percent-clipped=14.0 2023-06-26 08:47:47,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2184366.0, ans=0.0 2023-06-26 08:48:39,285 INFO [train.py:996] (0/4) Epoch 12, batch 28650, loss[loss=0.2374, simple_loss=0.2988, pruned_loss=0.088, over 21179.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3074, pruned_loss=0.08081, over 4274576.58 frames. ], batch size: 143, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:48:56,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2184546.0, ans=0.0 2023-06-26 08:49:27,252 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-26 08:50:00,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2184726.0, ans=0.1 2023-06-26 08:50:36,118 INFO [train.py:996] (0/4) Epoch 12, batch 28700, loss[loss=0.245, simple_loss=0.3102, pruned_loss=0.08992, over 21607.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3068, pruned_loss=0.08186, over 4273250.84 frames. ], batch size: 230, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:50:37,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.799e+02 8.894e+02 1.316e+03 2.106e+03 4.413e+03, threshold=2.633e+03, percent-clipped=13.0 2023-06-26 08:50:39,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2184846.0, ans=0.0 2023-06-26 08:50:50,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2184846.0, ans=0.125 2023-06-26 08:50:58,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2184906.0, ans=0.2 2023-06-26 08:51:55,271 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.34 vs. limit=10.0 2023-06-26 08:52:10,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2185086.0, ans=0.125 2023-06-26 08:52:24,663 INFO [train.py:996] (0/4) Epoch 12, batch 28750, loss[loss=0.2122, simple_loss=0.2995, pruned_loss=0.06245, over 21629.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3079, pruned_loss=0.08221, over 4278994.93 frames. ], batch size: 263, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:52:53,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2185206.0, ans=0.0 2023-06-26 08:54:04,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2185386.0, ans=0.2 2023-06-26 08:54:13,706 INFO [train.py:996] (0/4) Epoch 12, batch 28800, loss[loss=0.283, simple_loss=0.3542, pruned_loss=0.1059, over 21925.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3116, pruned_loss=0.08273, over 4282751.08 frames. 
], batch size: 372, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 08:54:14,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2185446.0, ans=0.125 2023-06-26 08:54:15,175 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.541e+02 8.442e+02 1.126e+03 1.600e+03 2.728e+03, threshold=2.251e+03, percent-clipped=1.0 2023-06-26 08:55:53,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2185686.0, ans=0.2 2023-06-26 08:56:00,445 INFO [train.py:996] (0/4) Epoch 12, batch 28850, loss[loss=0.2186, simple_loss=0.2929, pruned_loss=0.07216, over 20804.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3132, pruned_loss=0.08406, over 4287421.90 frames. ], batch size: 607, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:57:41,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2185986.0, ans=0.125 2023-06-26 08:57:55,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2186046.0, ans=0.0 2023-06-26 08:57:57,029 INFO [train.py:996] (0/4) Epoch 12, batch 28900, loss[loss=0.2586, simple_loss=0.3365, pruned_loss=0.09035, over 20587.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3166, pruned_loss=0.08553, over 4287183.88 frames. ], batch size: 607, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:58:06,916 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.674e+02 9.203e+02 1.221e+03 1.627e+03 4.762e+03, threshold=2.441e+03, percent-clipped=10.0 2023-06-26 08:58:34,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2186106.0, ans=0.125 2023-06-26 08:59:01,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2186226.0, ans=0.125 2023-06-26 08:59:49,859 INFO [train.py:996] (0/4) Epoch 12, batch 28950, loss[loss=0.2254, simple_loss=0.3287, pruned_loss=0.06108, over 21691.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3176, pruned_loss=0.08481, over 4279915.03 frames. ], batch size: 389, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:59:55,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2186346.0, ans=0.2 2023-06-26 08:59:59,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2186346.0, ans=0.125 2023-06-26 09:00:05,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2186406.0, ans=0.2 2023-06-26 09:00:09,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2186406.0, ans=0.125 2023-06-26 09:00:38,861 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-06-26 09:01:39,457 INFO [train.py:996] (0/4) Epoch 12, batch 29000, loss[loss=0.2688, simple_loss=0.3365, pruned_loss=0.1005, over 21975.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3208, pruned_loss=0.08394, over 4271424.40 frames. 
], batch size: 317, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:01:42,717 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.937e+02 9.875e+02 1.439e+03 2.075e+03 4.824e+03, threshold=2.879e+03, percent-clipped=20.0 2023-06-26 09:01:55,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2186706.0, ans=0.0 2023-06-26 09:02:23,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2186766.0, ans=0.125 2023-06-26 09:02:47,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2186826.0, ans=0.035 2023-06-26 09:03:25,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.27 vs. limit=12.0 2023-06-26 09:03:25,793 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-26 09:03:27,714 INFO [train.py:996] (0/4) Epoch 12, batch 29050, loss[loss=0.2142, simple_loss=0.287, pruned_loss=0.07072, over 21685.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3188, pruned_loss=0.08452, over 4276270.88 frames. ], batch size: 230, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:03:28,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2186946.0, ans=0.125 2023-06-26 09:04:09,517 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-26 09:04:16,492 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-26 09:04:31,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2187066.0, ans=0.1 2023-06-26 09:04:58,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2187186.0, ans=0.95 2023-06-26 09:05:05,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2187186.0, ans=0.125 2023-06-26 09:05:08,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2187186.0, ans=0.0 2023-06-26 09:05:17,253 INFO [train.py:996] (0/4) Epoch 12, batch 29100, loss[loss=0.191, simple_loss=0.2551, pruned_loss=0.06344, over 21759.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3116, pruned_loss=0.08246, over 4275788.37 frames. ], batch size: 124, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:05:20,348 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.010e+02 8.838e+02 1.250e+03 1.736e+03 3.044e+03, threshold=2.501e+03, percent-clipped=1.0 2023-06-26 09:05:45,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2187306.0, ans=0.125 2023-06-26 09:05:58,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.41 vs. 
limit=15.0 2023-06-26 09:06:31,555 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:06:48,238 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:07:03,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2187546.0, ans=0.125 2023-06-26 09:07:04,373 INFO [train.py:996] (0/4) Epoch 12, batch 29150, loss[loss=0.2097, simple_loss=0.2881, pruned_loss=0.06565, over 21222.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3088, pruned_loss=0.0798, over 4270867.92 frames. ], batch size: 159, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:08:51,329 INFO [train.py:996] (0/4) Epoch 12, batch 29200, loss[loss=0.2382, simple_loss=0.2897, pruned_loss=0.09337, over 21400.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3041, pruned_loss=0.07937, over 4273354.02 frames. ], batch size: 508, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:08:54,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.344e+02 9.543e+02 1.197e+03 1.931e+03 4.156e+03, threshold=2.395e+03, percent-clipped=13.0 2023-06-26 09:09:00,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2187846.0, ans=0.0 2023-06-26 09:09:50,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2187966.0, ans=0.95 2023-06-26 09:09:52,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.28 vs. limit=15.0 2023-06-26 09:10:05,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.83 vs. limit=12.0 2023-06-26 09:10:18,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2188026.0, ans=0.125 2023-06-26 09:10:26,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2188086.0, ans=0.0 2023-06-26 09:10:33,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2188086.0, ans=0.125 2023-06-26 09:10:36,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2188086.0, ans=0.125 2023-06-26 09:10:39,270 INFO [train.py:996] (0/4) Epoch 12, batch 29250, loss[loss=0.2221, simple_loss=0.305, pruned_loss=0.06964, over 21548.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3028, pruned_loss=0.07736, over 4264682.71 frames. ], batch size: 195, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:10:52,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2188146.0, ans=0.0 2023-06-26 09:11:05,492 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.76 vs. limit=22.5 2023-06-26 09:12:08,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.65 vs. 
limit=10.0 2023-06-26 09:12:24,385 INFO [train.py:996] (0/4) Epoch 12, batch 29300, loss[loss=0.1759, simple_loss=0.2513, pruned_loss=0.05022, over 15576.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3044, pruned_loss=0.07608, over 4263740.20 frames. ], batch size: 60, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:12:27,561 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.498e+02 8.933e+02 1.350e+03 1.847e+03 4.429e+03, threshold=2.700e+03, percent-clipped=13.0 2023-06-26 09:12:29,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2188446.0, ans=0.2 2023-06-26 09:12:32,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2188446.0, ans=0.125 2023-06-26 09:12:39,834 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:13:41,706 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:14:11,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2188746.0, ans=0.125 2023-06-26 09:14:12,351 INFO [train.py:996] (0/4) Epoch 12, batch 29350, loss[loss=0.2016, simple_loss=0.3027, pruned_loss=0.05025, over 21222.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.301, pruned_loss=0.07586, over 4260552.02 frames. ], batch size: 549, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:15:06,641 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.15 vs. limit=15.0 2023-06-26 09:15:25,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2188866.0, ans=0.125 2023-06-26 09:15:53,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2188986.0, ans=0.2 2023-06-26 09:16:14,821 INFO [train.py:996] (0/4) Epoch 12, batch 29400, loss[loss=0.221, simple_loss=0.3058, pruned_loss=0.0681, over 21687.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3012, pruned_loss=0.07374, over 4262512.68 frames. ], batch size: 391, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:16:18,023 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.568e+02 9.109e+02 1.324e+03 1.907e+03 3.871e+03, threshold=2.647e+03, percent-clipped=5.0 2023-06-26 09:17:16,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2189166.0, ans=0.125 2023-06-26 09:17:18,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2189166.0, ans=0.04949747468305833 2023-06-26 09:18:07,020 INFO [train.py:996] (0/4) Epoch 12, batch 29450, loss[loss=0.2888, simple_loss=0.3565, pruned_loss=0.1105, over 21438.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3003, pruned_loss=0.07348, over 4273497.94 frames. 
], batch size: 471, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:18:18,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2189346.0, ans=0.2 2023-06-26 09:19:01,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2189466.0, ans=0.2 2023-06-26 09:19:01,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2189466.0, ans=0.05 2023-06-26 09:19:09,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2189526.0, ans=0.0 2023-06-26 09:19:15,000 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:19:25,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2189526.0, ans=0.125 2023-06-26 09:19:48,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2189586.0, ans=0.1 2023-06-26 09:19:48,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2189586.0, ans=0.125 2023-06-26 09:19:55,300 INFO [train.py:996] (0/4) Epoch 12, batch 29500, loss[loss=0.2348, simple_loss=0.2972, pruned_loss=0.08621, over 21856.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3046, pruned_loss=0.0769, over 4279323.80 frames. ], batch size: 107, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:20:06,879 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.147e+02 8.502e+02 1.287e+03 1.918e+03 4.991e+03, threshold=2.573e+03, percent-clipped=9.0 2023-06-26 09:20:26,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2189706.0, ans=0.1 2023-06-26 09:21:28,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-26 09:21:31,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2189886.0, ans=0.0 2023-06-26 09:21:45,271 INFO [train.py:996] (0/4) Epoch 12, batch 29550, loss[loss=0.2217, simple_loss=0.2957, pruned_loss=0.07381, over 21918.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3035, pruned_loss=0.07815, over 4282526.54 frames. ], batch size: 113, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:22:15,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=2190006.0, ans=0.02 2023-06-26 09:22:55,107 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=12.0 2023-06-26 09:22:56,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2190126.0, ans=0.125 2023-06-26 09:23:49,838 INFO [train.py:996] (0/4) Epoch 12, batch 29600, loss[loss=0.2835, simple_loss=0.3649, pruned_loss=0.101, over 21720.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3102, pruned_loss=0.08071, over 4282643.06 frames. 
], batch size: 298, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:23:50,545 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:23:55,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.495e+02 9.952e+02 1.397e+03 2.179e+03 6.553e+03, threshold=2.795e+03, percent-clipped=14.0 2023-06-26 09:24:00,650 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:24:15,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2190306.0, ans=0.125 2023-06-26 09:24:26,798 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-26 09:24:36,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=2190366.0, ans=0.02 2023-06-26 09:25:10,858 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.24 vs. limit=22.5 2023-06-26 09:25:35,915 INFO [train.py:996] (0/4) Epoch 12, batch 29650, loss[loss=0.2489, simple_loss=0.3164, pruned_loss=0.09068, over 21721.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3084, pruned_loss=0.07722, over 4279185.93 frames. ], batch size: 389, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:26:02,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=22.5 2023-06-26 09:27:22,333 INFO [train.py:996] (0/4) Epoch 12, batch 29700, loss[loss=0.2656, simple_loss=0.3758, pruned_loss=0.07767, over 21725.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3101, pruned_loss=0.0777, over 4281071.42 frames. ], batch size: 298, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:27:27,214 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.270e+02 1.083e+03 1.930e+03 2.745e+03 6.322e+03, threshold=3.861e+03, percent-clipped=21.0 2023-06-26 09:27:27,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2190846.0, ans=0.0 2023-06-26 09:27:46,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2190906.0, ans=10.0 2023-06-26 09:28:00,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2190966.0, ans=0.125 2023-06-26 09:29:04,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-26 09:29:06,921 INFO [train.py:996] (0/4) Epoch 12, batch 29750, loss[loss=0.2253, simple_loss=0.3002, pruned_loss=0.07519, over 21906.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3164, pruned_loss=0.07763, over 4282122.18 frames. 
], batch size: 118, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:29:44,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2191266.0, ans=0.125 2023-06-26 09:30:19,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2191326.0, ans=0.125 2023-06-26 09:30:26,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=22.5 2023-06-26 09:30:50,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2191386.0, ans=0.125 2023-06-26 09:30:53,161 INFO [train.py:996] (0/4) Epoch 12, batch 29800, loss[loss=0.232, simple_loss=0.3031, pruned_loss=0.08041, over 21404.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3161, pruned_loss=0.0778, over 4291161.45 frames. ], batch size: 194, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:30:58,549 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.420e+02 9.303e+02 1.235e+03 1.702e+03 3.129e+03, threshold=2.469e+03, percent-clipped=0.0 2023-06-26 09:31:30,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2191566.0, ans=0.1 2023-06-26 09:32:16,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=22.5 2023-06-26 09:32:33,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2191686.0, ans=0.125 2023-06-26 09:32:37,764 INFO [train.py:996] (0/4) Epoch 12, batch 29850, loss[loss=0.2414, simple_loss=0.3125, pruned_loss=0.08519, over 21845.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3119, pruned_loss=0.07569, over 4283313.52 frames. ], batch size: 414, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:32:48,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2191746.0, ans=0.5 2023-06-26 09:34:29,924 INFO [train.py:996] (0/4) Epoch 12, batch 29900, loss[loss=0.2545, simple_loss=0.3271, pruned_loss=0.09093, over 21379.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3092, pruned_loss=0.07655, over 4291456.59 frames. 
], batch size: 131, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:34:35,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.187e+02 8.891e+02 1.162e+03 1.804e+03 4.059e+03, threshold=2.324e+03, percent-clipped=12.0 2023-06-26 09:34:45,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2192106.0, ans=0.0 2023-06-26 09:34:48,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2192106.0, ans=0.125 2023-06-26 09:34:52,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2192106.0, ans=0.1 2023-06-26 09:34:55,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2192106.0, ans=0.125 2023-06-26 09:35:27,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2192166.0, ans=0.125 2023-06-26 09:35:45,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2192226.0, ans=0.125 2023-06-26 09:36:19,870 INFO [train.py:996] (0/4) Epoch 12, batch 29950, loss[loss=0.2635, simple_loss=0.3494, pruned_loss=0.08876, over 21830.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3118, pruned_loss=0.07965, over 4283581.66 frames. ], batch size: 124, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:36:25,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2192346.0, ans=0.025 2023-06-26 09:36:26,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=2192346.0, ans=6.0 2023-06-26 09:37:26,429 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.42 vs. limit=22.5 2023-06-26 09:37:36,863 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.39 vs. limit=10.0 2023-06-26 09:37:40,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-26 09:37:48,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2192526.0, ans=0.0 2023-06-26 09:37:51,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2192586.0, ans=0.2 2023-06-26 09:38:08,264 INFO [train.py:996] (0/4) Epoch 12, batch 30000, loss[loss=0.1975, simple_loss=0.2961, pruned_loss=0.04944, over 21756.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3124, pruned_loss=0.07923, over 4278317.50 frames. ], batch size: 332, lr: 2.37e-03, grad_scale: 32.0 2023-06-26 09:38:08,265 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 09:38:26,471 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2465, simple_loss=0.3441, pruned_loss=0.07444, over 1796401.00 frames. 
2023-06-26 09:38:26,472 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 24431MB 2023-06-26 09:38:37,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.580e+02 9.008e+02 1.239e+03 1.696e+03 3.151e+03, threshold=2.479e+03, percent-clipped=8.0 2023-06-26 09:38:54,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2192706.0, ans=0.125 2023-06-26 09:39:16,159 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=15.0 2023-06-26 09:39:28,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2192766.0, ans=0.125 2023-06-26 09:39:32,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2192766.0, ans=0.125 2023-06-26 09:40:32,847 INFO [train.py:996] (0/4) Epoch 12, batch 30050, loss[loss=0.2317, simple_loss=0.3182, pruned_loss=0.0726, over 21410.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3145, pruned_loss=0.07606, over 4271898.44 frames. ], batch size: 194, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:41:35,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2193066.0, ans=0.125 2023-06-26 09:42:30,783 INFO [train.py:996] (0/4) Epoch 12, batch 30100, loss[loss=0.258, simple_loss=0.3081, pruned_loss=0.1039, over 21325.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3149, pruned_loss=0.07655, over 4265615.27 frames. ], batch size: 507, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:42:32,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2193246.0, ans=0.1 2023-06-26 09:42:42,348 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.763e+02 1.437e+03 2.326e+03 3.330e+03 7.267e+03, threshold=4.652e+03, percent-clipped=46.0 2023-06-26 09:42:51,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2193306.0, ans=0.2 2023-06-26 09:43:00,551 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-26 09:43:08,523 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:43:24,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2193366.0, ans=0.0 2023-06-26 09:43:27,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2193426.0, ans=0.125 2023-06-26 09:44:19,989 INFO [train.py:996] (0/4) Epoch 12, batch 30150, loss[loss=0.2551, simple_loss=0.3235, pruned_loss=0.09336, over 21412.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3117, pruned_loss=0.07866, over 4265586.27 frames. 
], batch size: 471, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:44:54,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=2193606.0, ans=10.0 2023-06-26 09:44:55,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2193666.0, ans=0.125 2023-06-26 09:45:06,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2193666.0, ans=0.0 2023-06-26 09:45:11,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=22.5 2023-06-26 09:46:02,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2193786.0, ans=0.04949747468305833 2023-06-26 09:46:05,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2193846.0, ans=0.0 2023-06-26 09:46:06,644 INFO [train.py:996] (0/4) Epoch 12, batch 30200, loss[loss=0.2353, simple_loss=0.3349, pruned_loss=0.06786, over 21611.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3136, pruned_loss=0.07761, over 4269156.60 frames. ], batch size: 389, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:46:15,739 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.538e+02 9.559e+02 1.334e+03 1.956e+03 4.030e+03, threshold=2.668e+03, percent-clipped=0.0 2023-06-26 09:46:18,609 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2023-06-26 09:46:26,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2193906.0, ans=0.125 2023-06-26 09:47:21,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=2194026.0, ans=0.5 2023-06-26 09:47:23,892 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-26 09:47:30,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2194026.0, ans=0.0 2023-06-26 09:47:51,665 INFO [train.py:996] (0/4) Epoch 12, batch 30250, loss[loss=0.2935, simple_loss=0.3936, pruned_loss=0.09674, over 21896.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3211, pruned_loss=0.07994, over 4272822.41 frames. ], batch size: 317, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:48:20,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=2194206.0, ans=15.0 2023-06-26 09:48:40,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2194206.0, ans=0.125 2023-06-26 09:49:23,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2194386.0, ans=0.0 2023-06-26 09:49:26,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2194386.0, ans=0.0 2023-06-26 09:49:39,823 INFO [train.py:996] (0/4) Epoch 12, batch 30300, loss[loss=0.2131, simple_loss=0.278, pruned_loss=0.07404, over 21791.00 frames. 
], tot_loss[loss=0.2391, simple_loss=0.3184, pruned_loss=0.07984, over 4265941.19 frames. ], batch size: 317, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:49:48,245 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.277e+02 8.599e+02 1.166e+03 1.867e+03 3.678e+03, threshold=2.332e+03, percent-clipped=6.0 2023-06-26 09:49:59,872 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.73 vs. limit=22.5 2023-06-26 09:51:34,377 INFO [train.py:996] (0/4) Epoch 12, batch 30350, loss[loss=0.2879, simple_loss=0.382, pruned_loss=0.0969, over 21847.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3197, pruned_loss=0.08093, over 4272280.77 frames. ], batch size: 317, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:51:43,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2194746.0, ans=0.0 2023-06-26 09:52:03,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2194806.0, ans=0.1 2023-06-26 09:52:28,027 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=7.877e-03 2023-06-26 09:52:46,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=2194926.0, ans=0.02 2023-06-26 09:52:56,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2194986.0, ans=0.125 2023-06-26 09:52:57,504 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-26 09:53:10,547 INFO [train.py:996] (0/4) Epoch 12, batch 30400, loss[loss=0.2227, simple_loss=0.273, pruned_loss=0.08622, over 20335.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3149, pruned_loss=0.08011, over 4259852.14 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:53:18,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.851e+02 1.020e+03 1.381e+03 2.107e+03 4.576e+03, threshold=2.761e+03, percent-clipped=19.0 2023-06-26 09:53:42,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2195166.0, ans=0.125 2023-06-26 09:53:57,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2195166.0, ans=0.0 2023-06-26 09:54:01,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2195226.0, ans=0.125 2023-06-26 09:54:39,905 INFO [train.py:996] (0/4) Epoch 12, batch 30450, loss[loss=0.3037, simple_loss=0.4236, pruned_loss=0.09197, over 19804.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3162, pruned_loss=0.07923, over 4200746.25 frames. ], batch size: 702, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:54:51,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2195346.0, ans=0.2 2023-06-26 09:54:59,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2195406.0, ans=0.2 2023-06-26 09:55:52,911 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small_causal/epoch-12.pt 2023-06-26 09:55:54,087 INFO [train.py:1249] (0/4) Done!
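A few of the record types above reward a closer look. The [train.py:996] records report a per-batch loss and a running tot_loss, each expressed "over N frames", and the frame counts attached to tot_loss are fractional (e.g. "over 4265941.19 frames"), which is consistent with a decayed, frame-weighted accumulator rather than a plain sum. The sketch below shows one plausible way to maintain such an accumulator; the decay constant and the exact bookkeeping are illustrative assumptions, not values recovered from this log.

    class RunningLoss:
        """Frame-weighted running totals of the loss terms printed by the
        [train.py:996] records, with an exponential decay so that old batches
        gradually drop out. The decay constant is an illustrative assumption."""

        def __init__(self, decay: float = 0.999):
            self.decay = decay
            self.frames = 0.0
            self.sums = {}

        def update(self, frames: float, **loss_sums: float) -> None:
            # loss_sums are per-batch *sums* (per-frame loss times frame count).
            self.frames = self.decay * self.frames + frames
            for name, value in loss_sums.items():
                self.sums[name] = self.decay * self.sums.get(name, 0.0) + float(value)

        def normalized(self) -> dict:
            # Per-frame values, matching the "loss=..., over N frames" style.
            return {k: v / max(self.frames, 1.0) for k, v in self.sums.items()}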
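The [optim.py:471] records print a clipping scale, five grad-norm quartiles, a threshold, and a percent-clipped figure. In every record above the threshold equals (up to rounding) Clipping_scale times the middle quartile, e.g. 2.700e+03 = 2.0 x 1.350e+03 and 2.324e+03 = 2.0 x 1.162e+03, so the clipping threshold is evidently derived from the median of recently observed gradient norms. A minimal sketch of that rule follows; the window length and the way the clip is applied in place are assumptions, only the threshold-from-median relationship is read off the log.

    from collections import deque
    import torch

    class MedianGradClipper:
        """Clip gradients against clipping_scale times the median of recently
        observed gradient norms, the relationship visible in the [optim.py:471]
        records. Window length and in-place rescaling are assumptions."""

        def __init__(self, clipping_scale: float = 2.0, window: int = 1000):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)

        def step(self, parameters) -> float:
            grads = [p.grad.detach() for p in parameters if p.grad is not None]
            if not grads:
                return 0.0
            total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2).item()
            self.norms.append(total_norm)

            # Threshold is a multiple of the median norm over the recent window.
            threshold = self.clipping_scale * torch.tensor(list(self.norms)).median().item()
            if total_norm > threshold:
                scale = threshold / (total_norm + 1e-6)
                for g in grads:
                    g.mul_(scale)  # rescales p.grad in place
            return threshold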
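The [scaling.py:182] records each name a sub-module parameter (dropout probabilities, skip rates, balancer limits), the current batch_count, and its value "ans"; the same parameter takes different values at different batch counts, so these are hyperparameters scheduled on training progress rather than constants. Below is a small, self-contained sketch of one way such a schedule can be expressed as a piecewise-linear function of batch count; the breakpoints are illustrative assumptions, the real schedules live in the model code, not in this log.

    import bisect

    class ScheduledValue:
        """A float hyperparameter that follows a piecewise-linear schedule in
        the training batch count, in the spirit of the values printed by the
        [scaling.py:182] records. Breakpoints here are illustrative only."""

        def __init__(self, *points):
            # points: (batch_count, value) pairs.
            self.points = sorted(points)

        def __call__(self, batch_count: float) -> float:
            xs = [x for x, _ in self.points]
            ys = [y for _, y in self.points]
            if batch_count <= xs[0]:
                return ys[0]
            if batch_count >= xs[-1]:
                return ys[-1]
            i = bisect.bisect_right(xs, batch_count)
            x0, x1 = xs[i - 1], xs[i]
            y0, y1 = ys[i - 1], ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # For example, a dropout probability annealed from 0.3 to 0.1 over training:
    dropout_p = ScheduledValue((0.0, 0.3), (20000.0, 0.1))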
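The run closes with the pattern visible at batch 30000 and at the very end: a validation pass over the dev set ("Computing validation loss", then a validation record over 1796401.00 frames) followed, after the last training batch, by writing zipformer/exp_L_small_causal/epoch-12.pt and logging "Done!". The sketch below mirrors just those two steps; compute_loss, the dataloader interface, and the checkpoint payload are hypothetical placeholders, only the epoch-<n>.pt naming follows the log.

    import torch

    def validate_and_checkpoint(model, valid_loader, compute_loss, exp_dir, epoch):
        """Validation pass followed by an epoch checkpoint, mirroring the final
        log records. compute_loss is assumed to return (summed loss, frame
        count) for one batch; the checkpoint contents are an assumption."""
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_loader:
                loss, frames = compute_loss(model, batch)
                tot_loss += float(loss)
                tot_frames += frames
        model.train()

        torch.save({"model": model.state_dict(), "epoch": epoch},
                   f"{exp_dir}/epoch-{epoch}.pt")
        # Per-frame validation loss, comparable to the logged validation records.
        return tot_loss / max(tot_frames, 1.0)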